**Meet the editor**

Tsukasa Hokimoto is an associate professor of Statistical Science and Data Science at Hokkaido Information University, Japan. His main research interests include the development of statistical methodologies for data analysis and their application to the analysis of natural phenomena in the fields of oceanography, meteorology, and biology. His recent research includes the development of statistical models for characterizing directional time series data, based on the theory of nonhomogeneous hidden Markov models.

## Contents

**Preface XI**

Chapter 8 **Distributions and Composite Models for Size-Type Data 159**
Yves Dominicy and Corinne Sinner

**Section 3 Applications of Data Analysis in Finance and Economics 185**

Chapter 9 **Modelling Limit Order Book Volume Covariance Structures 187**
Andrija Mihoci

Chapter 10 **A Practical Approach to Evaluating the Economic and Technical Feasibility of LED Luminaires 201**
Sean Schmidt and Suzanna Long

**Section 4 Applications of Data Analysis in Medicine 215**

Chapter 11 **Validation of Instrument Measuring Continuous Variable in Medicine 217**
Rafdzah Zaki

Chapter 12 **On Decoding Brain Electrocorticography Data for Volitional Movement Intention Prediction: Theory and On-Chip Implementation 239**
Mradul Agrawal, Sandeep Vidyashankar and Ke Huang

Chapter 13 **The Usage of Statistical Learning Methods on Wearable Devices and a Case Study: Activity Recognition on Smartwatches 259**
Serkan Balli and Ensar Arif Sağbas

Chapter 14 **A Statistic Method for Anatomical and Evolutionary Analysis 279**
Roqueline Ametila e Gloria Martins de Freitas Aversi-Ferreira, Hisao Nishijo and Tales Alexandre Aversi-Ferreira

**Section 5 New Approaches for Teaching Statistics and Data Analysis 293**

Chapter 15 **Descriptive and Inferential Statistics in Undergraduate Data Science Research Projects 295**
Malcolm J. D'Souza, Edward A. Brandenburg, Derald E. Wentzien, Riza C. Bautista, Agashi P. Nwogbaga, Rebecca G. Miller and Paul E. Olsen

## Preface

In recent years, techniques and methods for analyzing data have advanced rapidly in a wide range of research areas, such as technology, economics, biology, medicine, pharmacy, oceanography, and sports science. Theoretical contributions, building on the foundations of statistics and probability, suggest new methods for analyzing data. At the same time, complementary computational methods of data analysis have developed rapidly over the past two or three decades. Thus, technologies for data analysis have advanced significantly from both theoretical and computational standpoints.

At the same time, exponential improvements in sensing technologies, computer processing power, and data storage mean that new datasets tend to be more voluminous and varied in structure. Such datasets often cannot be analyzed using conventional data analysis methods. As a result, researchers and practitioners may lack suitable methods to accurately estimate and evaluate the phenomena they face. For these reasons, the need to develop new theoretical approaches and advanced methodologies for analyzing data will continue.

Recent advances in data analysis techniques are reported in the literature of many research fields. However, owing to differences in the characteristics of datasets typical of different research fields, awareness of particular statistical and probabilistic approaches may differ across fields. This suggests that researchers and practitioners may gain valuable new insights into challenging data analysis problems by reviewing findings from a broader range of research areas. This book aims to meet this need for cross-disciplinary input by presenting recent data analysis techniques and applications drawn from a variety of research fields.

The chapters in this book are organized in five parts. Parts 1 and 2 bring together theoretical contributions and methodological developments in statistics and probabilistic data analysis. The remaining Parts 3, 4, and 5 deal with particular applications of these advanced analytical methods in different fields of study.

Part 1 includes chapters that deal with the theory and development of advanced statistical or probabilistic methods. Chapters in this part describe the theory and methods of analytical approaches including novel time series models, new probability models, criteria for model selection, and data smoothing. Illustrative examples demonstrate applications of these methods.

Part 2 contains chapters addressing new probability distribution functions and their use for data analysis. Although each of the probability distributions or models presented here was developed to address a particular problem, all these techniques can be applied more broadly.

In Part 3, the chapters focus on the analysis of financial and economic data. Here, authors demonstrate new models to analyze financial or economic activities. Topics in this part include the statistical analysis of stock prices and limit order book volumes, as well as an economic analysis of the feasibility of LED street lighting.

In Part 4, chapters focus on recent data analysis research in medicine and anatomy. To prevent serious medical errors, the first chapter presents a comparative study of the various statistical methods used in medicine. Next follows a chapter on the development of statistical learning systems for the human brain and human activity. The last chapter describes the development of a statistical method for anatomical analysis.

Lastly, Part 5 includes a chapter on statistical education. In recent years, many universities have introduced new programs and educational curricula aimed at fostering the capacity for critical thinking about phenomena, based on data analysis. This chapter describes such an initiative for undergraduate students at one university.

With the ongoing progress of information technology, we anticipate that the techniques and methods of statistical analysis and reasoning will continue to evolve and be in high demand. I hope that researchers and practitioners of data analysis will benefit from the new statistical methods and perspectives presented throughout this book.

> **Tsukasa Hokimoto**
> Associate Professor of Statistical Science and Data Science
> Hokkaido Information University, Japan



## **Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference: Revisiting Stein's Paradox and Admissibility**

Aris Spanos

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/65720

#### Abstract

The primary objective of this paper is to make a case that R.A. Fisher's objections to the decision-theoretic framing of frequentist inference are not without merit. It is argued that this framing is congruent with the Bayesian but incongruent with the frequentist approach; it provides the former with a theory of optimal inference but misrepresents the optimality theory of the latter. Decision-theoretic and Bayesian rules are considered optimal when they minimize the expected loss "for all possible values of $\theta$ in $\Theta$" [$\forall \theta \in \Theta$], irrespective of what the true value $\theta^*$ [state of Nature] happens to be; the value that gave rise to the data. In contrast, the theory of optimal frequentist inference is framed entirely in terms of the capacity of the procedure to pinpoint $\theta^*$. The inappropriateness of the quantifier $\forall \theta \in \Theta$ calls into question the relevance of admissibility as a minimal property for frequentist estimators. As a result, the pertinence of Stein's paradox, as it relates to the capacity of frequentist estimators to pinpoint $\theta^*$, needs to be reassessed. The paper also contrasts loss-based errors with traditional frequentist errors, arguing that the former are attached to $\theta$, but the latter to the inference procedure itself.

**Keywords:** decision-theoretic inference, Bayesian vs. frequentist inference, Stein's paradox, James-Stein estimator, loss functions, admissibility, error probabilities, risk functions, complete class theorem

## 1. Introduction

Wald's [1] decision-theoretic framework is widely viewed as providing a broad enough perspective to accommodate and compare the frequentist and Bayesian approaches to inference, despite their well-known differences. It is perceived as offering a neutral framing of inference that brings into focus their common features and tones down their differences; see Refs. [2–4].


Historically, Wald [5] proposed the original variant of the decision-theoretic framework with a view to unifying Neyman's [6] rendering of frequentist interval estimation and testing:

"The problem in this formulation is very general. It contains the problems of testing hypotheses and of statistical estimation treated in the literature." (p. 340)

Among the frequentist pioneers, Jerzy Neyman enthusiastically accepted this broader perspective, primarily because the concepts of decision rules and action spaces seemed to provide a better framing for his behavioristic interpretation of Neyman-Pearson (N-P) testing based on the accept/reject rules; see Refs. [7, 8]. Neyman's attitude towards Wald's [1] framing was also adopted wholeheartedly by some of his most influential students/colleagues at Berkeley, including [9, 10]. In a foreword to a collection of Neyman's early papers, his students/editors described Wald's framing as ([11], p. vii):

 "A natural but far reaching extension of their [N-P formulation] scope can be found in Abra-ham Wald's theory of statistical decision functions."

 At the other end of the argument, Fisher [12] rejected Wald's framing on the grounds that it seriously distorts his rendering of frequentist statistics:

 "The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to "decisions" in Wald's sense, originated in several misapprehensions and has led, apparently, to several more." (p. 69)

 With a few exceptions, such as Refs. [13–15], Fisher's [12] viewpoint has been inadequately discussed and evaluated by the subsequent statistics literature. The primary aim of this paper is to revisit Fisher's minority view by taking a closer look at the decision-theoretic framework with a view to reevaluate the claim that it provides a neutral framework for comparing the frequentist and Bayesian approaches. It is argued that Fisher's view that the decision theoretic framing is germane to "acceptance sampling," but misrepresents frequentist inference, is not without merit. The key argument of the discussion that follows is that the decision-theoretic notions of loss function and admissibility are congruent with the Bayesian approach, but incongruent with both the primary objective and the underlying reasoning of the frequentist approach.

Section 2 introduces the basic elements of the decision-theoretic set-up with a view to bringing out its links to the Bayesian and frequentist approaches, calling into question the conventional wisdom concerning its neutrality. Section 3 takes a closer look at the Bayesian approach and argues that had the decision-theoretic apparatus not existed, Bayesians would have been forced to invent it in order to establish a theory of optimal Bayesian inference. Section 4 discusses critically the notions of loss functions and admissibility, focusing primarily on their role in giving rise to Stein's paradox and their incompatibility with the frequentist approach. It is argued that the frequentist dimension of the notions of a loss function and admissibility is more apparent than real. Section 5 makes a case that the decision-theoretic framework misrepresents both the primary objective and the underlying reasoning of the frequentist approach. Section 6 revisits the notion of a loss function and its dependence on "information other than the data." It is argued that loss-based errors are both different from and incompatible with the traditional frequentist errors because they are attached to the unknown parameters instead of the inference procedures themselves, as the traditional frequentist errors (Type I, II and coverage) are.

## 2. The decision-theoretic set-up


## 2.1. Basic elements of the decision-theoretic framing

The current decision-theoretic set-up has three basic elements:

1. A prespecified (parametric) statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, generically specified by

$$\mathcal{M}\_{\theta}(\mathbf{x}) = [f(\mathbf{x}; \theta), \,\theta \in \Theta], \,\mathbf{x} \in \mathbb{R}\_{X}^{n}, \text{ for } \ \theta \in \Theta \subset \mathbb{R}^{m}, \, m \ll n,\tag{1}$$

where $f(\mathbf{x}; \theta)$ denotes the (joint) distribution of the sample $\mathbf{X} := (X_1, \dots, X_n)$, $\mathbb{R}_X^n$ denotes the sample space and $\Theta$ the parameter space. This model represents the stochastic mechanism assumed to have given rise to data $\mathbf{x}_0 := (x_1, \dots, x_n)$.

2. A decision space $D$ containing all mappings $d(\cdot): \mathbb{R}_X^n \to A$, where $A$ denotes the set of all actions available to the statistician.

3. A loss function $L(\cdot, \cdot): [D \times \Theta] \to \mathbb{R}$, representing the numerical loss if the statistician takes action $a \in A$ when the state of Nature is $\theta \in \Theta$; see Refs. [2, 16–18].

The basic idea is that when the decision-maker selects action $a$, he/she does not know the "true" state of Nature, represented by $\theta^*$. However, contingent on each action $a \in A$, the decision maker "knows" the losses (gains and utilities) resulting from different choices $(d, \theta) \in [D \times \Theta]$. The decision maker observes data $\mathbf{x}_0$, which provides some information about $\theta^*$, and then maps each $\mathbf{x} \in \mathbb{R}_X^n$ to a certain action $a \in A$ guided solely by $L(d, \theta)$.
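To make these three elements concrete, the following minimal Python sketch (added for illustration; it is not part of the chapter, and the Normal model, the sample-mean rule, and the squared-error loss are merely convenient choices) instantiates a statistical model, a decision rule, and a loss function.

```python
# Illustrative sketch (not from the chapter): the three decision-theoretic
# elements for a simple Normal model X_k ~ N(theta, 1), k = 1, ..., n.
import numpy as np

rng = np.random.default_rng(0)
n = 50
theta_star = 2.0   # the "true" state of Nature, unknown to the statistician

# 1. Statistical model M_theta(x): generates data x0 as a realization of the sample X.
x0 = rng.normal(loc=theta_star, scale=1.0, size=n)

# 2. A decision rule d: R^n -> A, here the sample mean used as a point estimator.
def d(x):
    return x.mean()

# 3. A loss function L(d(x), theta), here squared error.
def loss(action, theta):
    return (action - theta) ** 2

print("action a = d(x0):", d(x0))
print("loss incurred at the true theta*:", loss(d(x0), theta_star))
```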

#### 2.2. The original Wald framing

It is important to bring out the fact that the original Wald [5] framing was much narrower than the above basic elements 2 and 3, due to its original objective to formalize the Neyman-Pearson (N-P) approach; see [19]. What were the key differences?


$$L\_{0-\varepsilon}(\boldsymbol{\theta}, \widehat{\boldsymbol{\theta}}(\mathbf{X})) = \begin{cases} 0 \text{ if } \widehat{\boldsymbol{\theta}}(\mathbf{X}) = \boldsymbol{\theta}^\* \\ \boldsymbol{c}\_{\boldsymbol{\theta}} > 0 \text{ if } \widehat{\boldsymbol{\theta}}(\mathbf{X}) = \boldsymbol{\theta} \neq \boldsymbol{\theta}^\*, \boldsymbol{\theta} \in \boldsymbol{\Theta}, \end{cases} \tag{2}$$

where $\theta^*$ is the true value of $\theta$ in $\Theta$. For the discussion that follows, it is important to note that Eq. (2) is nonoperational in practice because $\theta^*$ is unknown.

The more general framing, introduced by Wald ([1, 20]) and broadened by Le Cam [21], extended the scope of the original set-up by generalizing the notions of loss functions and decision spaces. In what follows it is argued that these extensions created serious incompatibilities with both the objective and the underlying reasoning of frequentist inference.

In addition, it is both of historical and methodological interest to note that Wald [5] introduced the notion of a prior distribution, $\pi(\theta)$, $\forall \theta \in \Theta$, into the original decision-theoretic machinery reluctantly, and justified it as being a useful tool for proving certain technical results:

"The situation regarding the introduction of an a priori probability distribution of θ is entirely different. First, the objection can be made against it, as Neyman has pointed out, that θ is merely an unknown constant and not a variate, hence it makes no sense to speak of the probability distribution of θ. Second, even if we may assume that θ is a variate, we have in general no possibility of determining the distribution of θ and any assumptions regarding this distribution are of hypothetical character. The reason why we introduce here a hypothetical 10 probability distribution of θ is simply that it proves to be useful in deducing certain theorems and in the calculation of the best system of regions of acceptance." (p. 302)

## 2.3. A shared neutral framework?

The frequentist, Bayesian, and decision-theoretic approaches share the notion of a statistical model by viewing data $\mathbf{x}_0 := (x_1, \dots, x_n)$ as a realization of a sample $\mathbf{X} := (X_1, \dots, X_n)$ from Eq. (1).

The key differences between the three approaches are as follows:

a. The frequentist approach relies exclusively on $\mathcal{M}_{\theta}(\mathbf{x})$.

b. The Bayesian approach adds a prior distribution, $\pi(\theta)$, $\forall \theta \in \Theta$ (for all $\theta \in \Theta$).

c. The decision-theoretic framing revolves around a loss (gain or utility) function:


$$L(d(\mathbf{x}), \theta), \forall \theta \in \Theta, \forall \mathbf{x} \in \mathbb{R}\_X^n. \tag{3}$$

The loss function is often assumed to be an even, differentiable and convex function of $(d(\mathbf{x}) - \theta)$ and can take numerous functional forms; see Refs. [17, 18] inter alia.

The claim that the decision-theoretic perspective provides a neutral ground is often justified [3] on account of the loss function being a function of the sample and parameter spaces through the two universal quantifiers:

24 (i) "∀x ∈ R<sup>n</sup> <sup>X</sup>," associated with the distribution of the sample:

frequentist : <sup>f</sup>ðx; <sup>θ</sup>Þ; <sup>∀</sup><sup>x</sup> <sup>∈</sup> <sup>R</sup><sup>n</sup> <sup>X</sup>; ð4Þ

25 (ii)"∀θ∈ Θ" associated with the posterior distribution:

$$\mathbf{BAoyesian}: \quad \pi(\theta|\mathbf{x}\_0) = \frac{\pi(\theta) \cdot f(\mathbf{x}\_0|\theta)}{\int\_{\theta \in \Theta} \pi(\theta) \cdot f(\mathbf{x}\_0|\theta) d\theta}, \forall \theta \in \Theta. \tag{5}$$

The idea is that allowing for all values of $\mathbf{x}$ in $\mathbb{R}_X^n$ goes beyond the Bayesian perspective, which relies exclusively on a single point $\mathbf{x}_0$. What is not obvious is whether that is sufficient to do justice to the frequentist approach. A closer scrutiny suggests that frequentist inference is misrepresented by the way both quantifiers are employed in the decision-theoretic framing of inference.

First, the quantifier $\forall \mathbf{x} \in \mathbb{R}_X^n$ plays only a minor role in transforming a loss function, say $L(\theta, \hat{\theta}(\mathbf{x}))$, into a risk function:


$$R(\theta, \hat{\theta}) = E[L(\theta, \hat{\theta}(\mathbf{X}))] = \int_{\mathbf{x} \in \mathbb{R}_X^n} L(\theta, \hat{\theta}(\mathbf{x})) f(\mathbf{x}; \theta) d\mathbf{x}, \ \forall \theta \in \Theta. \tag{6}$$

This is the only place where the distribution of the sample, $f(\mathbf{x}; \theta)$, $\forall \mathbf{x} \in \mathbb{R}_X^n$, enters the decision-theoretic framing, and the only relevant part of the behavior of $\hat{\theta}(\mathbf{X})$ is how it affects the risk function for different values of $\theta$ in $\Theta$. In frequentist inference, however, the distribution of the sample takes center stage for the theory of optimal frequentist inference. It determines the sampling distribution of any statistic $Y_n = g(\mathbf{X})$ (estimator, test, and predictor) through:

$$F(y; \theta) := \mathbb{P}(Y_n \le y; \theta) = \int \cdots \int_{\{\mathbf{x}:\ g(\mathbf{x}) \le y,\ \mathbf{x} \in \mathbb{R}_X^n\}} f(\mathbf{x}; \theta) d\mathbf{x}, \tag{7}$$

and that, in turn, yields the relevant error probabilities that determine optimal inference procedures.
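As a small worked illustration (added here; it is not part of the original text), take the simple Normal model $X_k \sim \text{NIID}(\theta, 1)$, $k = 1, \dots, n$, that reappears in Section 4.1, with the squared-error loss $L_2(\hat{\theta}(\mathbf{X}), \theta) = (\hat{\theta}(\mathbf{X}) - \theta)^2$. For the sample mean $\overline{X}_n$, the risk function in Eq. (6) reduces to the constant

$$R(\theta, \overline{X}_n) = E\big[(\overline{X}_n - \theta)^2\big] = \text{Var}(\overline{X}_n) = \frac{1}{n}, \ \forall \theta \in \Theta,$$

since $\overline{X}_n$ is unbiased; this is the value $1/n$ that reappears as the MSE of the sample mean in Eq. (19).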

Second, the decision-theoretic notion of optimality revolves around the universal quantifier "$\forall \theta \in \Theta$," rendering it congruent with the Bayesian but incongruent with the frequentist approach. To be more specific, since different risk functions often intersect over $\Theta$, an optimal rule is usually selected after the risk function is reduced to a scalar. Two such choices of risk are:

$$\begin{aligned} \textbf{Maximum risk}: & \quad R_{\max}(\hat{\theta}) = \sup_{\theta \in \Theta} R(\theta, \hat{\theta}), \\ \textbf{Bayes risk}: & \quad R_B(\hat{\theta}) = \int_{\theta \in \Theta} R(\theta, \hat{\theta}) \pi(\theta) d\theta. \end{aligned} \tag{8}$$

Hence, an obvious way to choose among different rules is to find the one that minimizes the relevant risk with respect to all possible estimates $\tilde{\theta}(\mathbf{x})$. In the case of Eq. (8), this gives rise to two corresponding decision rules:

$$\begin{aligned} \textbf{Minimax rule}: & \quad \inf_{\hat{\theta}(\mathbf{x})} R_{\max}(\hat{\theta}) = \inf_{\hat{\theta}(\mathbf{x})} \big[\sup_{\theta \in \Theta} R(\theta, \hat{\theta})\big], \\ \textbf{Bayes rule}: & \quad \inf_{\hat{\theta}(\mathbf{x})} R_B(\hat{\theta}) = \inf_{\hat{\theta}(\mathbf{x})} \int_{\theta \in \Theta} R(\theta, \hat{\theta}) \pi(\theta) d\theta. \end{aligned} \tag{9}$$

In this sense, a decision or a Bayes rule $\tilde{\theta}(\mathbf{x})$ will be considered optimal when it minimizes the relevant risk, no matter what the true state of Nature $\theta^*$ happens to be. The last clause, "irrespective of $\theta^*$," constitutes a crucial caveat that is often ignored in discussions of these approaches. When viewed as a game against Nature, the decision maker selects action $a$ from $A$, irrespective of what value $\theta^*$ Nature has chosen. That is, $\theta^*$ plays no role in selecting the optimal rules since the latter have nothing to do with the true value $\theta^*$ of $\theta$. To avoid any misreading of this line of reasoning, it is important to emphasize that "the true value $\theta^*$" is shorthand for saying that "data $\mathbf{x}_0$ constitute a typical realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x}; \theta^*)$"; see Ref. [22].
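To illustrate the point with a familiar case (an added example, not part of the original text), consider the Normal model $X_k \sim \text{NIID}(\theta, \sigma^2)$ with $\sigma^2$ known and an assumed conjugate prior $\theta \sim \text{N}(\mu_0, \tau^2)$. Under squared-error loss, the Bayes rule in Eq. (9) is the posterior mean

$$\tilde{\theta}(\mathbf{x}_0) = E[\theta \mid \mathbf{x}_0] = \frac{\frac{n}{\sigma^2}\overline{x}_n + \frac{1}{\tau^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}},$$

a rule fixed entirely by the prior $(\mu_0, \tau^2)$ and the loss function once the data are in hand; the true value $\theta^*$ plays no role in its selection, exactly as described above.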

This should be contrasted with the notion of optimality in frequentist inference, which gives $\theta^*$ center stage, in the sense that it evaluates the capacity of the inference procedure to inform the modeler about $\theta^*$; no other value is relevant. According to Reid [23]:

"A statistical model is a family of probability distributions [MθðxÞ], the central problem of statistical inference being to identify which member of the family [θ<sup>∗</sup> ] generated the data of interest." (p. 418)

## 3. The Bayesian approach

To shed further light on the affinity between the decision-theoretic framework and the Bayesian approach, let us take a closer look at the latter.

## 3.1. Bayesian inference and its primary objective

A key argument in favor of the Bayesian approach is often its simplicity, in the sense that all forms of inference revolve around a single function, the posterior distribution: $\pi(\theta|\mathbf{x}_0) \propto \pi(\theta) \cdot f(\mathbf{x}_0|\theta)$, $\forall \theta \in \Theta$. Hence, an outsider looking at the Bayesian approach might naturally surmise that its primary objective is to yield "a probabilistic ranking" (ordering) of all values of $\theta$ in $\Theta$. According to O'Hagan [4]:

 "Having obtained the posterior density <sup>π</sup>ðθjx0Þ, the final step of the Bayesian method is to derive from it suitable inference statements. The most usual inference question is this: After seeing the data x0, what do we now know about the parameter θ. The only answer to this question is to present the entire posterior distribution." (p. 6)

The idea is that the modeling begins with an a priori probabilistic ranking based on $\pi(\theta)$, $\forall \theta \in \Theta$, which is revised after observing $\mathbf{x}_0$ to derive $\pi(\theta|\mathbf{x}_0)$, $\forall \theta \in \Theta$; hence the key role of the quantifier $\forall \theta \in \Theta$. O'Hagan [4], echoing earlier views in [24, 25], contrasts the frequentist (classical) inferences with the Bayesian inference, arguing:

 "Classical inference theory is very concerned with constructing good inference rules. The primary concern of Bayesian inference, …, is entirely different. The objective is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries. There are two criteria in this process. The first is to identify interesting features of the posterior distribution. … The second criterion is good communication. Summa- ries should be chosen to convey clearly and succinctly all the features of interest. … In Bayesian terms, therefore, a good inference is one which contributes effectively to appropriating the information about θ which is conveyed by the posterior distribution." (p. 14)

Clearly, O'Hagan's [4] attempt to define what a "good" Bayesian inference is begs the question: what does "effective appropriation of information about θ" mean, beyond the probabilistic ranking? That is, the issue of optimality is inextricably bound up with what the primary objective of Bayesian inference is. If the primary objective of Bayesian inference is not the revised probabilistic ranking, what is it? The answer is that the ranking is only half the story. The other half is concerned with the optimality of Bayesian inference, which cannot be framed exclusively in terms of the posterior distribution. The decision-theoretic perspective provides the Bayesian approach with a theory of optimal inference as well as a primary objective: minimize expected losses for all values of $\theta$ in $\Theta$.

In his attempt to defend his stance that the entire posterior distribution is the inference, O'Hagan [4] argues that criteria for "optimal" Bayesian inferences are only parasitical on the Bayesian approach and enter the picture through the decision-theoretic perspective:

"… a study of decision theory has two potential benefits. First, it provides a link to classical inference. It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation. Second, it helps identify 10 suitable summaries to give Bayesian answers to stylized inference questions which classical theory addresses." (p. 14)

Both of the above-mentioned potential benefits to the Bayesian approach are questionable for two reasons. First, the link between the decision-theoretic and the classical (frequentist) inference is more apparent than real because it is fraught with misleading definitions and unclarities pertaining to the reasoning and objectives of the latter. As argued in the sequel, the quantifier "$\forall \theta \in \Theta$" used to define "optimal" decision-theoretic or Bayes rules is at odds with and misrepresents frequentist inference. Second, the claim concerning Bayesian answers to frequentist questions of interest is misplaced because the former provides no real answers to the frequentist primary question of interest, which pertains to learning about $\theta^*$. An optimal Bayes rule offers very little, if anything, relevant for learning about the value $\theta^*$ that gave rise to $\mathbf{x}_0$. Let us unpack this answer in some more detail.

#### 3.2. Optimality for Bayesian inference


What does minimizing the Bayes risk amount to? Substituting the risk function in Eq. (6) into the Bayes risk in Eq. (8), one can show that:

$$\begin{aligned} R_B(\hat{\theta}) &= \int_{\theta \in \Theta} \left( \int_{\mathbf{x} \in \mathbb{R}_X^n} L(\theta, \hat{\theta}(\mathbf{x})) f(\mathbf{x}; \theta) d\mathbf{x} \right) \pi(\theta) d\theta \\ &= \int_{\mathbf{x} \in \mathbb{R}_X^n} \int_{\theta \in \Theta} L(\theta, \hat{\theta}(\mathbf{x})) f(\mathbf{x}|\theta) \pi(\theta) d\theta \, d\mathbf{x} \\ &= \int_{\mathbf{x} \in \mathbb{R}_X^n} \left\{ \int_{\theta \in \Theta} L(\theta, \hat{\theta}(\mathbf{x})) \pi(\theta|\mathbf{x}) d\theta \right\} m(\mathbf{x}) d\mathbf{x}, \end{aligned} \tag{10}$$

where $m(\mathbf{x}) = \int_{\theta \in \Theta} f(\mathbf{x}; \theta) d\theta$; see Ref. [18]. The second and third equalities presume that one can reverse the order of integration (a technical issue), and treat $f(\mathbf{x}; \theta)$ as the joint distribution of $\mathbf{X}$ and $\theta$ so that the following equalities hold:

$$f(\mathbf{x}; \boldsymbol{\theta}) = f(\mathbf{x}|\boldsymbol{\theta})\pi(\boldsymbol{\theta}) = \pi(\boldsymbol{\theta}|\mathbf{x})m(\mathbf{x}).\tag{11}$$

In this case, these equalities are questionable due to the blurring of the distinction between $\mathbf{x}$, a generic value in $\mathbb{R}_X^n$, and the particular value $\mathbf{x}_0$; see Ref. [26].

In light of Eq. (10), a Bayesian estimate is "optimal" relative to a particular loss function $L(\hat{\theta}(\mathbf{X}), \theta)$ when it minimizes $R_B(\hat{\theta})$, or equivalently $\int_{\theta \in \Theta} L(\theta, \hat{\theta}(\mathbf{x})) \pi(\theta|\mathbf{x}) d\theta$. This makes it clear that what constitutes an "optimal" Bayesian estimate is primarily determined by $L(\hat{\theta}(\mathbf{X}), \theta)$ [27]:

i. When $L_2(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$, the Bayes estimate $\hat{\theta}$ is the mean of $\pi(\theta|\mathbf{x}_0)$.

ii. When $L_1(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$, the Bayes estimate $\hat{\theta}$ is the median of $\pi(\theta|\mathbf{x}_0)$.

iii. When $L_{0-1}(\hat{\theta}, \theta) = \delta(\hat{\theta}, \theta) = \begin{cases} 0 & \text{for } |\hat{\theta} - \theta| < \varepsilon \\ 1 & \text{for } |\hat{\theta} - \theta| \ge \varepsilon \end{cases}$ for $\varepsilon > 0$, the Bayes estimate $\hat{\theta}$ is the mode of $\pi(\theta|\mathbf{x}_0)$.
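A short derivation (added here for completeness; it is standard and not part of the original text) shows why case (i) holds: decomposing the posterior expected squared loss around the posterior mean gives

$$\int_{\theta \in \Theta} (\hat{\theta} - \theta)^2 \pi(\theta|\mathbf{x}) d\theta = \big(\hat{\theta} - E[\theta|\mathbf{x}]\big)^2 + \text{Var}(\theta|\mathbf{x}),$$

which is minimized at $\hat{\theta} = E[\theta|\mathbf{x}]$; analogous arguments yield the median for $L_1$ and the mode for the 0-1 loss as $\varepsilon \to 0$.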


In practice, the most widely used loss function is the square loss:

$$L\_2(\hat{\theta}(\mathbf{X}); \theta) = (\hat{\theta}(\mathbf{X}) - \theta)^2, \forall \theta \in \Theta,\tag{12}$$

whose risk function is the decision-theoretic Mean Square Error (MSE$_1$):

$$R(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \mathrm{E}(\hat{\boldsymbol{\theta}}(\mathbf{X}) - \boldsymbol{\theta})^2 = \mathrm{MSE}\_1(\hat{\boldsymbol{\theta}}(\mathbf{X}); \boldsymbol{\theta}), \forall \boldsymbol{\theta} \in \boldsymbol{\Theta}. \tag{13}$$

Surprisingly, however, this definition of the MSE, denoted by MSE$_1$, is different from the frequentist MSE, which is defined by:

$$MSE(\hat{\theta}\_n(\mathbf{X}); \theta^\*) = E(\hat{\theta}\_n(\mathbf{X}) - \theta^\*)^2. \tag{14}$$

The key difference is that Eq. (14) is defined at the point $\theta = \theta^*$, as opposed to $\forall \theta \in \Theta$. Unfortunately, statistics textbooks adopt one of the two definitions of the MSE, either at $\theta = \theta^*$ or $\forall \theta \in \Theta$, and ignore (or seem unaware of) the other. At first sight, this difference might appear pedantic, but it turns out that it has very serious implications for the relevant theory of optimality for the frequentist vs. Bayesian inference procedures. Indeed, reliance on $\forall \theta \in \Theta$ completely undermines the relevance of admissibility as a minimal property for estimators in frequentist inference.

Admissibility. An estimator $\tilde{\theta}(\mathbf{X})$ is inadmissible if there exists another estimator $\hat{\theta}(\mathbf{X})$ such that:

$$R(\theta, \hat{\theta}) \le R(\theta, \tilde{\theta}), \forall \theta \in \Theta,\tag{15}$$

and the strict inequality (<) holds for at least one value of $\theta$. Otherwise, $\tilde{\theta}(\mathbf{X})$ is said to be admissible with respect to the loss function $L(\theta, \hat{\theta})$.

The objective of minimizing losses weighted by $\pi(\theta|\mathbf{x}_0)$ for all values of $\theta$ in $\Theta$ is in direct contrast to the frequentist primary objective, which is to learn from data about the true value $\theta^*$ underlying the generation of $\mathbf{x}_0$. Hence, the question that naturally arises is: what does an optimal Bayes rule, stemming from Eq. (17), convey about the underlying data generating mechanism in Eq. (1)? It is not obvious why the highest ranked value $\tilde{\theta}(\mathbf{x}_0)$ (mode), or some other feature of the posterior distribution, has any value in pinpointing $\theta^*$, knowing that $\tilde{\theta}(\mathbf{x}_0)$ is selected irrespective of $\theta^*$, the true state of Nature.

#### 3.3. The duality between loss functions and priors

The derivation in Eq. (10) brings out the built-in affinity between the decision-theoretic framing of inference and the Bayesian approach. As shown above, minimizing the Bayes risk:

$$R\_{\mathcal{B}}(\hat{\theta}) = \int\_{\theta \in \Theta} R(\hat{\theta}, \theta) \pi(\theta) d\theta,\tag{16}$$

is equivalent to minimizing the integral:


$$\int\_{\theta \in \Theta} L(\hat{\theta}(\mathbf{X}), \theta) \pi(\theta|\mathbf{x}) d\theta. \tag{17}$$

This result brings out two important features of optimal Bayesian inference.

First, it confirms the minor role played by the quantifier $\forall \mathbf{x} \in \mathbb{R}_X^n$ in both the Bayesian and decision-theoretic optimality theory of inference.

Second, it indicates that $L(\theta, \hat{\theta})$ and $\pi(\theta)$ are perfect substitutes with respect to any weight function $w(\theta) > 0$, $\forall \theta \in \Theta$, in the derivation of Bayes rules. Modifying the loss function or the prior yields the same result:

13 "… the problem of estimating θ with a modified (weighted) loss function is identical to the 14 problem with a simple loss but with modified hyperparameters of the prior distribution while 15 the form of the prior distribution does not change." ([28], p. 522)

This implies that in practice a Bayesian could derive a particular Bayes rule by attaching the weight to the loss function or to the prior distribution, depending on which derivation is easier; see Refs. [18, 28].
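The substitution can be seen directly from an elementary identity (added here; not part of the original text): for any weight function $w(\theta) > 0$,

$$\int_{\theta \in \Theta} \big[w(\theta) L(\theta, \hat{\theta})\big] \pi(\theta) d\theta = \int_{\theta \in \Theta} L(\theta, \hat{\theta}) \big[w(\theta) \pi(\theta)\big] d\theta,$$

so the weight can be attached either to the loss or to the prior; renormalizing $w(\theta)\pi(\theta)$ into a proper prior only rescales the objective by a constant and leaves the minimizing $\hat{\theta}(\mathbf{x})$ unchanged.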

#### 3.4. Revisiting the complete class theorem

The issue of contrasting objectives highlights the key built-in tension between the frequentist and Bayesian approaches to optimality, which in turn undermines several important results, including the complete class theorem, first proved in Ref. [20]:

 "Wald showed that under fairly general conditions the class of Bayes decision functions forms an essentially complete class; in other words, for any decision function that is not Bayesian, there exists one that is Bayes and is at least as good no matter what the true state of Nature may be." ([19], p. 341)

As argued in the sequel, it should come as no surprise to learn that Bayes rules dominate all other rules when admissibility is given center stage. The key result is that a Bayes rule $\hat{\theta}_B(\mathbf{x})$ with respect to a prior distribution $\pi(\theta)$ is:

i. Admissible, under certain regularity conditions, including when $\hat{\theta}_B(\mathbf{x})$ is unique up to equivalence relative to the same risk function $R(\theta, \hat{\theta}_B)$.

ii. Minimax when $R(\theta, \hat{\theta}_B) = c < \infty$.

iii. An admissible, relative to a risk function $R(\theta, \hat{\theta}_B)$, estimate $\hat{\theta}(\mathbf{x})$ is either Bayes $\hat{\theta}_B(\mathbf{x})$ or the limit of a sequence of Bayes rules; see Refs. [2, 17, 28].


Ignoring the contrasting objectives, these results have been interpreted as evidence for the superiority of the Bayesian perspective, and have led to the intimation that an effective way to generate optimal frequentist procedures is to find the Bayes solution using a reasonable prior and then examine its frequentist properties to see whether it is satisfactory from the latter viewpoint; see Refs. [29, 30].

As argued next, even if one were to agree that Bayes rules and admissible estimators largely coincide, the importance of such a result hinges on the relevance of admissibility as a key property for frequentist estimators.

## 4. Loss functions and admissibility revisited

The claim to be discussed in this section is that the notions of a "loss function" and "admissibility" are incompatible with the optimal theory of frequentist estimation as framed by Fisher; see Ref. [31].

#### 4.1. Admissibility as a minimal property

The following example brings out the inappropriateness of admissibility as a minimal property for optimal frequentist estimators.

Example. In the context of the simple Normal model:

$$X\_k \sim \text{NIID}(\theta, 1), k = 1, 2, \dots, n, \text{ for } n > 2,\tag{18}$$

consider the decision-theoretic notion of MSE$_1$ in Eq. (13) to compare two estimators of $\theta$:

i. The maximum likelihood estimator (MLE): $\overline{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k$.

24 ii. The "crystalball" estimator: <sup>θ</sup>cb¼7405926; <sup>∀</sup>x<sup>∈</sup> <sup>R</sup><sup>n</sup> X

When compared on admissibility grounds, both estimators are admissible and thus equally acceptable. Common sense, however, suggests that if a particular criterion of optimality cannot distinguish between $\overline{X}_n$ [a strongly consistent, unbiased, fully efficient and sufficient estimator] and $\theta_{cb}$, an arbitrarily chosen real number that ignores the data altogether, it is not much of a minimal property.

A moment's reflection suggests that the inappropriateness of admissibility stems from its reliance on the quantifier "$\forall \theta \in \Theta$." The admissibility of $\theta_{cb}$ arises from the fact that for certain values of $\theta$ close enough to $\theta_{cb}$, say $\theta \in \left(\theta_{cb} \pm \frac{\lambda}{\sqrt{n}}\right)$, for $0 < \lambda < 1$, $\theta_{cb}$ is "better" than $\overline{X}_n$ on MSE$_1$ grounds:

$$MSE\_1\left(\overline{X}\_n; \theta\right) = \frac{1}{n} > MSE\_1\left(\theta\_{cb}; \theta\right) \le \frac{\lambda^2}{n} \text{ for } \theta \in \left(\theta\_{cb} \pm \frac{\lambda}{\sqrt{n}}\right). \tag{19}$$

Given that the primary objective of a frequentist estimator is to pinpoint $\theta^*$, the result in Eq. (19) seems totally irrelevant as a gauge of its capacity to achieve that!
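A small Monte Carlo sketch (added for illustration; the numbers are arbitrary and the code is not part of the chapter) makes Eq. (19) tangible: $\theta_{cb}$ beats the sample mean on MSE$_1$ grounds only when the true mean happens to lie within roughly $1/\sqrt{n}$ of the arbitrarily chosen constant.

```python
# Illustrative simulation (added; not from the chapter): MSE_1 of the sample
# mean vs. the "crystal ball" estimator theta_cb, cf. Eq. (19), with n = 100.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20_000
theta_cb = 7405926.0

for delta in (0.0, 0.05, 0.5):          # distance of the true theta from theta_cb
    theta_true = theta_cb + delta
    x = rng.normal(theta_true, 1.0, size=(reps, n))
    mse_mean = np.mean((x.mean(axis=1) - theta_true) ** 2)   # approximately 1/n
    mse_cb = (theta_cb - theta_true) ** 2                     # exact; ignores the data
    print(f"theta* - theta_cb = {delta:>4}:  "
          f"MSE_1(mean) ~ {mse_mean:.4f},  MSE_1(cb) = {mse_cb:.4f}")
```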

This example indicates that admissibility is totally ineffective as a minimal property because it does not filter out $\theta_{cb}$, the worst possible estimator! Instead, it excludes potentially good estimators like the sample median; see Ref. [32]. This highlights the "extreme relativism" of admissibility to the particular loss function, $L_2(\hat{\theta}(\mathbf{X}), \theta)$, in this case. For the absolute loss function $L_1(\hat{\theta}(\mathbf{X}), \theta) = |\hat{\theta}(\mathbf{X}) - \theta|$, however, the sample median would have been the optimal estimator. Despite his wholehearted embrace of the decision-theoretic framing, Lehmann [33] warned statisticians about the perils of arbitrary loss functions:

 "It is argued that the choice of a loss function, while less crucial than that of the model, exerts an important influence on the nature of the solution of a statistical decision problem, and that an arbitrary choice such as squared error may be baldly misleading as to the relative desirabil-ity of the competing procedures." (p. 425)

A strong case can be made that the key minimal property (necessary but not sufficient) for frequentist estimation is consistency, an extension of the Law of Large Numbers (LLN) to estimators more generally. For instance, consistency would have eliminated $\theta_{cb}$ from consideration because it is inconsistent. This makes intuitive sense because if an estimator $\hat{\theta}(\mathbf{X})$ cannot pinpoint $\theta^*$ with infinite data information, it should be considered irrelevant for learning about $\theta^*$. Indeed, there is nothing in the notion of admissibility that advances learning from data about $\theta^*$.

Further to efficiency relative to particular loss functions being a dubious property for frequentist estimators, the pertinent measure of finite sample precision for frequentist estimators is full efficiency, which is defined relative to the assumed statistical model in Eq. (1).

#### 4.2. Stein's paradox and admissibility


The quintessential example that has bolstered the appeal of the Bayesian claims concerning admissibility is the James-Stein estimator [34], which gave rise to an extensive literature on shrinkage estimators; see Ref. [35].

Let $\mathbf{X} := (X_1, X_2, \dots, X_m)$ be an independent sample from a Normal distribution:

$$X_k \sim \mathrm{NI}(\theta_k, \sigma^2),\ k = 1, 2, \dots, m,\tag{20}$$

where $\sigma^2$ is known. Using the notation $\boldsymbol{\theta} := (\theta_1, \theta_2, \dots, \theta_m)$ and $\mathbf{I}_m := \mathrm{diag}(1, 1, \dots, 1)$, this can be denoted by:

$$\mathbf{X} \sim \mathrm{N}(\boldsymbol{\theta}, \sigma^2 \mathbf{I}_m).$$

Find an optimal estimator $\hat{\boldsymbol{\theta}}(\mathbf{X})$ of $\boldsymbol{\theta}$ with respect to the square "overall" loss function:

$$L_2(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(\mathbf{X})) = \|\hat{\boldsymbol{\theta}}(\mathbf{X}) - \boldsymbol{\theta}\|^2 = \sum_{k=1}^{m} (\hat{\theta}_k(\mathbf{X}) - \theta_k)^2. \tag{21}$$

Stein [36] astounded the statistical world by showing that for $m = 2$ the least-squares (LS) estimator $\hat{\boldsymbol{\theta}}_{LS}(\mathbf{X}) = \mathbf{X}$ is admissible, but for $m > 2$ $\hat{\boldsymbol{\theta}}_{LS}(\mathbf{X})$ is inadmissible. Indeed, James and Stein [37] were able to come up with a nonlinear estimator:

$$\hat{\boldsymbol{\theta}}_{JS}(\mathbf{X}) = \left(1 - \frac{(m-2)\sigma^2}{\|\mathbf{X}\|^2}\right)\mathbf{X},\tag{22}$$

that became known as the James-Stein estimator, which dominates $\hat{\boldsymbol{\theta}}_{LS}(\mathbf{X}) = \mathbf{X}$ in $\mathrm{MSE}_1$ terms, by demonstrating that:

$$\mathrm{MSE}_1(\hat{\boldsymbol{\theta}}_{JS}(\mathbf{X}); \boldsymbol{\theta}) < \mathrm{MSE}_1(\hat{\boldsymbol{\theta}}_{LS}(\mathbf{X}); \boldsymbol{\theta}),\ \forall \boldsymbol{\theta} \in \mathbb{R}^m. \tag{23}$$

It turns out that $\hat{\boldsymbol{\theta}}_{JS}(\mathbf{X})$ is also inadmissible for $m > 2$ and dominated by the modified James-Stein estimator that is admissible:

$$\hat{\boldsymbol{\theta}}_{JS}^{+}(\mathbf{X}) = \left(1 - \frac{(m-2)\sigma^{2}}{\|\mathbf{X}\|^{2}}\right)^{+}\mathbf{X},\tag{24}$$

where $(z)^{+} = \max(0, z)$; see Ref. [17].
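
As a rough numerical check of Eqs. (22)–(24) and the dominance claim in Eq. (23), the Python sketch below simulates the model in Eq. (20) and estimates the overall MSE of the least-squares and positive-part James-Stein estimators by Monte Carlo. The particular vector of means, $\sigma^2$, $m$, the number of replications, and the random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

m, sigma2, reps = 10, 1.0, 20_000
theta = np.linspace(-1.0, 1.0, m)          # arbitrary true means theta_1, ..., theta_m

# One observation X_k per mean, replicated 'reps' times (model of Eq. (20))
X = rng.normal(loc=theta, scale=np.sqrt(sigma2), size=(reps, m))

ls = X                                      # least-squares estimator: theta_hat_LS = X

# Positive-part James-Stein estimator (Eq. (24))
shrink = 1.0 - (m - 2) * sigma2 / np.sum(X**2, axis=1, keepdims=True)
js = np.maximum(shrink, 0.0) * X

mse_ls = np.mean(np.sum((ls - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"overall MSE, LS : {mse_ls:.3f}  (theoretical value m*sigma^2 = {m * sigma2:.1f})")
print(f"overall MSE, JS+: {mse_js:.3f}  (smaller, in line with Eq. (23))")
```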

The traditional interpretation of this result is that for the Normal, Independent model in Eq. (20), the James-Stein estimator (15) of $\boldsymbol{\theta} := (\theta_1, \theta_2, \dots, \theta_m)$, for $m > 2$, reduces the overall $\mathrm{MSE}_1$ in Eq. (21). This result seems to imply that one will "do better" (in overall $\mathrm{MSE}_1$ terms) by using a combined nonlinear (shrinkage) estimator, instead of estimating these means separately. What is surprising about this result is that there is no statistical reason (due to independence) to connect the inferences pertaining to the different individual means, and yet the obvious estimator (LS) is inadmissible.

As argued next, this result calls into question the appropriateness of the notion of admissibility with respect to a particular loss function, and not the judiciousness of frequentist estimation.

## 5. Frequentist inference and learning from data

The objectives and underlying reasoning of frequentist inference are inadequately discussed in the statistics literature. As a result, some of its key differences from Bayesian inference remain beclouded.

### 5.1. Frequentist approach: primary objective and reasoning


All forms of parametric frequentist inference begin with a prespecified statistical model $\mathcal{M}_{\theta}(\mathbf{x}) = \{f(\mathbf{x}; \theta),\ \theta \in \Theta\},\ \mathbf{x} \in \mathbb{R}_X^n$. This model is chosen from the set of all possible models that could have given rise to data $\mathbf{x}_0 := (x_1, \dots, x_n)$, by selecting the probabilistic structure of the underlying stochastic process $\{X_t,\ t \in \mathbb{N} := (1, 2, \dots, n, \dots)\}$ in such a way as to render the observed data $\mathbf{x}_0$ a "typical" realization thereof. In light of the fact that each value of $\theta \in \Theta$ represents a different element of the family of models represented by $\mathcal{M}_{\theta}(\mathbf{x})$, the primary objective of frequentist inference is to learn from data about the "true" model:

$$\mathcal{M}^{*}(\mathbf{x}) = \{f(\mathbf{x}; \theta^{*})\},\ \mathbf{x} \in \mathbb{R}_X^n,\tag{25}$$

where $\theta^{*}$ denotes the true value of $\theta$ in $\Theta$. The "typicality" is testable vis-a-vis the data $\mathbf{x}_0$ using misspecification testing; see Ref. [38].

The frequentist approach relies on two modes of reasoning for inference purposes:

$$\begin{aligned} &\text{Factual (estimation, prediction):} && f(\mathbf{x}; \theta^{*}),\ \forall \mathbf{x} \in \mathbb{R}_X^n, \\ &\text{Hypothetical (hypothesis testing):} && f(\mathbf{x}; \theta_0),\ f(\mathbf{x}; \theta_1),\ \forall \mathbf{x} \in \mathbb{R}_X^n, \end{aligned} \tag{26}$$

where $\theta^{*}$ denotes the true value of $\theta$ in $\Theta$, and $\theta_i,\ i = 0, 1$, denote hypothesized values of $\theta$ associated with the hypotheses $H_0: \theta_0 \in \Theta_0$ and $H_1: \theta_1 \in \Theta_1$, where $\Theta_0$ and $\Theta_1$ constitute a partition of $\Theta$.

A frequentist estimator $\hat{\theta}$ aims to pinpoint $\theta^{*}$, and its optimality is evaluated by how effectively it achieves that. Similarly, a test statistic usually compares a good estimator $\hat{\theta}$ of $\theta$ with a prespecified value $\theta_0$; but behind $\hat{\theta}$ is the value $\theta^{*}$ assumed to have generated data $\mathbf{x}_0$. Hence, hypothetical reasoning is used in testing to learn about $\theta^{*}$, and has nothing to do with all possible values of $\theta$ in $\Theta$.

This contradicts misleading claims by Bayesian textbooks ([3], p. 61):

"The frequentist paradigm relies on this criterion [risk function] to compare estimators and, if 22 possible, to select the best estimator, the reasoning being that estimators are evaluated on their 23 long-run performance for all possible values of the parameter θ:"

Contrary to this claim, the only relevant value of $\theta$ in evaluating the "optimality" of $\hat{\theta}$ is $\theta^{*}$. Such misleading claims stem from an apparent confusion between the existential and universal quantifiers in framing certain inferential assertions.

The existence of $\theta^{*}$ can be formally defined using the existential quantifier:

$$\exists\, \theta^{*} \in \Theta: \quad \text{"there exists a } \theta^{*} \text{ in } \Theta \text{ such that } \dots \text{"}\tag{27}$$

This introduces a potential conflict between the existential quantifier and the universal quantifier "$\forall \theta \in \Theta$," because neither the decision-theoretic nor the Bayesian approach explicitly invokes $\theta^{*}$. Decision-theoretic and Bayesian rules are considered optimal when they minimize the expected loss $\forall \theta \in \Theta$, no matter what $\theta^{*}$ happens to be.

Any attempt to explain away the crucial differences between the two quantifiers can be easily scotched using elementary logic. The two quantifiers could not be more different since, using the logical connective for negation (¬), the equivalence between the two involves double negations:

$$(i)\ \exists\,\theta^\*\in\Theta\Leftrightarrow\neg\forall\,\theta\notin\Theta,\quad(ii)\forall\,\theta\in\Theta\Leftrightarrow\neg\exists\,\theta^\*\notin\Theta.\tag{28}$$

Similarly, invoking intuition to justify the quantifier $\forall \theta \in \Theta$ as innocuous and natural, on the grounds that one should care about the behavior of an estimator $\hat{\theta}$ for all possible values of $\theta$, is highly misleading. The behavior of $\hat{\theta}$ for all $\theta \in \Theta$, although relevant, is not what determines how effective a frequentist estimator is at pinpointing $\theta^{*}$; what matters is its sampling behavior around $\theta^{*}$. Assessing its effectiveness calls for evaluating (deductively) the sampling distribution of $\hat{\theta}$ under the factual $\theta = \theta^{*}$, or hypothetical values $\theta_0$ and $\theta_1$, and not for all possible values of $\theta$ in $\Theta$. Let's unpack the details of this claim.

#### 5.2. Frequentist estimation

The underlying reasoning for frequentist estimation is factual, in the sense that the optimality of an estimator is appraised in terms of the generic capacity of $\hat{\theta}_n(\mathbf{X})$ to zero in on (pinpoint) the true value $\theta^{*}$, whatever the sample realization $\mathbf{X} = \mathbf{x}_0$. Optimal properties like consistency, unbiasedness, full efficiency, sufficiency, etc., calibrate this generic capacity using the sampling distribution of $\hat{\theta}_n(\mathbf{X})$ evaluated under $\theta = \theta^{*}$, i.e., in terms of $f(\hat{\theta}_n(\mathbf{x}); \theta^{*})$, for $\mathbf{x} \in \mathbb{R}_X^n$. For instance, strong consistency asserts that as $n \to \infty$, $\hat{\theta}_n(\mathbf{X})$ will zero in on $\theta^{*}$ almost surely:

$$\mathbb{P}(\lim\_{n\to\infty}\hat{\theta}\_{n}(\mathbf{X})=\theta^{\*})=1.\tag{29}$$

Similarly, unbiasedness asserts that the mean of $\hat{\theta}_n(\mathbf{X})$ is the true value $\theta^{*}$:

$$E(\hat{\theta}\_n(\mathbf{X})) = \theta^\*. \tag{30}$$

In this sense, both of these optimal properties are defined at the point $\theta = \theta^{*}$. This is achieved by using factual reasoning, i.e., evaluating the sampling distribution of $\hat{\theta}_n(\mathbf{X})$ under the true state of Nature ($\theta = \theta^{*}$), without having to know $\theta^{*}$. This is in contrast to using loss functions, such as Eq. (2), which are defined in terms of $\theta^{*}$ but are rendered nonoperational without knowing $\theta^{*}$.

Example. In the case of the simple Normal model in Eq. (18), the point estimator $\overline{X}_n$ is consistent, unbiased, fully efficient, and sufficient, with a sampling distribution:

$$\overline{X}\_n \sim \mathcal{N}\left(\theta, \frac{1}{n}\right). \tag{31}$$

What is not usually appreciated sufficiently is that the evaluation of that distribution is factual, i.e., $\theta = \theta^{*}$, and should formally be denoted by:


$$\overline{X}\_n \stackrel{\theta=\theta^\*}{\sim} \mathbf{N}\left(\theta^\*, \frac{1}{n}\right). \tag{32}$$

When $\overline{X}_n$ is standardized, it yields the pivotal function:


$$d(\mathbf{X}; \theta) := \sqrt{n}(\overline{X}\_n - \theta^\*) \stackrel{\theta = \theta^\*}{\sim} \mathbf{N}(0, 1), \tag{33}$$

whose distribution only holds for the true $\theta^{*}$, and no other value. This provides the basis for constructing a $(1-\alpha)$ confidence interval (CI):

$$\mathbb{P}\left(\overline{X}_n - c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right) \le \theta \le \overline{X}_n + c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right);\ \theta = \theta^{*}\right) = 1 - \alpha,\tag{34}$$

which asserts that the random interval $\left[\overline{X}_n - c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right),\ \overline{X}_n + c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right)\right]$ will cover (overlay) the true mean $\theta^{*}$, whatever that happens to be, with probability $(1-\alpha)$; or equivalently, the error of coverage is $\alpha$. Hence, the frequentist evaluation of the coverage error probability depends only on the sampling distribution of $\overline{X}_n$ and is attached to the random interval, for all values $\theta \neq \theta^{*}$, without requiring one to know $\theta^{*}$.
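
A minimal simulation of Eq. (34) (a Python sketch; the true mean $\theta^{*} = 0.5$, the sample size, $\alpha$, the number of replications, and the seed are arbitrary illustrative choices) brings out the factual interpretation: the coverage probability is a property of the random interval evaluated under $\theta = \theta^{*}$, and it can be checked by simulation without $\theta^{*}$ being known in any single realization.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

theta_star, n, alpha, reps = 0.5, 50, 0.05, 100_000
c = norm.ppf(1 - alpha / 2)                        # c_{alpha/2} for the N(0,1) pivot

x = rng.normal(theta_star, 1.0, size=(reps, n))    # X_k ~ NIID(theta*, 1)
xbar = x.mean(axis=1)
lower = xbar - c / np.sqrt(n)
upper = xbar + c / np.sqrt(n)

coverage = np.mean((lower <= theta_star) & (theta_star <= upper))
print(f"empirical coverage = {coverage:.4f}  (nominal 1 - alpha = {1 - alpha})")
```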

The evaluation at $\theta = \theta^{*}$ calls into question the decision-theoretic definition of unbiasedness:

$$E\_1(\hat{\theta}\_n(\mathbf{X})) = \theta, \forall \theta \in \Theta,\tag{35}$$

in the context of frequentist estimation, since this assertion makes sense only when defined at $\theta = \theta^{*}$. Similarly, the appropriate frequentist definition of the MSE for an estimator, initially proposed by Fisher [39], is defined at the point $\theta = \theta^{*}$:

$$MSE(\hat{\theta}\_n(\mathbf{X}); \theta^\*) = E(\hat{\theta}\_n(\mathbf{X}) - \theta^\*)^2, \text{for } \theta^\* \text{ in } \Theta. \tag{36}$$

Indeed, the well-known decomposition:

$$MSE(\hat{\boldsymbol{\theta}}(\mathbf{X}); \boldsymbol{\theta}^\*) = Var(\hat{\boldsymbol{\theta}}(\mathbf{X})) + \left[ \mathbb{E}(\hat{\theta}\_n(\mathbf{X})) - \boldsymbol{\theta}^\* \right]^2, \text{for } \boldsymbol{\theta}^\* \text{ in } \boldsymbol{\Theta}, \tag{37}$$

is meaningful only when defined at the point $\theta = \theta^{*}$ (true mean) since, by definition:

$$\begin{array}{l}\text{Var}(\hat{\theta}(\mathbf{X})) = E[\hat{\theta}\_n(\mathbf{X}) - \theta\_m]^2, \,\theta\_m = E(\hat{\theta}\_n(\mathbf{X})) \\\text{Bias}(\hat{\theta}\_n(\mathbf{X}); \theta^\*) = E(\hat{\theta}\_n(\mathbf{X})) - \theta^\*,\end{array} \tag{38}$$

and thus the variance and the bias involve only two values of $\theta$ in $\Theta$, namely $\theta_m$ and $\theta^{*}$, and when $\theta_m = \theta^{*}$ the estimator is unbiased. This implies that the affinity between the $\mathrm{MSE}_1$ defined in Eq. (13) and the variance of an estimator is more apparent than real, because the latter makes frequentist sense only when $\theta_m = E(\hat{\theta}_n(\mathbf{X}))$ is a single point.
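
The decomposition in Eqs. (36)–(38) can be verified numerically. The sketch below (Python; the true value $\theta^{*}$, the deliberately biased estimator, the sample size, and the seed are arbitrary illustrative choices) estimates the variance and the bias at $\theta = \theta^{*}$ and checks that their combination reproduces the MSE at $\theta^{*}$.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_star, n, reps = 1.5, 20, 200_000
x = rng.normal(theta_star, 1.0, size=(reps, n))

# A deliberately biased estimator: the sample mean shrunk toward zero.
theta_hat = 0.8 * x.mean(axis=1)

theta_m = theta_hat.mean()                       # E[theta_hat], cf. Eq. (38)
var = np.mean((theta_hat - theta_m) ** 2)        # Var(theta_hat)
bias = theta_m - theta_star                      # bias at theta = theta*
mse = np.mean((theta_hat - theta_star) ** 2)     # MSE at theta = theta*, Eq. (36)

print(f"Var + Bias^2 = {var + bias**2:.5f}")
print(f"MSE          = {mse:.5f}   (the two agree, cf. Eq. (37))")
```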

#### 5.3. James-Stein estimator from a frequentist perspective

For a proper frequentist evaluation of the above James-Stein result, it is important to bring out the conflict between the overall MSE (14) and the factual reasoning underlying frequentist estimation. From the latter perspective, the James-Stein estimator raises several issues of concern.

First, both the least-squares $\hat{\boldsymbol{\theta}}_{LS}(\mathbf{X})$ and the James-Stein $\hat{\boldsymbol{\theta}}_{JS}(\mathbf{X})$ estimators are inconsistent estimators of $\boldsymbol{\theta}$, since the underlying model suffers from the incidental parameter problem: there is essentially one observation ($X_k$) for each unknown parameter ($\theta_k$), and as $m \to \infty$ the number of unknown parameters increases at the same rate. To bring out the futility of comparing these two estimators more clearly, consider the following simpler example.

Example. Let $\mathbf{X} := (X_1, X_2, \dots, X_n)$ be a sample from the simple Normal model in Eq. (18). Comparing the two estimators $\hat{\theta}_1 = X_n$ and $\hat{\theta}_2 = \frac{1}{2}(X_1 + X_n)$, and inferring that $\hat{\theta}_2$ is relatively more efficient than $\hat{\theta}_1$ relative to a square loss function, i.e.,

$$\mathrm{MSE}(\hat{\theta}_2(\mathbf{X}); \theta) = \tfrac{1}{2} < \mathrm{MSE}(\hat{\theta}_1(\mathbf{X}); \theta) = 1,\ \forall \theta \in \mathbb{R}, \tag{39}$$

is totally uninteresting because both estimators are inconsistent!
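
The point can be checked directly. The sketch below (Python; $\theta^{*}$, the sample sizes, and the seed are arbitrary illustrative choices, and only $X_1$ and $X_n$ need to be simulated since no other observation enters either estimator) estimates the two MSEs for increasing $n$: they stay at $1$ and $1/2$ respectively, so neither estimator improves as more data become available.

```python
import numpy as np

rng = np.random.default_rng(3)

theta_star, reps = 0.7, 200_000
for n in [10, 100, 1000]:
    # Only X_1 and X_n enter either estimator, so only those two observations are simulated.
    x1 = rng.normal(theta_star, 1.0, size=reps)
    xn = rng.normal(theta_star, 1.0, size=reps)
    t1 = xn                    # theta_hat_1 = X_n, a single observation
    t2 = 0.5 * (x1 + xn)       # theta_hat_2 = (X_1 + X_n)/2
    print(f"n={n:>5d}  MSE(theta_hat_1)={np.mean((t1 - theta_star)**2):.3f}  "
          f"MSE(theta_hat_2)={np.mean((t2 - theta_star)**2):.3f}")
```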

Second, to be able to discuss the role of admissibility in the Stein [37] result, we need to consider a consistent James-Stein estimator, by extending the original data to panel (longitudinal) data where the sample is:

$\mathbf{X}_t := (X_{1t}, X_{2t}, \dots, X_{mt}),\ t = 1, 2, \dots, n$. In this case, the consistent least-squares and James-Stein estimators are:

$$\begin{split} \hat{\boldsymbol{\theta}}_{LS}(\mathbf{X}) &= (\overline{X}_1, \overline{X}_2, \dots, \overline{X}_m), \text{ where } \overline{X}_k = \frac{1}{n} \sum_{t=1}^n X_{kt},\ k = 1, 2, \dots, m, \\ \hat{\boldsymbol{\theta}}_{JS}^{+}(\mathbf{X}) &= \left(1 - \frac{(m-2)\sigma^2}{\|\overline{\mathbf{X}}\|^2}\right)^{+} \overline{\mathbf{X}}, \text{ where } \overline{\mathbf{X}} := (\overline{X}_1, \overline{X}_2, \dots, \overline{X}_m). \end{split} \tag{40}$$

This enables us to evaluate the notion of "relatively better" more objectively.
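
As a rough numerical illustration of Eq. (40), the Python sketch below simulates a panel with arbitrary true means and applies the group-mean (LS) estimator and a positive-part James-Stein shrinkage of it. One assumption is made here: the shrinkage constant is scaled by the variance of each group mean, $\sigma^2/n$, so that the shrunk estimator remains consistent as $n$ grows; $m$, $n$, $\sigma^2$, the true means, and the seed are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(4)

m, sigma2 = 10, 1.0
theta_star = np.linspace(-1.0, 1.0, m)            # arbitrary true means

def panel_estimators(n):
    """LS and positive-part JS estimators from a panel X_t, t = 1..n (cf. Eq. (40))."""
    X = rng.normal(theta_star, np.sqrt(sigma2), size=(n, m))
    xbar = X.mean(axis=0)                         # theta_hat_LS = (X-bar_1, ..., X-bar_m)
    # Shrinkage scaled by sigma^2/n (assumption: the variance of each group mean).
    shrink = max(1.0 - (m - 2) * (sigma2 / n) / np.sum(xbar**2), 0.0)
    return xbar, shrink * xbar                    # (LS, JS+)

for n in [10, 1000, 100_000]:
    ls, js = panel_estimators(n)
    print(f"n={n:>7,d}  ||LS - theta*|| = {np.linalg.norm(ls - theta_star):.4f}  "
          f"||JS+ - theta*|| = {np.linalg.norm(js - theta_star):.4f}")
```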

Admissibility relative to the overall loss function in Eq. (21) introduces a trade-off between the accuracy of the estimators for the individual parameters $\boldsymbol{\theta} := (\theta_1, \theta_2, \dots, \theta_m)$ and the "overall" expected loss. The question is: in what sense does the overall MSE among a group of mean estimates provide a better measure of "error" in learning about the true values $\boldsymbol{\theta}^{*} := (\theta^{*}_1, \theta^{*}_2, \dots, \theta^{*}_m)$? The short answer is: it does not. Indeed, the overall MSE will be irrelevant when the primary objective of estimation is to learn from data about $\boldsymbol{\theta}^{*}$. This is because the particular loss function penalizes the estimator's capacity to pinpoint $\boldsymbol{\theta}^{*}$ by trading an increase in bias for a decrease in the overall MSE in Eq. (21), when the latter is misleadingly evaluated over all $\boldsymbol{\theta}$ in $\Theta := \mathbb{R}^m$. That is, the James-Stein estimator flouts the primary objective of pinpointing $\boldsymbol{\theta}^{*}$ in favor of reducing the overall MSE $\forall \boldsymbol{\theta} \in \Theta$.

In summary, the above discussion suggests that there is nothing paradoxical about Stein's [37] original result. What is problematic is not the least-squares estimator, but the choice of "better" in terms of admissibility relative to an overall MSE in evaluating the accuracy of the estimators of $\boldsymbol{\theta}$.

#### 5.4. Frequentist hypothesis testing


Another frequentist inference procedure one can employ to learn from data about $\theta^{*}$ is hypothesis testing, where the question posed is whether $\theta^{*}$ is close enough to some prespecified value $\theta_0$. In contrast to estimation, the reasoning underlying frequentist testing is hypothetical in nature.

#### 5.4.1. Legitimate frequentist error probabilities

For testing the hypotheses:

$H_0: \theta \le \theta_0$ vs. $H_1: \theta > \theta_0$, where $\theta_0$ is a prespecified value,

one utilizes the same sampling distribution $\overline{X}_n \sim \mathrm{N}\left(\theta, \frac{1}{n}\right)$, but transforms the pivot $d(\mathbf{X}; \theta^{*}) := \sqrt{n}(\overline{X}_n - \theta^{*})$ into a test statistic by replacing $\theta^{*}$ with the prespecified value $\theta_0$, yielding $d(\mathbf{X}) := \sqrt{n}(\overline{X}_n - \theta_0)$. However, instead of evaluating it under the factual $\theta = \theta^{*}$, it is now evaluated under various hypothetical scenarios associated with $H_0$ and $H_1$ to yield two types of (hypothetical) sampling distributions:

$$\textbf{(I)}\qquad d(\mathbf{X}) := \sqrt{n}(\overline{X}_n - \theta_0) \stackrel{\theta=\theta_0}{\sim} \mathrm{N}(0, 1),$$

$$\textbf{(II)}\qquad d(\mathbf{X}) := \sqrt{n}(\overline{X}_n - \theta_0) \stackrel{\theta=\theta_1}{\sim} \mathrm{N}(\delta_1, 1),\ \ \delta_1 = \sqrt{n}(\theta_1 - \theta_0),\ \text{for } \theta_1 > \theta_0.$$

In both cases, (I) and (II), the underlying reasoning is hypothetical, in the sense that the factual value in Eq. (33) is replaced by hypothesized values of $\theta$, and the test statistic $d(\mathbf{X})$ provides a standardized distance between the hypothesized values ($\theta_0$ or $\theta_1$) and $\theta^{*}$, the true $\theta$ assumed to underlie the generation of the data $\mathbf{x}_0$, yielding $d(\mathbf{x}_0)$. Using the sampling distribution in (I), one can define the following legitimate error probabilities:

$$\begin{aligned} \text{significance level}: &\quad \mathbb{P}(d(\mathbf{X}) > c_{\alpha};\ H_0) = \alpha, \\ \text{p-value}: &\quad \mathbb{P}(d(\mathbf{X}) > d(\mathbf{x}_0);\ H_0) = p(\mathbf{x}_0). \end{aligned} \tag{41}$$

Using the sampling distribution in (II), one can define:

$$\begin{aligned} \text{type II error probability}: &\quad \mathbb{P}(d(\mathbf{X}) \le c_{\alpha};\ \theta = \theta_1) = \beta(\theta_1), \text{ for } \theta_1 > \theta_0, \\ \text{power}: &\quad \mathbb{P}(d(\mathbf{X}) > c_{\alpha};\ \theta = \theta_1) = \rho(\theta_1), \text{ for } \theta_1 > \theta_0. \end{aligned} \tag{42}$$

It can be shown that the test $T_{\alpha}$, defined by the test statistic $d(\mathbf{X})$ and the rejection region $C_1(\alpha) = \{\mathbf{x}: d(\mathbf{x}) > c_{\alpha}\}$, constitutes a uniformly most powerful (UMP) test for significance level $\alpha$; see Ref. [9]. The type I [II] error probability is associated with test $T_{\alpha}$ erroneously rejecting [accepting] $H_0$. The type I and II error probabilities evaluate the generic capacity [whatever the sample realization $\mathbf{x} \in \mathbb{R}_X^n$] of a test to reach correct inferences. Contrary to Bayesian claims, these error probabilities have nothing to do with the temporal or the physical dimension of the long-run metaphor associated with repeated samples. The relevant feature of the long-run metaphor is the repeatability (in principle) of the DGM represented by $\mathcal{M}_{\theta}(\mathbf{x})$; this feature can be easily operationalized using computer simulation; see Ref. [40].
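
The error probabilities in Eqs. (41)–(42) can be computed directly from the two hypothetical sampling distributions (I) and (II). The sketch below (Python; the numbers chosen for $\theta_0$, $\theta_1$, $n$, $\alpha$, and the observed sample mean are arbitrary illustrative inputs) evaluates the rejection threshold, a p-value, the type II error probability, and the power for the one-sided test above.

```python
import numpy as np
from scipy.stats import norm

theta0, theta1, n, alpha = 0.0, 0.2, 100, 0.05   # hypothesized values and design (illustrative)
xbar_obs = 0.23                                   # an assumed observed sample mean

c_alpha = norm.ppf(1 - alpha)                     # rejection threshold for d(X) under (I)
d_obs = np.sqrt(n) * (xbar_obs - theta0)          # observed test statistic d(x0)
p_value = 1 - norm.cdf(d_obs)                     # P(d(X) > d(x0); H0), Eq. (41)

delta1 = np.sqrt(n) * (theta1 - theta0)           # noncentrality under (II)
beta = norm.cdf(c_alpha - delta1)                 # P(d(X) <= c_alpha; theta = theta1), Eq. (42)
power = 1 - beta                                  # P(d(X) > c_alpha; theta = theta1)

print(f"c_alpha = {c_alpha:.3f}, d(x0) = {d_obs:.3f}, p-value = {p_value:.4f}")
print(f"type II error beta(theta1) = {beta:.4f}, power = {power:.4f}")
```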

The key difference between the significance level $\alpha$ and the p-value is that the former is a pre-data and the latter a post-data error probability. Indeed, the p-value can be viewed as the smallest significance level $\alpha$ at which $H_0$ would have been rejected with data $\mathbf{x}_0$. The legitimacy of post-data error probabilities underlying the hypothetical reasoning can be used to go beyond the N-P accept/reject rules and provide an evidential interpretation pertaining to the discrepancy $\gamma$ from the null warranted by data $\mathbf{x}_0$; see Ref. [41].

Despite the fact that frequentist testing uses hypothetical reasoning, its main objective is also to learn from data about the true model $\mathcal{M}^{*}(\mathbf{x}) = \{f(\mathbf{x}; \theta^{*})\},\ \mathbf{x} \in \mathbb{R}_X^n$. This is because a test statistic like $d(\mathbf{X}) := \sqrt{n}(\overline{X}_n - \theta_0)$ constitutes nothing more than a scaled distance between $\theta^{*}$ [the value behind the generation of $\mathbf{x}_0$] and a hypothesized value $\theta_0$, with $\theta^{*}$ being replaced by its "best" estimator $\overline{X}_n$.

## 6. Revisiting loss and risk functions

The above discussion raises serious doubts about the role of loss functions and admissibility in evaluating learning from data $\mathbf{x}_0$ about $\theta^{*}$. To understand why the decision-theoretic framing misrepresents the frequentist approach, one needs to consider the role of loss functions in statistical inference more generally.

## 6.1. Where do loss functions come from?

A closer scrutiny of the decision-theoretic setup reveals that the loss function needs to invoke "information from sources other than the data," which is usually not readily available. Indeed, such information is available in very restrictive situations, such as acceptance sampling in quality control. In light of that, a proper understanding of the intended scope of statistical inference calls for distinguishing the special cases where the loss function is part and parcel of the available substantive information from those where no such information is either relevant or available.

Tiao and Box [25], p. 624, reiterated Fisher's [42] distinction:

 "Now it is undoubtedly true that on the one hand that situations exist where the loss function is at least approximately known (for example, certain problems in business) and sampling inspection are of this sort. … On the other hand, a vast number of inferential problems occur, particularly in the analysis of scientific data, where there is no way of knowing in advance to what use the results of research will subsequently be put."

Cox [43] went further and questioned this framing even in cases where the inference might involve a decision:

35 "The reasons that the detailed techniques [decision-theoretic] seem of fairly limited applica-36 bility, even when a fairly clear cut decision element is involved, may be (i) that, except in such fields as control theory and acceptance sampling, a major contribution of statistical technique is in presenting the evidence in incisive form for discussion, rather than in providing mechanical presentation for the final decision. This is especially the case when a single major decision is involved. (ii) The central difficulty may be in formulating the elements required for the quantitative analysis, rather than in combining these elements via a decision rule." (p. 45)

Another important aspect of using loss functions in inference is that in practice they seem to be an add-on to the inference itself, since they bring to the problem information other than the data. In particular, the same statistical inference problem can give rise to very different decisions/actions depending on one's loss function. To illustrate that, consider an example from [44]:

 "… consider the case of a new drug whose effects are studied by a research scientist attached to the laboratory of a pharmaceutical company. The conclusion of the study may have different bearings on the action to be taken by (a) the scientist whose line of further investigation would depend on it, (b) the company whose business decisions would determined by it, and (c) the Government whose policies as to health care, drug control, etc., would take shape on that basis." (p. 72)

In practice, each one of these different agents is likely to have a very different loss function, but their inferences should have a common denominator: the scientific evidence pertaining to $\theta^{*}$, the true $\theta$, that stems solely from the observed data.

## 6.2. Decisions vs. inferences


The above discussion brings out the crucial distinction between a "decision" and an "inference" stemming from data $\mathbf{x}_0$. Even before Wald [5] introduced the decision-theoretic perspective, Fisher [42] perceptively argued:

 "In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence." (pp. 25–26)

Tukey [14] echoed Fisher's view by contrasting decisions vs. inferences:

 "Like any other human endeavor, science involves many decisions, but it progresses by the building up of a fairly well established body of knowledge. This body grows by the reaching of conclusions — by acts whose essential characteristics differ widely from the making of deci- sions. Conclusions are established with careful regard to evidence, but without regard to consequences of specific actions in specific circumstances." (p. 425)

 Hacking [45] brought out the key difference between an "inference pertaining to evidence" for or against a hypothesis, and a "decision to do something" as a result of an inference:

 "… to conclude that an hypothesis is best supported is, apparently, to decide that the hypoth-esis in question is best supported. Hence it is a decision like any other. But this inference is fallacious. Deciding that something is the case differs from deciding to do something. … Hence deciding to do something falls squarely in the province of decision theory, but deciding that something is the case does not." (p. 31)

This issue was elaborated upon by Birnbaum [15], p. 19:

"Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to "decisions" in a concrete literal sense as in acceptance sampling; and evidential, applicable to "decisions" such as "reject H0" in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest."

#### 6.3. Loss functions vs. inherent distance functions

The notion of a loss function stemming from "information other than the data" raises another source of potential conflict. This stems from the fact that within each statistical model $\mathcal{M}_{\theta}(\mathbf{x})$ there exists an inherent statistical distance function, often relating to the log-likelihood and the score function, which constitutes information contained in the data; see Ref. [46].

It is well known that when the distribution underlying $\mathcal{M}_{\theta}(\mathbf{x})$ is Normal, the inherent distance function for comparing estimators of the mean ($\theta$) is the square distance:

$$ND(\hat{\theta}\_n(\mathbf{X}); \theta^\*) = (\hat{\theta}\_n(\mathbf{X}) - \theta^\*)^2. \tag{43}$$

On the other hand, when the distribution is Laplace, the relevant statistical distance function is the absolute distance (see Ref. [47]):

$$\mathrm{AD}(\hat{\theta}_n(\mathbf{X}); \theta^{*}) = |\hat{\theta}_n(\mathbf{X}) - \theta^{*}|. \tag{44}$$

Similarly, when the distribution underlying $\mathcal{M}_{\theta}(\mathbf{x})$ is uniform, the inherent distance function is:

$$\mathrm{SUP}(\hat{\theta}_n(\mathbf{X}); \theta^{*}) = \sup_{\mathbf{x} \in \mathbb{R}_X^n} |\hat{\theta}_n(\mathbf{x}) - \theta^{*}|. \tag{45}$$

Note that these distance functions are defined at the point $\theta = \theta^{*}$ and not for all $\theta$ in $\Theta$, as is the case with traditional loss functions.

The dilemma facing a Bayesian or a decision-theoretic statistician is to decide when it makes sense to override the MLE and select the optimal rule stemming from an externally given loss function. The dilemma is not as trivial as it might seem at first sight, for two reasons. First, the key difference between the two is that the assumptions of the likelihood function $L(\theta)$ are testable vis-a-vis the data, but those underlying the loss function are not. Second, the likelihood function renders the notion of efficiency "global," i.e., full efficiency, in terms of Fisher's information:

$$\mathcal{CR}(\boldsymbol{\theta}^\*) = \boldsymbol{I}\_n^{-1}(\boldsymbol{\theta}^\*), \boldsymbol{I}\_n(\boldsymbol{\theta}^\*) := \boldsymbol{E}\left(-\frac{\partial^2 \ln \boldsymbol{L}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top}\right). \tag{46}$$

Hence, the optimality of an estimator can be affirmed using testable information comprising the statistical model $\mathcal{M}_{\theta}(\mathbf{x})$. This is in direct contrast with admissibility, which is a property defined in terms of "local" efficiency (relative to a loss function) based on external (nontestable) information.
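
For the simple Normal model in Eq. (18), the Fisher information in Eq. (46) has a closed form, and the full efficiency of $\overline{X}_n$ can be checked against it. The short sketch below (Python; $\theta^{*}$, $n$, the number of replications, and the seed are arbitrary illustrative choices) compares the simulated variance of $\overline{X}_n$ with the Cramér-Rao bound $I_n^{-1}(\theta^{*}) = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(5)

theta_star, n, reps = 0.3, 25, 200_000

# For X_k ~ NIID(theta, 1): ln L(theta) = const - 0.5 * sum_k (X_k - theta)^2,
# so -d^2 ln L / d theta^2 = n, and the Cramer-Rao bound (Eq. (46)) is 1/n.
cr_bound = 1.0 / n

xbar = rng.normal(theta_star, 1.0, size=(reps, n)).mean(axis=1)
print(f"Var(X-bar_n) ~= {xbar.var():.5f}   Cramer-Rao bound 1/n = {cr_bound:.5f}")
```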

## 6.4. Acceptance sampling vs. learning from data


Let us bring out the key features of a situation where the above decision-theoretic setup makes perfectly good sense. This is the situation Fisher [12] called acceptance sampling, such as an industrial production process where the objective is quality control, i.e., to make a decision pertaining to shipping substandard products (e.g., nuts and bolts) to a buyer, using the expected loss/gain as the ultimate criterion.

In an acceptance sampling context, the $\mathrm{MSE}(\hat{\theta}(\mathbf{X}); \theta)$, or some other risk function, is relevant because it evaluates genuine losses associated with a decision related to the choice of an estimate $\hat{\theta}(\mathbf{x}_0)$, say the cost of the observed percentage of defective products, but that has nothing to do with type I and II error probabilities.

Acceptance sampling differs from a scientific enquiry in two crucial respects:


The key difference between acceptance sampling and a scientific inquiry is that the primary objective of the latter is not to minimize expected loss (costs and utility) associated with different values of $\theta \in \Theta$, but to use data $\mathbf{x}_0$ to learn about the "true" model (17). The two situations are drastically different, mainly because the key notion of a "true $\theta$" calls into question the above acceptance sampling setup. Indeed, the loss function, being defined "$\forall \theta \in \Theta$," will penalize $\theta^{*}$, since there is no reason to expect that the highest ranked $\theta$ would coincide with $\theta^{*}$, unless by accident.

The extreme relativism of loss function optimality renders decision-theoretic and Bayes rules highly vulnerable to abuse. In practice, one can justify any estimator as optimal, however lame in terms of other criteria, by selecting an "appropriate" loss function.

Example 1. Consider a manufacturer of high-precision bolts and nuts who has information that the buyer only checks the first and last box for quality control when accepting an order. This suggests that to minimize losses stemming from the return of its products as defective, an appropriate loss function might be:

$$L(\mathbf{X}; \boldsymbol{\theta}) = \left( \left[ (\mathbf{X}\_1 + \mathbf{X}\_n)/2 \right] - \boldsymbol{\theta} \right)^2, \boldsymbol{\theta} \in (0, 1). \tag{47}$$

From the acceptance sampling perspective, the "optimal" estimator $\tilde{\theta} = (X_1 + X_n)/2$ is excellent because it minimizes the expected losses, but it is a terrible estimator for pinpointing $\theta^{*}$ because it is inconsistent!

Consider a more general case where acceptance sampling resembles hypothesis testing, insofar as final products are randomly selected for inspection during the production process. In such a situation, the main objective can be viewed as operationalizing the probabilities of false acceptance/rejection with a view to minimizing the expected losses. The conventional wisdom has been that this situation is similar enough to Neyman-Pearson (N-P) testing to render the latter the appropriate framing for the decision to ship a particular batch or not. However, a closer look at some of the examples used to illustrate such a situation [48] reveals that the decisions are driven exclusively by the risk function and not by any quest to learn from data about the true $\theta^{*}$. For instance, the N-P way of addressing the trade-off between the two types of error probabilities, fixing $\alpha$ at a small value and seeking a test that minimizes the type II error probability, seems utterly irrelevant in such a context. One can easily think of a loss function where the "optimal" trade-off calls for a much larger type I than type II error probability. As argued in Ref. [14]:

 "Wald's decision theory … has given up fixed probability of errors of the first kind, and has focused on gains, losses or regrets." (p. 433)

 Indeed, Wald [5] was the first to highlight that the decision-theoretic notion of "optimality" revolves around a particular loss function:

 "The "best" system of regions of acceptance … will depend only on the weight function of the errors." ([5], p. 302)

 Given the crucial differences in [a]–[c], one can make a strong case that the objectives and the underlying reasoning of acceptance sampling are drastically different from those pertaining to learning from data in a scientific context.

#### 6.5. Is expected loss a legitimate frequentist error?

 The key question is: what do expected losses and traditional frequentist errors, such as bias, MSE and the type I–II errors, have in common, if anything?

First, they stem directly from the statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, since the underlying sampling distributions of estimators, test statistics, and predictors are derived exclusively from the distribution of the sample $f(\mathbf{x}; \theta)$ through Eq. (7). In this sense, the relevant error probabilities are directly related to statistical information pertaining to the data as summarized by the statistical model $\mathcal{M}_{\theta}(\mathbf{x})$ itself.

Second, they are attached to a particular frequentist inference procedure as they relate to a relevant inferential claim. These error probabilities calibrate the effectiveness of inference procedures in learning from data about the true statistical model $\mathcal{M}^{*}(\mathbf{x}) = \{f(\mathbf{x}; \theta^{*})\},\ \mathbf{x} \in \mathbb{R}_X^n$.

In light of these features, the question is: in what sense could a risk function potentially represent relevant frequentist errors? The argument that the risk function represents legitimate frequentist errors because it is derived by taking expectations with respect to $f(\mathbf{x}; \theta),\ \mathbf{x} \in \mathbb{R}_X^n$ [3], is misguided for two reasons.


## 7. Summary and conclusions


The paper makes a case for Fisher's [12, 42] assertions concerning the appropriateness of the decision-theoretic framing for "acceptance sampling" and its inappropriateness for frequentist inference. A closer look at this framing reveals that it is congruent with the Bayesian approach because it supplements the posterior distribution with a theory of optimal inference. Decision-theoretic and Bayesian rules are considered optimal when they minimize the expected loss for all possible values of $\theta$ [$\forall \theta \in \Theta$], irrespective of what the true value $\theta^{*}$ happens to be. In contrast, the theory of optimal frequentist inference revolves around the true value $\theta^{*}$, since it depends entirely on the capacity of the procedure to pinpoint $\theta^{*}$. The frequentist approach relies on factual (estimation and prediction), as well as hypothetical (testing), reasoning, both of which revolve around the existential quantifier $\exists \theta^{*} \in \Theta$. The inappropriateness of the quantifier $\forall \theta \in \Theta$ calls into question the relevance of admissibility as a minimal property for frequentist estimators. A strong case can be made that the relevant minimal property for frequentist estimators is consistency. In addition, full efficiency provides the relevant measure of an estimator's finite-sample efficiency (accuracy) in pinpointing $\theta^{*}$. Both of these properties stem from the underlying statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, in contrast to admissibility, which relies on loss functions based on information other than the data.

It is argued that Stein's [36] result stems from the fact that admissibility introduces a trade-off between the accuracy of the estimator in pinpointing $\theta^{*}$ and the "overall" expected loss. That is, the James-Stein estimator achieves a lower overall MSE by blunting the capacity of a frequentist estimator to pinpoint $\theta^{*}$. Why would a frequentist care about the overall MSE defined for all $\theta$ in $\Theta$? After all, expected losses are not legitimate errors similar to bias and MSE (when properly defined), as well as coverage, type I and II errors. The latter are attached to the frequentist procedures themselves to calibrate their capacity to achieve learning from data about $\theta^{*}$. In contrast, expected losses are assigned to different values of $\theta$ in $\Theta$, using information other than the data.

## Author details

Aris Spanos

Address all correspondence to: aris@vt.edu

Department of Economics, Virginia Tech, Blacksburg, VA, USA

## References

[1] Wald, A. Statistical Decision Functions. NY: Wiley; 1950.

[2] Berger, J.O. Statistical Decision Theory and Bayesian Analysis, 2nd ed. NY: Springer; 1985.

[3] Robert, C.P. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd ed. NY: Springer; 2001.

[4] O'Hagan, A. Bayesian Inference. London: Edward Arnold; 1994.

[5] Wald, A. Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics. 1939; 10: 299–326.

[6] Neyman, J. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London, Series A. 1937; 236: 333–380.

[7] Neyman, J. Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. Washington: U.S. Department of Agriculture; 1952.

[8] Neyman, J. Foundations of Behavioristic Statistics. In: Godambe, V. and Sprott, D., eds. Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston of Canada; 1971: pp. 1–13.

[9] Lehmann, E.L. Testing Statistical Hypotheses. NY: Wiley; 1959.

[10] LeCam, L. Asymptotic Methods in Statistical Decision Theory. NY: Springer; 1986.

[11] Neyman, J. A Selection of Early Statistical Papers by J. Neyman. Moss Landing, CA: University of California Press; 1967.

[12] Fisher, R.A. Statistical methods and scientific induction. Journal of the Royal Statistical Society, B. 1955; 17: 69–78.

[13] Cox, D.R. Some problems connected with statistical inference. The Annals of Mathematical Statistics. 1958; 29: 357–372.

[14] Tukey, J.W. Conclusions vs Decisions. Technometrics. 1960; 2: 423–433.

[15] Birnbaum, A. The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory. Synthese. 1977; 36: 19–49.

[35] Saleh, A.K. Md. E. Theory of Preliminary Test and Stein-Type Estimation with Applications. NY: Wiley; 2006.

[36] Stein, C. Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. 1956; 1: 197–206.

[37] James, W. and Stein, C. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. 1961; 1: 361–379.

[38] Spanos, A. Where do statistical models come from? Revisiting the problem of specification. In: Rojo, J., ed. The Second Erich L. Lehmann Symposium. Lecture Notes-Monograph Series, vol. 49. Hayward, CA: Institute of Mathematical Statistics; 2006: pp. 98–119.

[39] Fisher, R.A. A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices of the Royal Astronomical Society. 1920; 80: 758–770.

[40] Spanos, A. A frequentist interpretation of probability for model-based inductive inference. Synthese. 2013; 190: 1555–1585.

[41] Mayo, D.G. and Spanos, A. Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. The British Journal for the Philosophy of Science. 2006; 57: 323–357.

[42] Fisher, R.A. The Design of Experiments. Edinburgh: Oliver and Boyd; 1935.

[43] Cox, D.R. Foundations of statistical inference: the case for eclecticism. Australian Journal of Statistics. 1978; 20: 43–59.

[44] Chatterjee, S.K. Statistical Thought: A Perspective and History. Oxford: Oxford University Press; 2002.

[45] Hacking, I. Logic of Statistical Inference. Cambridge: Cambridge University Press; 1965.

[46] Casella, G. and Berger, R.L. Statistical Inference, 2nd ed. CA: Duxbury; 2002.

[47] Shao, J. Mathematical Statistics, 2nd ed. NY: Springer; 2003.

[48] Silvey, S.D. Statistical Inference. London: Chapman & Hall; 1975.

## **A Comparison Study on Performance of an Adaptive Filter with Other Estimation Methods for State Estimation in High-Dimensional System**

Hong Son Hoang and Remy Baraille

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/67005

#### Abstract

In this chapter, a performance comparison between the adaptive filter (AF) and other estimation methods, especially the variational method (VM), is given in the context of the data assimilation problem for dynamical systems with (very) high dimension. The emphasis is put on the importance of the innovation approach, which is the basis for the construction of the AF as well as for the choice of a set of tuning parameters in the filter gain. It will be shown that the innovation representation of the initial dynamical system plays an essential role in providing stability of the assimilation algorithms for stable and unstable system dynamics. Numerical experiments are given to illustrate the performance of the AF.

Keywords: dynamical system, innovation process, filter stability, minimum mean square prediction error, simultaneous stochastic perturbation

## 1. Introduction

Consider the following data assimilation problem: Given the dynamical system

$$\mathbf{x}\_{k+1} = \phi(\mathbf{x}\_k) + w\_k,\tag{1.1}$$

and the observations

$$z\_{k+1} = H\_{k+1} \mathbf{x}\_{k+1} + v\_{k+1}, \\ k = 0, 1, 2, \dots, N \tag{1.2}$$

Here, $x_k \in R^n$ is the system state at instant $k$, $\phi(\cdot): R^n \to R^n$, $z_k \in R^p$ is the observation vector, $H_k \in R^{p \times n}$ is the observation matrix, and $w_k$, $v_k$ are the model and observation noise


sequences which are mutually uncorrelated and uncorrelated with x0. The statistical characteristics of the entering random variables are given as

$$\begin{aligned} E[\mathbf{x}\_0] &= \overline{\mathbf{x}}\_0, E[\mathbf{x}\_0 \mathbf{x}\_0^T] = M\_0, \\ E[\boldsymbol{w}\_k] &= \mathbf{0}, E[\boldsymbol{w}\_k \boldsymbol{w}\_l^T] = \delta\_{kl} \boldsymbol{Q}, \\ E[\boldsymbol{v}\_k] &= \mathbf{0}, E[\boldsymbol{v}\_k \boldsymbol{v}\_l^T] = \delta\_{kl} \mathbf{R}, E[\boldsymbol{w}\_k \boldsymbol{v}\_l] = \mathbf{0}, \\ E[(\mathbf{x}\_0 - \overline{\mathbf{x}}\_0) \boldsymbol{w}\_k^T] &= \mathbf{0}, E[(\mathbf{x}\_0 - \overline{\mathbf{x}}\_0) \boldsymbol{v}\_k^T] = \mathbf{0}. \end{aligned} \tag{1.3}$$

The problem we consider here is to estimate the system state $x_k$ under the conditions that the dimension $n$ of $x_k$ is of order $10^6$–$10^8$ and that there are uncertainties in the statistics of the model and observational noises. Due to the very large $n$, it is impossible to apply traditional estimation algorithms to produce the estimate $\hat{x}_k$, and that is the reason why different approximation algorithms exist for solving this estimation problem. Theoretically, the optimal in mean square error (MSE) estimate $\hat{x}_{k/N}$ based on the set of observations $Z[1, N] := \{z_1, \dots, z_N\}$ is a filtered estimate for $N = k$ and a smoothed estimate for $N > k$ [1, 2]. For the linear dynamical system Eqs. (1.1) and (1.2), the computation of $\hat{x}_k := \hat{x}_{k/k}$ can be performed efficiently using the Kalman filter (KF) [3], which is a sequential procedure. The KF also provides the equations for computing the estimation errors. If we are interested in obtaining $\hat{x}_{k/N}$, the best estimate of $x_k$ based on $Z[1, N]$, the Kalman smoother can serve as an efficient algorithm for its computation. The KF approach, however, is inappropriate for solving estimation problems in high-dimensional systems. In this chapter, a high-dimensional system means a system whose state dimension is of order $10^6$–$10^8$. At present and in the near future, computer capacity, in both computational power and memory, remains far from sufficient to implement the KF in real time to produce the filtered estimate and the corresponding forecast. For suboptimal schemes for atmospheric data assimilation based on the KF, see Ref. [4].
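For concreteness, the standard KF recursion just mentioned can be sketched in a few lines of Python (NumPy, with small made-up dimensions); this is only an illustrative sketch and not an implementation suited to the high-dimensional setting considered in this chapter.

```python
import numpy as np

def kalman_filter(z, Phi, H, Q, R, x0, M0):
    """Standard KF for x_{k+1} = Phi x_k + w_k, z_k = H x_k + v_k.

    z: (N, p) array of observations; returns the (N, n) filtered estimates.
    Toy sketch only: propagating the full n x n ECM is exactly what becomes
    infeasible when n is of order 1e6-1e8.
    """
    n = x0.shape[0]
    x_hat, M, out = x0.copy(), M0.copy(), []
    for zk in z:
        x_pred = Phi @ x_hat                          # one-step prediction
        M = Phi @ M @ Phi.T + Q                       # prediction ECM
        zeta = zk - H @ x_pred                        # innovation
        K = M @ H.T @ np.linalg.inv(H @ M @ H.T + R)  # filter gain
        x_hat = x_pred + K @ zeta                     # filtered estimate
        M = (np.eye(n) - K @ H) @ M                   # filtered ECM
        out.append(x_hat)
    return np.array(out)

# toy usage with made-up matrices
rng = np.random.default_rng(0)
n, p, N = 3, 2, 20
Phi, H = 0.95 * np.eye(n), rng.standard_normal((p, n))
Q, R = 0.01 * np.eye(n), 0.1 * np.eye(p)
truth = np.cumsum(0.1 * rng.standard_normal((N, n)), axis=0)
z = truth @ H.T + 0.3 * rng.standard_normal((N, p))
print(kalman_filter(z, Phi, H, Q, R, np.zeros(n), np.eye(n)).shape)
```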

In this chapter, the emphasis is put principally on the comparison of the AF with the VM. For a review of data assimilation methods in meteorology and oceanography, see Ref. [5]. To further illustrate the advantages of the AF, we implement the extended KF (EKF) [6] in Section 6 and compare its performance with that of the AF (the experiment with the Lorenz system). The Cooper-Haines filter (CHF) [7], widely used in data assimilation in oceanography, is also applied in Section 7 to produce the estimate of the ocean state. It serves as a reference to be compared with the estimate produced by the AF in the high-dimensional setting.

In the next section, the variational method (VM), which is widely used in data assimilation for high-dimensional systems in meteorology and oceanography, is outlined. Section 3 presents the recently developed AF approach to data assimilation. The main idea of the AF is to take the innovation representation of the input-output system as a departure point to formulate the optimization problem, with the parameters of the filter gain as control variables. Section 4 presents the tools to implement the AF in a simple and efficient way adapted to the high-dimensional setting. This includes the objective function, filter stability, the structure of the error covariance matrix (ECM), the gain parameterization, and the optimization algorithm known as simultaneous perturbation stochastic approximation (SPSA). It is shown how the ECM is estimated using an ensemble of samples of the prediction error (PE) and the hypothesis on separation of vertical and horizontal variables structure (SeVHS) in the ECM. A computational comparison between the VM and AF is also given here. Section 5 presents a simple numerical experiment showing in detail how the VM and AF work, the difficulties of the VM in searching for an optimal solution, and why no similar difficulties are encountered in the AF. A more complicated experiment with the chaotic Lorenz system is carried out in Section 6. The difficulty encountered there is the extreme sensitivity of its solution to small errors in the initial condition. Section 7 presents the performance of different filters in a data assimilation experiment with the high-dimensional ocean model MICOM in the North Atlantic configuration. The conclusions are given in Section 8.

Notation: in this chapter, $A^T$ denotes the transpose of the matrix $A$; $E[\cdot]$ and $E[\cdot|\cdot]$ denote the expectation and conditional expectation, respectively; $\|A\|_F$ denotes the Frobenius norm of a matrix $A$.

## 2. Variational method (VM)


Consider the problem of estimating {xk} in Eqs. (1.1)–(1.3). The VM consists of minimizing the following objective function

$$J[\mathbf{x}\_0, \dots, \mathbf{x}\_N] = \mathbf{e}\_0^T M\_0^{-1} \mathbf{e}\_0 + \sum\_{k=1}^N (\mathbf{z}\_k - H\_k \mathbf{x}\_k)^T R^{-1} (\mathbf{z}\_k - H\_k \mathbf{x}\_k), \quad \mathbf{e}\_0 := \mathbf{x}\_0 - \overline{\mathbf{x}}\_0,\tag{2.1}$$

$$J[\mathbf{x}\_0, \dots, \mathbf{x}\_N] \to \min\_{\left[\mathbf{x}\_0, \dots, \mathbf{x}\_N\right]} \tag{2.2}$$

under the constraints Eq. (1.1). (2.3)

Thus, in the VM, we seek optimal solutions in a functional space (the space of functions $\{x_k\}$). For systems of high dimension, this task is impossible to perform, and a simplification is required. Suppose the system Eq. (1.1) is linear and perfect, that is,

$$\mathbf{x}\_{k+1} = \Phi\_k \mathbf{x}\_k, k = 0, 1, \dots \tag{2.4}$$

Expressing all xk as functions of the initial state x0,

$$\mathbf{x}\_{k} = \Phi(k, \mathbf{0})\mathbf{x}\_{0},$$

$$\Phi(k, l) = \Phi\_{k-1}\dots\Phi\_{l}, (k > l), \Phi(k, k) = I,\tag{2.5}$$

where $I$ is the identity matrix of appropriate dimension, and substituting $x_k$, $\forall k$, from Eq. (2.5) into Eq. (1.2), at each $k$th observation instant the following set of observations is available for $x_0$:

$$\boldsymbol{z}\_k^1 = H\_k^1 \mathbf{x}\_0 + \boldsymbol{v}\_k^1, \quad k = 1, 2, \dots$$

$$H\_k^1 := [(H\_1 \boldsymbol{\Phi}(\mathbf{1}, \mathbf{0}))^T, \dots, (H\_k \boldsymbol{\Phi}(k, \mathbf{0}))^T]^T,\tag{2.6}$$

$$\boldsymbol{v}\_k^1 = [\boldsymbol{v}\_1^T, \dots, \boldsymbol{v}\_k^T]^T.$$

Under the assumption of a perfect model, the optimization problem Eqs. (2.1)–(2.3) simplifies to

$$J[\mathbf{x}\_0] \to \min\_{\left[\mathbf{x}\_0\right]},\tag{2.7}$$

$$J[\mathbf{x}\_0] := \mathbf{e}\_0^T \mathbf{M}\_0^{-1} \mathbf{e}\_0 + \sum\_{k=1}^N (\mathbf{z}\_k \mathbf{-} \boldsymbol{H}\_k^\prime \mathbf{x}\_0)^T \mathbf{R}\_k^{-1} (\mathbf{z}\_k \mathbf{-} \boldsymbol{H}\_k^\prime \mathbf{x}\_0), \tag{2.8}$$

$$\boldsymbol{H}\_k^\prime := \boldsymbol{H}\_k \boldsymbol{\Phi}(\mathbf{k}, \mathbf{0}).$$

We now have the unconstrained optimization problem Eqs. (2.7) and (2.8) with the vector of unknown parameters $\theta := x_0$ (the initial state). This problem can be solved using standard optimization techniques [8].
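To illustrate this reduction on a toy scale, the following sketch (Python/SciPy, with made-up matrices and dimensions that bear no relation to an operational model) minimizes the objective of Eq. (2.8) directly over the initial state with a generic optimizer; it is only meant to show the structure of the problem, not an operational implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, N = 4, 2, 50                                  # toy dimensions (illustrative only)
Phi = np.eye(n) + 0.01 * rng.standard_normal((n, n))
H = rng.standard_normal((p, n))
M0, R = np.eye(n), 0.1 * np.eye(p)
x0_bar = np.zeros(n)

# synthetic truth and observations under the perfect model Eq. (2.4)
x_true0 = rng.standard_normal(n)
xs, x = [], x_true0
for _ in range(N):
    x = Phi @ x
    xs.append(x)
z = [H @ xk + 0.1 * rng.standard_normal(p) for xk in xs]

def J(x0):
    """Objective of Eq. (2.8): background term plus sum of observation misfits."""
    e0 = x0 - x0_bar
    val = e0 @ np.linalg.solve(M0, e0)
    xk = x0
    for zk in z:
        xk = Phi @ xk                               # H'_k x0 = H_k Phi(k, 0) x0
        r = zk - H @ xk
        val += r @ np.linalg.solve(R, r)
    return val

res = minimize(J, x0_bar, method="L-BFGS-B")        # finite-difference gradients here
print(res.x, x_true0)
```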

It is not hard to write out a solution to the problem Eqs. (2.7) and (2.8). For high-dimensional systems, however, there are no computational and memory resources to handle such an implementation. In practice, the solution to the problem Eqs. (2.7) and (2.8) is found by iteratively solving the equation

$$\nabla\_{\theta} J[\theta] = \mathbf{0}, \quad \nabla\_{\theta} J[\theta] := \left[ \partial J/\partial \theta\_1, \dots, \partial J/\partial \theta\_n \right]^T. \tag{2.9}$$

Comment 2.1. Usually, finding a solution to Eqs. (2.7) and (2.8) is a heavy task: in addition to storing the model solution produced by the direct model, minimization requires 20–30 iterations to reach a relatively good approximate solution.

Comment 2.2. Writing out $\nabla_\theta J[\theta]$ shows that solving Eq. (2.9) requires

$$\left( H\_k' \right)^T y = \Phi\_k^T \Phi\_{k-1}^T \dots \Phi\_1^T H\_k^T y \tag{2.10}$$

for some $y$. As $\Phi_k^T$ is impossible to store, the approach known as the adjoint equation (AE) is used, which requires constructing AE code for computing the product $\Phi_k^T y$. Each iteration in the minimization of Eqs. (2.7) and (2.8) thus requires one integration of the model over the assimilation period, followed by one adjoint integration. The cost of one adjoint integration is about twice the cost of one direct integration, so that one minimization requires the equivalent of between 50 and 100 integrations of the model over the assimilation period (p. 205 of [9]). In the next section, we see that SPSA can also be used to solve this problem at a much lower cost.
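In terms of code, the adjoint computation of Eq. (2.10) amounts to applying the one-step transposes in reverse order after $H_k^T$. A minimal sketch (Python, assuming the one-step matrices $\Phi_\kappa$ are explicitly available as arrays, which is not the case for a real model, and using the convention $\Phi(k, 0) = \Phi_{k-1}\cdots\Phi_0$ of Eq. (2.5)) is:

```python
import numpy as np

def adjoint_apply(Phis, H, y):
    """Compute Phi(k, 0)^T H^T y without ever forming H'_k = H Phi(k, 0).

    Phis: list [Phi_0, ..., Phi_{k-1}] of one-step transition matrices,
    applied in reverse (adjoint) order after H^T.
    """
    w = H.T @ y
    for Phi in reversed(Phis):        # backward (adjoint) integration
        w = Phi.T @ w
    return w

# toy usage with made-up matrices
rng = np.random.default_rng(4)
Phis = [np.eye(3) + 0.01 * rng.standard_normal((3, 3)) for _ in range(5)]
H = rng.standard_normal((2, 3))
y = rng.standard_normal(2)
print(adjoint_apply(Phis, H, y))
```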

Comment 2.3. As $\theta$ (the initial state) has a physical meaning, it is important to introduce constraints enforcing an appropriate, physically realistic structure of the correction for $\theta$ during the estimation process. A poor (in the physical sense) structure of the guess for the initial state can lead to large estimation errors.

## 3. Adaptive filtering (AF)


To overcome the difficulties listed in Comments 2.1–2.3, an adaptive filter (AF) has been proposed in [10]. The main difference between the AF and the VM lies in the choice of the innovation representation of the original input-output system Eqs. (1.1) and (1.2) as a departure point for formulating the optimization problem. It is well known that under standard conditions, the MSE-optimal estimate $\hat{x}_k$ can be obtained by the KF. As the innovation process for the system output in the KF forms a white sequence, Kailath [11] developed, in an elegant way, an innovation approach to derive the optimal filter for more general linear systems, such as nonstationary systems and filtering problems with Markovian processes for the model and observation errors. The innovation approach to linear least-squares approximation problems is first to "whiten" the output data and then to treat the resulting simpler white-noise observation problem. Consider the observation (output) sequence $z_k$. The innovation process associated with $z_k$ is written as

$$\zeta\_k = z\_k - \mathbb{E}[z\_k | z\_{k-1}^1],\tag{3.1}$$

Under standard conditions (Gaussianity, uncorrelated noise sequences, …), $E[z_k | z^1_{k-1}] = H_k\hat{x}_{k/k-1}$; hence

$$
\zeta\_k = z\_k - H\_k \hat{\mathbf{x}}\_{k/k-1}, \hat{\mathbf{x}}\_{k/k-1} = \Phi\_{k-1} \hat{\mathbf{x}}\_{k-1}, \tag{3.2}
$$

where $\hat{x}_{k/k-1}$ is the MSE-optimal one-step-ahead prediction of $x_k$ given $z^1_{k-1}$. Using $\zeta_k$ instead of $z_k$, one can write out the formula for the estimate $\hat{x}_k$ and the KF under standard conditions. The filter has the form

$$\hat{\mathbf{x}}\_k = \Phi\_k \hat{\mathbf{x}}\_{k-1} + \mathbf{K}\_k \zeta\_k,$$

$$\mathbf{K}\_k = \mathbf{M}\_k \mathbf{H}\_k^T \left[ \mathbf{H}\_k \mathbf{M}\_k \mathbf{H}\_k^T + \mathbf{R}\_k \right]^{-1} \tag{3.3}$$

where $M_k$ is the ECM of the prediction $\hat{x}_{k/k-1}$. This matrix is found as a solution to the Riccati equation

$$M\_k = \Phi\_k P\_k \Phi\_k^T + Q\_k,\\ P\_k = [I - K\_k H\_k] M\_k. \tag{3.4}$$

Due to the very expensive computational burden of time stepping the ECM $M_k$ in Eq. (3.4), as well as insufficient memory storage, the KF is impractical for solving data assimilation problems in a very high-dimensional setting. The idea of the AF is based on the fact that when the filter is optimal, the innovation $\zeta_k$ has minimum variance. If we assume that the gain $K_k$ belongs to a set of parameterized gains, that is,

$$K\_k = K\_k(\theta), \theta \in \Theta,\tag{3.5}$$

the optimal AF can be sought within some class of parameterized filters of a given structure. The following objective function is introduced

$$J(\theta) = E[\Psi(\theta)] \to \min\_{\theta \in \Theta}, \quad \Psi(\zeta\_k) = \|\zeta\_k\|\_{\Sigma\_k^{-1}}^2, \quad \|\zeta\_k\|\_{\Sigma\_k^{-1}}^2 := \langle \zeta\_k, \Sigma\_k^{-1}\zeta\_k \rangle. \tag{3.6}$$

In Ref. [12], the different classes of parameterized filters are found which belong to the class of stable reduced-order filters (ROF) [10, 13].

As an example for one class of ROFs, consider

$$K\_k = P\_{r,k} K\_{e,k} \tag{3.7}$$

where $K_{e,k}: R^p \to R^{n_e}$ represents the gain mapping the innovation vector from the observational space to the reduced space $R^{n_e}$ of dimension $n_e \le n$, and $P_{r,k}$ maps the reduced space $R^{n_e}$ to the full space $R^n$. The choice of the reduced space is of primary importance, since the main characteristic of the filter, known as stability, depends on it. As proved in [12], under a detectability condition, stability of the filter is ensured by forming the columns of $P_{r,k}$ from unstable and stable eigenvectors (or singular vectors, Schur vectors) of the fundamental matrix $\Phi_k$, and one can choose

$$K\_{e,k} = H\_{e,k}^T \left[ H\_{e,k} H\_{e,k}^T + R\_k \right]^{-1}, \quad H\_{e,k} := H\_k P\_{r,k}, \tag{3.8}$$

One class of parameterized filters is (Section 5.2.2 in Ref. [12])

$$K\_k(\theta) = P\_{r,k} \Lambda K\_{e,k}(\theta), \quad \Lambda = \mathrm{diag}\left[\theta\_1, \dots, \theta\_{n\_e}\right], \quad 1 - 1/|\varphi\_i| < \dot{\theta}\_1(i) \le \theta\_i \le \dot{\theta}\_2(i) < 1 + 1/|\varphi\_i| \tag{3.9}$$

if φ<sup>i</sup> is an unstable or neutral eigenvector of Φ. For the stable φ<sup>i</sup> , we have

$$0 < \dot{\theta}\_1(i) \le \theta\_i \le \dot{\theta}\_2(i) < 2. \tag{3.10}$$
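As an illustration of this gain parameterization, a sketch of assembling $K_k(\theta) = P_{r,k}\Lambda K_{e,k}$ from Eqs. (3.7)–(3.9) is given below (Python; $P_r$, $H$, $R$ and the reduced dimension $n_e$ are made-up placeholder quantities, and the choice $\theta_i = 1$ is just one point inside the stability interval).

```python
import numpy as np

def rof_gain(Pr, H, R, theta):
    """Reduced-order gain K(theta) = Pr * diag(theta) * Ke, Eqs. (3.7)-(3.9).

    Pr:    (n, ne) columns spanning the chosen reduced subspace,
    H:     (p, n) observation operator, R: (p, p) observation ECM,
    theta: (ne,) tuning parameters, each kept inside the interval that
           guarantees filter stability (Eqs. (3.9)-(3.10)).
    """
    He = H @ Pr                                    # Eq. (3.8): He = H Pr
    Ke = He.T @ np.linalg.inv(He @ He.T + R)       # reduced-space gain
    return Pr @ (np.diag(theta) @ Ke)              # K(theta), Eq. (3.9)

# toy usage with made-up sizes
n, ne, p = 10, 3, 4
rng = np.random.default_rng(1)
Pr = np.linalg.qr(rng.standard_normal((n, ne)))[0]
H = rng.standard_normal((p, n))
R = 0.1 * np.eye(p)
K = rof_gain(Pr, H, R, np.ones(ne))
print(K.shape)                                     # (n, p): maps innovations to the full space
```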

## 4. Differences between VM and AF

#### 4.1. Batch data formulation

We now list the main differences between the two approaches, VM and AF, from which the advantages of the AF over the VM become clear.

To make the comparison between the two approaches easier, let us write out the objective function Eq. (3.6) using a representation in the sample space

$$J(\theta) \approx J\_N(\theta) = \frac{1}{N} \sum\_{k=1}^N (z\_k - H\Phi\_k \hat{x}\_{k-1}(\theta))^T \Sigma\_k^{-1} (z\_k - H\Phi\_k \hat{x}\_{k-1}(\theta)), \quad J\_N(\theta) \to \min\_{\theta \in \Theta}. \tag{4.1}$$

Note that in the practical implementation of the AF, the optimization algorithm is constructed on the basis of Eq. (3.6), not Eq. (4.1). This is because Eq. (4.1) is written in a batch form, which requires optimization over the whole time interval $[1, N]$ and results in a very high computational burden. Minimizing Eq. (3.6) allows the SPSA method to be applied, which is much less demanding in both computation and memory. The main differences between the VM and AF are listed below:


(D1) Dynamical system (DS): in Eq. (2.1), the DS is the initial system Eq. (1.1); in Eq. (4.1), the DS is the filtering Eq. (3.3). This difference has an interesting consequence: even if in practice very little is known about the statistics of $w_k$, the sequence $\zeta_k$ is observed, and hence it is possible to estimate the statistics of $\zeta_k$.

(D2) The system noise $w_k$ in Eq. (1.1) is white, while in Eq. (3.3), $\zeta_k$ is a white sequence only if the filter is optimal. This allows us to easily apply statistical tests to verify the optimality of the assimilation procedure (a simple whiteness check of this kind is sketched after this list).

(D3) Control variable x<sup>0</sup> in the VM is the initial state, whereas the control variable in Eq. (3.6) is the parameter vector θ.

This difference has an important consequence: as $x_0$ has a precise physical meaning (depending, for example, on the ocean domain of interest), the structure of the guess $\theta^0 := \hat{x}_0^0$ for the initial state, as well as of the correction $\delta\hat{x}_0^\nu$ generated by the iterative algorithm, must be chosen carefully so that at each iteration $\nu$ the estimate $\hat{x}_0^\nu = \hat{x}_0^{\nu-1} + \delta\hat{x}_0^\nu$ has a physically realistic structure. This is not an easy task. On the other hand, in the AF the parameters usually are immaterial [see $\theta$ in Eq. (3.10)]; hence, the choice of structure for $\theta$ is of no importance.

(D4) Suppose the DS Eq. (1.1) is unstable. This implies that the error in estimating $x_0$ will grow during integration of the direct model and the AE. As for the AF, by construction the filtering system Eq. (3.3) remains stable. This can be seen by representing the filtering Eq. (3.3) through its fundamental matrix $L_k$,

$$\hat{\mathbf{x}}\_k = L\_k \hat{\mathbf{x}}\_{k-1} + K\_k \mathbf{z}\_k, \quad L\_k = (I - K\_k H\_k)\Phi\_k. \tag{4.2}$$

As shown in Ref. [12], the filter Eq. (4.2) is stable under the conditions Eqs. (3.9) and (3.10). This means that the filtering error remains bounded during model integration, since the parameters $\theta_i$ lie in the interval guaranteeing stability of the filter Eq. (4.2) (see the sketch following (D5)).

(D5) Return to the objective function Eqs. (2.1)–(2.3). Taking the derivative of the objective function Eqs. (2.1)–(2.3) with respect to $x_0$, we have

$$\begin{aligned} \frac{1}{2} \nabla\_{x\_0} J[\hat{\mathbf{x}}\_{0}^{\nu}] &= M\_{0}^{-1} \mathbf{e}\_{0}^{\nu} - \sum\_{k=1}^{N} \Phi^{T}(k, 0) H\_{k}^{T} R\_{k}^{-1} (\mathbf{z}\_{k} - H\_{k} \Phi(k, 0) \hat{\mathbf{x}}\_{0}^{\nu}) \\ &= M\_{0}^{-1} \mathbf{e}\_{0}^{\nu} - \sum\_{k=1}^{N} \Phi^{T}(k, 0) H\_{k}^{T} R\_{k}^{-1} (H\_{k} \Phi(k, 0) \mathbf{e}\_{0}^{\nu} + \mathbf{v}\_{k}), \quad \mathbf{e}\_{0}^{\nu} := \mathbf{x}\_{0} - \hat{\mathbf{x}}\_{0}^{\nu}. \end{aligned} \tag{4.3}$$

One sees that for a batch of $N$ observations, Eq. (4.3) requires the computation of $N$ terms (not counting the term $M_0^{-1}e_0^\nu$). The $k$th term is associated with the assimilation instant $k$: one first needs to compute $\mu_k := \Phi(k, 0)e_0^\nu$, that is, to integrate the direct model $\Phi_\kappa$ $k$ times for $\kappa = 1,\dots,k$, and next to integrate backward ($k$ times as well) the AE $\Phi_\kappa^T$ from $\kappa = k$ to $\kappa = 1$, that is, to compute $\Phi^T(k, 0)H_k^T R_k^{-1}(H_k\mu_k + v_k)$. The larger the $k$, the bigger the amplification of the initial error $e_0^\nu$ and of the observation error $v_k$. The error $e_0^\nu$ is amplified doubly, since it is integrated by both the direct and adjoint models. But the amplification of $v_k$ (and of $w_k$ when $w_k \neq 0$) is the most worrying, since it enters the gradient estimate and can make the gradient direction completely erroneous.
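To make (D2) and (D4) concrete, the sketch below (Python, with made-up data) implements the two diagnostics referred to above: a whiteness check on the innovation sequence and a check of the spectral radius of the filter fundamental matrix $L = (I - KH)\Phi$ of Eq. (4.2).

```python
import numpy as np

def innovation_autocorrelation(zeta, max_lag=10):
    """Sample autocorrelation of a scalar innovation sequence (D2).

    If the filter is close to optimal, the innovations are nearly white,
    so these coefficients should be negligible at all nonzero lags."""
    zeta = np.asarray(zeta, dtype=float) - np.mean(zeta)
    var = np.dot(zeta, zeta) / len(zeta)
    return np.array([np.dot(zeta[:-lag], zeta[lag:]) / (len(zeta) * var)
                     for lag in range(1, max_lag + 1)])

def filter_spectral_radius(Phi, H, K):
    """Spectral radius of L = (I - K H) Phi, Eq. (4.2) (D4); < 1 means a stable filter."""
    L = (np.eye(Phi.shape[0]) - K @ H) @ Phi
    return np.max(np.abs(np.linalg.eigvals(L)))

# toy usage: white innovations vs. correlated (suboptimal-filter-like) innovations
rng = np.random.default_rng(5)
white = rng.standard_normal(2000)
colored = np.convolve(white, [1.0, 0.8], mode="same")
print(np.abs(innovation_autocorrelation(white)).max(),
      np.abs(innovation_autocorrelation(colored)).max())

# scalar illustration matching Section 5: phi = 1, h = 1, so L = 1 - K
for K in (0.05, 1.0, 1.95, 2.5):
    print(K, filter_spectral_radius(np.array([[1.0]]), np.array([[1.0]]), np.array([[K]])))
```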

#### 4.2. Implementation of AF

#### 4.2.1. The choice of criteria Eq. (3.6)

The choice of Eq. (3.6) is important in many respects for obtaining a simple and efficient data assimilation algorithm. The idea behind the criterion Eq. (3.6) is to select some pertinent parameters as control variables for minimizing the mean of the cost function $\Psi(\zeta_k)$. For example, for the class of filters Eqs. (3.3), (3.7)–(3.10), the vector $\theta = (\theta_1,\dots,\theta_{n_e})^T$ can be chosen as the control vector for the problem Eq. (3.6).

The solution to the problem Eq. (3.6) can be found iteratively using a stochastic approximation (SA) algorithm

$$
\theta\_{k+1} = \theta\_k - a\_k \nabla\_{\theta} \Psi(\zeta\_{k+1}) \tag{4.4}
$$

where {ak} is a sequence of positive scalars satisfying some conditions to guarantee a convergence of the estimation procedure. The standard conditions are

$$a\_k \to 0, \quad \sum\_{k=1}^{\infty} a\_k = \infty, \quad \sum\_{k=1}^{\infty} a\_k^2 < \infty \tag{4.5}$$

The algorithm Eq. (4.4) is much simpler [compared with the computation of Eq. (4.3)] since it requires, at the $k$th assimilation instant, computing only the gradient of the sample cost function $\Psi(\zeta_k)$. The gradient $\nabla_\theta\Psi(\theta_k)$ of the sample objective function $\Psi(\theta_k)$ can be computed using the AE approach (in what follows, for simplicity, the subscript $k$ will be omitted to shorten the notation).

$$\frac{1}{2}[\delta\Psi(\zeta\_{k+1})]\_{\theta\_k} = -(H\Phi P\_r \delta\Lambda K\_e \zeta\_k, \zeta\_{k+1}) = -(\delta\Lambda K\_e \zeta\_k, \zeta\_{k+1}'), \quad \zeta\_{k+1}' := P\_r^T \Phi^T H^T \zeta\_{k+1} \tag{4.6}$$

Thus, minimization of Eq. (3.6) by a gradient-based SA algorithm requires only one integration of the direct model and one backward integration of the AE code: direct integration of $\hat{x}_k$ to produce the forecast $\hat{x}_{k+1/k} = \Phi\hat{x}_k$, and backward integration $\Phi^T H^T\zeta_{k+1}$ in the computation of $\zeta'_{k+1}$. For the gain structure of Eq. (3.9), the objective function $\Psi$ is quadratic with respect to $\theta$; hence, one can easily find the optimal parameters.

A lower computational burden can be achieved by evaluating only the sample objective function (without a gradient formula): instead of computing the gradient by Eq. (4.6) based on the AE, one can approximate the gradient using values of the cost function [on the basis of a finite-difference scheme (FDSA)]. Traditionally, the $i$th component of the gradient is approximated by


$$\nabla\_{\theta\_i} \Psi(\theta\_k) = g\_i = \left[ \Psi(\theta\_k + c\_k e\_i) - \Psi(\theta\_k - c\_k e\_i) \right] / (2c\_k) \tag{4.7}$$

where ei is the unit vector with 1 in the ith component, 0 otherwise.

It is seen that FDSA algorithms do not require a formula for the gradient. However, for high-dimensional systems ($n \approx O(10^6)$–$O(10^7)$), this algorithm is inapplicable due to the component-wise derivative approximation: to approximate each partial derivative of the cost function, we need to make two integrations of the direct model.

In order to overcome the difficulties with a very high dimension of $\theta$, the class of algorithms known as simultaneous perturbation SA (SPSA) has recently received great interest [14, 15]. The SPSA algorithm has the same structure as the FDSA Eq. (4.7), with the difference residing in the way all the components of $\theta$ are perturbed stochastically and simultaneously. Concretely, let $\Delta_k = (\Delta_{k,1},\dots,\Delta_{k,n})^T$ be a random vector whose components $\Delta_{k,i}$, $i = 1,\dots,n$, are Bernoulli independent identically distributed (iid). The gradient of the objective function is estimated as

$$\begin{aligned} \nabla\_{\theta} \Psi(\theta\_k) &= \mathbf{g} = (\mathbf{g}\_1, \dots, \mathbf{g}\_n)^T, \\ \mathbf{g} &= [\Psi(\theta\_k + \mathbf{c}\_k \Delta\_k) \mathbf{-} \Psi(\theta\_k \mathbf{-c}\_k \Delta\_k)] \Delta\_k^{-1} / (2\mathbf{c}\_k), \\ \Delta\_k &= (\Delta\_{k,1}, \dots, \Delta\_{k,n})^T, \Delta\_k^{-1} := (1/\Delta\_{k,1}, \dots, 1/\Delta\_{k,n})^T. \end{aligned} \tag{4.8}$$

It is seen that in the SPSA, all the directions are perturbed at the same time (the numerator is identical in all $n$ components). Thus, SPSA uses only two (or three) integrations of the model, independently of the dimension of $\theta$, which makes it applicable to high-dimensional optimization problems. Generally, SPSA converges in the same number of iterations as FDSA, and it follows approximately the steepest descent direction, behaving like the gradient method [14]. On the other hand, SPSA, with its random search direction, does not follow the gradient path exactly. On average, though, it nearly tracks the gradient, because the gradient approximation is an almost unbiased estimator of the gradient, as shown in Ref. [15].

For the SPSA algorithm, the conditions for {ak} and {ck} are

$$\begin{aligned} a\_{k} > 0, \quad c\_{k} > 0, \quad a\_{k} \to 0, \quad c\_{k} \to 0, \\ \sum\_{k=1}^{\infty} a\_{k} = \infty, \quad \sum\_{k=1}^{\infty} \left( a\_{k} / c\_{k} \right)^{2} < \infty \end{aligned} \tag{4.9}$$
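A minimal sketch of the SPSA recursion of Eqs. (4.4) and (4.8) applied to a generic noisy objective is given below (Python; the gain-sequence exponents are common textbook choices satisfying Eq. (4.9), not values taken from this chapter, and the toy objective is made up).

```python
import numpy as np

def spsa_minimize(psi, theta0, n_iter=500, a=0.02, c=0.1, alpha=0.602, gamma=0.101, seed=0):
    """Simultaneous perturbation stochastic approximation, Eqs. (4.4), (4.8), (4.9).

    psi: noisy scalar objective; theta0: initial parameter vector.
    Each iteration needs only two evaluations of psi, whatever dim(theta).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(1, n_iter + 1):
        ak, ck = a / k**alpha, c / k**gamma                    # gain sequences, Eq. (4.9)
        delta = rng.choice([-1.0, 1.0], size=theta.shape)      # Bernoulli +/-1 perturbation
        g = (psi(theta + ck * delta) - psi(theta - ck * delta)) / (2 * ck * delta)  # Eq. (4.8)
        theta -= ak * g                                        # Eq. (4.4)
    return theta

# toy usage: noisy quadratic objective in 20 dimensions
rng = np.random.default_rng(1)
target = rng.standard_normal(20)
psi = lambda th: np.sum((th - target) ** 2) + 0.01 * rng.standard_normal()
theta_hat = spsa_minimize(psi, np.zeros(20), n_iter=1000, a=0.02)
print(np.linalg.norm(np.zeros(20) - target), np.linalg.norm(theta_hat - target))
```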

#### 4.2.2. On the operator Pr


As shown in Ref. [12], $\mathrm{span}[P_r]$, the subspace spanned by the columns of $P_r$, must be chosen so that the filter gain $K$ ensures stability of the filter. Note that even the KF may suffer from instability. To ensure filter stability, $P_r$ is constructed from all unstable and neutral eigenvectors of the fundamental matrix $\Phi$ (or real Schur vectors (ScVs), or singular vectors). In practice, we choose $P_r$ to consist of the column vectors of $S$ (called S-PE samples)

$$S = \Phi X \tag{4.10}$$

which are the results of integrating the leading ScVs (the columns of $X$). The columns of $S$ have the meaning of PEs for the system state and are used to approximate the ECM $M$. As for the ScVs, they are preferred to eigenvectors (or singular vectors) because the ScVs are real and their computation is numerically stable. Note also that the computation of singular vectors requires adjoint code. The ensemble of columns of $S$ plays the same role as an ensemble of PE samples in ensemble-based filtering techniques for approximating the background ECM [16, 17].
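As an illustration of how $P_r$ and the S-PE samples of Eq. (4.10) might be formed when $\Phi$ is available as an explicit matrix, the following sketch uses scipy.linalg.schur with eigenvalue sorting (toy matrix with prescribed eigenvalues; in an operational model $\Phi$ is available only through model integrations, and the ensemble-style approximation $M \approx SS^T$ at the end is just one simple option).

```python
import numpy as np
from scipy.linalg import schur

def unstable_schur_basis(Phi):
    """Orthonormal real Schur vectors spanning the unstable subspace of Phi.

    sort='ouc' moves eigenvalues outside the unit circle to the top-left of the
    Schur form, so the first sdim columns of Z span the unstable modes used for Pr.
    """
    T, Z, sdim = schur(Phi, output="real", sort="ouc")
    return Z[:, :sdim]

# toy usage: small matrix with known eigenvalues (two of them unstable)
rng = np.random.default_rng(2)
Q = np.linalg.qr(rng.standard_normal((8, 8)))[0]
Phi = Q @ np.diag([1.3, 1.05, 1.0, 0.9, 0.8, 0.6, 0.4, 0.2]) @ Q.T
X = unstable_schur_basis(Phi)
S = Phi @ X                          # S-PE samples, Eq. (4.10)
M_approx = S @ S.T                   # simple ensemble-style approximation of the ECM
print(X.shape, S.shape)
```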

#### 4.2.3. On separation of vertical and horizontal variables structure in ECM [18]

Let us consider the situation when the DS is described by PDEs. The state vector at time instant $k$ is $x_k = x_k(i, j, l)$, where $(i, j, l)$ represents a grid point in three-dimensional space. Introduce the stabilizing structure for the filter gain [12]

$$\begin{aligned} \mathbf{K}\_k &= \mathbf{M}\_k \mathbf{H}\_k^T [\mathbf{H}\_k \mathbf{M}\_k \mathbf{H}\_k^T + \mathbf{R}\_k]^{-1}, \\ \mathbf{M} &= \mathbf{M}\_d, \mathbf{M}\_d := \mathbf{P}\_r \boldsymbol{\Lambda} \mathbf{P}\_r^T, \end{aligned} \tag{4.11}$$

where $\Lambda$ is symmetric positive definite. As shown in Ref. [12], one can choose $\Lambda$ to be diagonal, with diagonal elements serving to regularize the amplitude of the ECM. The matrix $M$ in Eq. (4.11) is the ECM, and if it is computed on the basis of the Riccati equation (3.4), $K_k$ is the KF gain. For $M = M_d$, the computation of $M$ is realizable if the reduced dimension $n_e$ is not too large. In practice, the ensemble size $n_e$ is of order $O(100)$, which is too small for $M$ to be a good approximation of the true ECM. In Ref. [18], it is assumed that the estimated ECM is a member of the class of ECMs with separation of vertical and horizontal variables structure (SeVHS). Note that this hypothesis is not new and has been used in modeling the ECM in meteorological data assimilation [19]. The optimal ECM is found as a solution of the minimization problem

$$J(\theta) = E\|M\_d - M\_v(\theta\_1)\otimes M\_h(\theta\_2)\|\_F^2, \quad J(\theta) \to \min\_{\theta}, \quad \theta = (\theta\_1^T, \theta\_2^T)^T, \tag{4.12}$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm.

As the number of vertical layers in today's numerical models is of order $O(10)$, all elements of the vertical ECM $M_v$ (collected in $\theta_1$) can be considered as tuning parameters to be estimated. As for $M_h$, it is often chosen in analytical form (e.g., first- or second-order autoregressive models). Parameters such as the correlation length can be selected as components of the control vector $\theta_2$ in $M_h$.

Using dominant real Schur vectors has the advantage that they are real and their computation is stable [12], whereas the computation of eigenvectors can be unstable and the eigenvectors may be complex.
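To make the SeVHS hypothesis concrete, the sketch below builds $M_v \otimes M_h$ from a small vertical ECM and a second-order autoregressive (SOAR)-type horizontal correlation, and evaluates a sample version of the Frobenius objective in Eq. (4.12) against a given $M_d$ (all sizes and parameter values below are made up for illustration).

```python
import numpy as np

def sevhs_ecm(Mv, corr_len, nh):
    """SeVHS covariance M = Mv (x) Mh with a SOAR-type horizontal correlation.

    Mv:       (nv, nv) vertical ECM (its entries play the role of theta_1),
    corr_len: horizontal correlation length (a theta_2-type parameter),
    nh:       number of horizontal grid points.
    """
    d = np.abs(np.subtract.outer(np.arange(nh), np.arange(nh)))
    Mh = (1.0 + d / corr_len) * np.exp(-d / corr_len)   # analytical horizontal model
    return np.kron(Mv, Mh)

def sevhs_objective(Md, Mv, corr_len, nh):
    """Sample version of Eq. (4.12): squared Frobenius distance to Md."""
    return np.linalg.norm(Md - sevhs_ecm(Mv, corr_len, nh), "fro") ** 2

# toy usage: nv = 3 vertical layers, nh = 20 horizontal points
rng = np.random.default_rng(3)
nv, nh = 3, 20
Mv_true = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
Md = sevhs_ecm(Mv_true, corr_len=4.0, nh=nh) + 0.01 * rng.standard_normal((nv * nh, nv * nh))
print(sevhs_objective(Md, Mv_true, 4.0, nh), sevhs_objective(Md, Mv_true, 8.0, nh))
```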

#### 4.3. Computational comparison between VM and AF

We give a brief comparison of the computational burden of the VM and AF algorithms. Table 1 shows the number of elementary arithmetic operations required to implement the VM and AF algorithms based on the AE tool. A smaller number of operations is required for the AF if the SPSA method is used (no need of $\Phi^T$). Here, for simplicity, $n^2$ operations are counted for the product $\Phi^T y$ (the same number as required for $\Phi y$). In the AF, we assume that the full ECM $M$ is used, whereas in the AROF (adaptive ROF), $M := P_rP_r^T$ as shown in Eq. (4.6). These numbers are calculated on the basis of Eqs. (2.1) and (2.3), since these represent the main computational burdens of the two algorithms. The numbers are rounded up to the dimensions of the entry matrices. Here, $N_{ito}$ is the number of iterations required to solve the minimization problem (2.1)–(2.3); $N_{it}$ is the number of iterations required to solve the equation

$$
\Xi y = \zeta\_k, \quad \Xi := [HMH^T + R]. \tag{4.13}
$$

In the VM algorithm, the computation of M<sup>−</sup><sup>1</sup> <sup>0</sup> , <sup>R</sup><sup>−</sup><sup>1</sup> is not taken into account. In Table 1, there is also the number of operations required for the AROF, when the ECM M is given in the product decomposition form <sup>M</sup> <sup>¼</sup> PrP<sup>T</sup> <sup>r</sup> , Pr∈Rn · ne . In this situation, instead of n<sup>2</sup> operations in the AF, we need to perform 2nne operations. For ne << n, much less computational and memory requirements are needed to perform the AROF.


Table 1. Number of elementary arithmetic operations.


To give an idea of how the VM and AF work in practice, consider experiments with two numerical models, MICOM (see Section 7) and HYCOM (Hybrid Coordinate Ocean Model) [20], developed at SHOM, Toulouse, France. The first experiment is performed with the MICOM model. The observations are available every 10 days (ds) during 2 years. For MICOM (state dimension $n = 3 \cdot 10^5$), a 10 ds forecast requires 45 s (supercomputer Caparmor, IFREMER, France, sequential run). The 2-year integration thus takes 54.75 min (73 · 45 s). The AF needs 54.75 min · 3 ≈ 164 min (about 2 h 45 min) to perform the assimilation experiment for the 2-year period. In this context, the VM requires between 5.7 ds and 11.4 ds to perform the experiment (under the hypothesis of 50–100 integrations of MICOM over the 2-year window, Comment 2.2). As for HYCOM (state dimension $n = 7 \cdot 10^7$), the observations are available every 5 ds. A 5 ds forecast requires 1 h (supercomputer Beaufix, Météo-France, parallel run, 62 processors). A 2-year integration requires 146 · 1 h = 146 h (about 6 ds). The AF hence needs about 18 ds for the 2-year experiment. As for the VM, the experiment requires between 304 ds and 608 ds. That is one of the reasons why, in an operational setting, one has to choose a short window for assimilating the observations by the VM.
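The figures quoted for the HYCOM case follow from simple arithmetic, which can be re-derived as follows (the MICOM case involves additional assumptions about the cost per iteration and is not re-derived here):

```python
# back-of-the-envelope check of the HYCOM figures quoted above
forecast_hours = 1.0                 # one 5-day forecast on 62 processors
windows = 2 * 365 // 5               # number of 5-day windows in two years = 146
direct_run_h = windows * forecast_hours
print(direct_run_h / 24)             # ~6.1 days for one 2-year integration
print(3 * direct_run_h / 24)         # AF: ~18 days (roughly 3x one integration)
print(50 * direct_run_h / 24, 100 * direct_run_h / 24)   # VM: ~304-608 days
```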

Comment 4.1. Looking at Table 1, one sees that the dominant numbers $n_d$ of operations in the VM and AF are $n_d(\mathrm{VM}) = n^2N^2N_{ito}$ and $n_d(\mathrm{AF}) = n^2NN_{it}$. If we assume that $N_{ito} \approx N_{it}$ (in fact, the number of iterations for solving the optimization problem Eqs. (2.7) and (2.8) is often larger than the number of iterations for solving the system of equations (4.13), and this is more critical for the VM when the DS is nonlinear), the number $n_d(\mathrm{VM})$ is $N$ times larger than $n_d(\mathrm{AF})$.

## 5. Simple numerical examples: scalar case

#### 5.1. Estimation problem

Consider the simple scalar dynamical process $x(t) = \sin(t)$. As $\dot{x}(t) = \cos(t)$, for $t_k = k\delta t$, using the approximation $x(t_k + \delta t) - x(t_k) = \cos(t_k)\delta t$, one has the following discrete dynamical system

$$\begin{aligned} x\_{k+1} &= \phi\_k x\_k + u\_k + w\_k, \quad \phi\_k = \phi = 1, \quad u\_k = \cos(k\delta t), \\ E(w\_k) &= 0, \quad E(w\_k w\_l) = \sigma\_w^2 \delta\_{kl}, \quad k = 0, 1, \dots, N-1. \end{aligned} \tag{5.1}$$

In Eq. (5.1), $w_k$ represents the Gaussian model error. Suppose that at each moment $t_k$ we observe the state $x_k$ corrupted by the Gaussian noise $v_k$; hence

$$\begin{aligned} z\_k &= h x\_k + v\_k, \quad h = 1, \quad k = 1, \dots, N, \\ E(v\_k) &= 0, \quad E(v\_k v\_l) = \sigma\_v^2 \delta\_{kl} \end{aligned} \tag{5.2}$$

where $\delta_{kl}$ is the Kronecker symbol. Suppose that the true initial state is $x_0^* = 0.5$. The problem we study here is to estimate the system state $x_k$ based on the set of observations $z_k$, $k = 1,\dots,N$.

#### 5.1.1. Experiment: cost functions

In the experiment, δt = 0.01, N = 1000. The two methods, VM and AF, will be implemented to produce the system estimates. We study two situations: (S1) the model is considered perfect, that is, wk = 0; and (S2) there exists a model error wk with variance σw² = 0.001. As to vk, σv² = 0.1 in both cases.

Let wk = 0. To see the advantages of the AF over the VM in finding optimal solutions, Figures 1 and 2 display the curves (time-averaged variance of the distances between the true trajectory and those resulting from varying the control variables) as functions of the tuning parameters in the two methods: the initial state θ := x0 and the parameter θ := λ (see Eq. (3.9)). We remark that for the filtering system (5.1), (5.2), as φ = 1, the system has one stable eigenvalue and one stable eigenvector (the singular value and Schur decompositions are of the same structure). The filter fundamental matrix L = (1 − Kh)φ = (1 − K) is stable if K ∈ (0, 2). For the gain structure (4.11), L is stable for any M > 0 since K = M/(M + σr²) for h = 1 and σv² = 0.1 > 0. We then have K ∈ (0, 1) ⊂ (0, 2). Thus, one can choose θ := M > 0 as a tuning parameter. This structure is of less interest compared to Eq. (3.9) since θ enters K in a nonlinear way and, in fact, K is allowed to vary only in the interval (0, 1). For Eq. (3.9), Pr = 1, He = 1, hence Ke = 1/(1 + σr²) < 1, and the filter is stable if θ satisfies Eq. (3.10). For this structure, the filter is stable for K ∈ (0, 2). We will select the last structure as a departure point to optimize the AF performance.

From Figure 1, it is seen that the "noise-free" curve is equal to 0 when x̂0 = 0.5 for S1, but the "noisy" curve attains its minimal value 0.121 at x̂0 = 0.6. We note that almost the same picture is obtained for the cost function (2.1) (time-averaged variance of the distance between x̂k, k = 1, …, 1000 and the observations zk, k = 1, …, 1000) subject to x̂0 ∈ [−1, 1]. Despite the fact that the curves in Figure 1 are quadratic, it is impossible to find the true initial state in the noisy situation since almost the same curves are obtained for the cost function Eq. (2.1).

Figure 1. VM: cost functions resulting from perfect model and that with a model error.


Figure 2 presents the same curves as those in Figure 1, resulting from application of the filter by letting the parameter θ in the gain vary in θ ∈ (0, 2). Figure 2 shows that for both the "noise-free" and "noisy" curves, the minimal values are attained at θ = 1.1 for both situations S1 and S2. Moreover, the two minimal values are identical. The same picture is observed for the cost function Eq. (4.1) as a function of θ ∈ (0, 2). It means that, independently of whether the model is perfect or not, the AF formulation allows optimization algorithms to find the optimal value for θ and hence to ensure optimality of the filter.

Two curves "noisy" in Figures 1 and 2 show that when the model is noisy, the minimal value of the curve "noisy" (VM) in Figure 1 is much higher (it is equal to 0.121) than that in Figure 2 (0.009, for filtering). This fact is in favour of the choice of a short window for assimilating observations by the VM.

Figure 2. Filtering: cost functions resulting from perfect model and that with model error.
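To make the comparison concrete, the following Python sketch reproduces the spirit of this scalar experiment under the settings quoted above (δt = 0.01, N = 1000, σw² = 0.001, σv² = 0.1). It scans the initial state of a free model run (the VM-type control variable) and a constant gain θ of a simple prediction-correction filter (the AF-type control variable); the exact gain parameterization of Eq. (3.9) is not reproduced, and all names in the sketch are illustrative.

```python
import numpy as np

# A minimal sketch of the scalar experiment in Section 5.1 (assumed settings).
rng = np.random.default_rng(0)
dt, N = 0.01, 1000
sigma_w2, sigma_v2 = 0.001, 0.1            # model and observation error variances
x_true = np.empty(N + 1); x_true[0] = 0.5
w = rng.normal(0.0, np.sqrt(sigma_w2), N)  # set to 0 for the "noise-free" case S1
for k in range(N):                         # x_{k+1} = x_k + cos(k dt) + w_k
    x_true[k + 1] = x_true[k] + np.cos(k * dt) + w[k]
z = x_true[1:] + rng.normal(0.0, np.sqrt(sigma_v2), N)  # z_k = x_k + v_k

def av_model_run(x0):
    """VM-like scan: free model run from a guessed initial state (no assimilation)."""
    x = np.empty(N + 1); x[0] = x0
    for k in range(N):
        x[k + 1] = x[k] + np.cos(k * dt)
    return np.mean((x[1:] - x_true[1:]) ** 2)

def av_filter(theta):
    """AF-like scan: constant-gain filter, analysis = forecast + theta*(z - forecast)."""
    xa, err2 = 0.0, 0.0                    # deliberately wrong initial estimate
    for k in range(N):
        xf = xa + np.cos(k * dt)           # forecast
        xa = xf + theta * (z[k] - xf)      # analysis with gain theta
        err2 += (xa - x_true[k + 1]) ** 2
    return err2 / N

x0_grid = np.linspace(-1.0, 1.0, 41)
theta_grid = np.linspace(0.05, 1.95, 39)
print("best x0   :", x0_grid[np.argmin([av_model_run(a) for a in x0_grid])])
print("best theta:", theta_grid[np.argmin([av_filter(t) for t in theta_grid])])
```

In the noisy case, the scan over θ typically returns a well-defined minimum, whereas the scan over the initial state does not recover the true x0, in line with the behaviour reported for Figures 1 and 2.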

## 6. Numerical experiment: Lorenz system

#### 6.1. Lorenz equations

The Lorenz attractor is a chaotic map, noted for its butterfly shape. The map shows how the state of a dynamical system evolves over time in a complex, non-repeating pattern [21].

The attractor itself and the equations from which it is derived were introduced by Edward Lorenz [21], who derived it from the simplified equations of convection rolls arising in the equations of the atmosphere.

The equations that govern the Lorenz attractor are:

$$
\frac{dy\_1}{dt} = -\sigma(y\_1 - y\_2), \\
\frac{dy\_2}{dt} = \rho y\_1 - y\_2 - y\_1 y\_3, \\
\frac{dy\_3}{dt} = y\_1 y\_2 - \beta y\_3,\tag{6.1}
$$

where σ is called the Prandtl number, and ρ is called the Rayleigh number. All σ, β, ρ > 0, but usually σ = 10, β = 8/3 and ρ is varied. The system exhibits chaotic behavior for ρ = 28 but displays knotted periodic orbits for other values of ρ.

#### 6.2. Numerical model

In the experiments, the parameters σ, ρ, β are chosen to have the values 10, 28, and 8/3 for which the "butterfly" attractor exists.

The numerical model is obtained by applying the Euler method (first-order accurate method) to approximate Eq. (6.1). Symbolically, we have

$$y(t\_{k+1}) = F\left(y(t\_k)\right), \newline y(t\_k) := \left(y\_1(t\_k), y\_2(t\_k), y\_3(t\_k)\right)^T,\tag{6.2}$$

where δt := tk+1 − tk is the model time step. The observations arrive at the moments Tk and ΔTk := Tk+1 − Tk. The experiment setup is similar to that described in Ref. [22].
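As an illustration, a minimal Python sketch of this numerical model is given below: the Lorenz equations (6.1) are advanced with the first-order Euler scheme, and the model operator F is obtained by composing Euler steps over one observation interval. The step sizes and the initial state are taken from the values quoted in Section 6.3; everything else is illustrative.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0   # parameters of Eq. (6.1)

def lorenz_rhs(y):
    """Right-hand side of the Lorenz equations (6.1)."""
    y1, y2, y3 = y
    return np.array([-SIGMA * (y1 - y2),
                     RHO * y1 - y2 - y1 * y3,
                     y1 * y2 - BETA * y3])

def step_euler(y, dt=0.005):
    """One first-order Euler step, Eq. (6.2): y(t_{k+1}) = y(t_k) + dt * f(y(t_k))."""
    return y + dt * lorenz_rhs(y)

def F(y, dT=1.0, dt=0.005):
    """Transition operator of Eq. (6.3): integrate over one assimilation interval dT."""
    for _ in range(int(round(dT / dt))):
        y = step_euler(y, dt)
    return y

y0 = np.array([1.508870, -1.531271, 25.46091])  # true initial state used in Section 6.3
print(F(y0))                                     # state one observation interval later
```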

#### 6.3. Observations: assimilation

The corresponding δt = 0.005, ΔTk = 1, hence the sequence of observations is given by z(k) := z(Tk), k = 1, …, No. The dynamical system corresponding to the transition of the states between two time instants Tk and Tk+1 is denoted as

$$\mathbf{x}\_{k+1} = F(\mathbf{x}\_k) + \mathbf{w}\_k \tag{6.3}$$

In Eq. (6.3), wk simulates the model error. The sequence wk is assumed to be a white noise with variances 2, 12.13 and 12.13 for its three components, respectively. The observation system is then given by

$$\mathbf{z}\_{k} = \mathbf{H}\mathbf{x}\_{k} + \mathbf{v}\_{k} \tag{6.4}$$

where the operator H = [h1ᵀ, h2ᵀ]ᵀ, h1 = (1, 0, 0), h2 = (0, 0, 1), that is, the first and third components x1, x3 are observed at each time instant k = 1, …, 100. The noise sequence vk is white with zero mean and variance R = 2I2, where In is the unit matrix of dimension n. The initial estimate in all filters is given by the initial condition x̂(0) = (1, −1, 24)ᵀ.

The true system state x* is modeled as the solution of Eq. (6.3) subject to x*0 = (1.508870, −1.531271, 25.46091)ᵀ.

The problem considered in this experiment is to apply the extended KF (EKF), nonadaptive filter (NAF), and adaptive filter (AF) to estimate the true system state using the observations zk, k = 1, 2, …, No, and to compare their performances.

Here, the NAF is in fact the prediction error filter (PEF). Note that the PEF is developed in Ref. [23], in which the prediction error ECM is estimated on the basis of an ensemble of PE samples, that is,

$$M = \frac{1}{T - 1} \sum\_{k=1}^{T} B\_s(k), \quad B\_s(k) = \sum\_{l=1}^{L} \delta \mathbf{x}\_k^{(l)} \delta \mathbf{x}\_k^{(l), T},\tag{6.5}$$

where δxk^(l), l = 1, …, L are members of the set of L S-PE samples obtained by L + 1 integrations of the model from the reference state and L perturbed states which grow in the directions of the L dominant Schur vectors.
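The following sketch shows how an estimate of the form (6.5) can be assembled once the PE samples are available. The random perturbations used in the toy call are only placeholders for the actual S-PE samples grown along dominant Schur vectors (Ref. [23]); the function name and array shapes are assumptions made for illustration.

```python
import numpy as np

def ecm_from_pe_samples(dx):
    """
    Estimate the prediction-error covariance matrix of Eq. (6.5).
    dx has shape (T, L, n): L S-PE samples delta x_k^(l) at each of T instants.
    """
    T, L, n = dx.shape
    B = np.zeros((n, n))
    for k in range(T):                   # B_s(k) = sum_l dx^(l) dx^(l)^T
        B += dx[k].T @ dx[k]
    return B / (T - 1)

# toy usage: random "samples" stand in for the Schur-vector-based perturbations
rng = np.random.default_rng(1)
M = ecm_from_pe_samples(rng.normal(size=(100, 5, 3)))
print(M.shape)
```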

The filter gain is taken in the following form


$$K = MH^T \Sigma^{-1}, \Sigma = HMH^T + R \tag{6.6}$$

which is time invariant. At the same time, for comparison purposes, the EKF is also used for assimilating the observations.

Figure 3 shows the evolution of the prediction errors resulting from the three filters: NAF, AF, and EKF. It is no surprise that the NAF has produced estimates with a larger estimation error. By adaptation, however, it is possible to obtain the AF, which improves significantly on the PEF and even behaves better than the EKF.

Note that the VM is much less appropriate for assimilating the observations in the Lorenz model due to the choice of the initial state as the control vector. For simplicity, we simulate the situation when all three components of the system state are observed in additive noise, i.e. with H = I3. Figure 4 displays the time-averaged variances of the difference between the true trajectory and the model trajectory (denoted as AV(x*, x̂)), resulting from varying the third component of the initial state for the two situations of noise-free and noisy models. Namely, we initialize the model by an initial state which is the same as the true one x*0 = (1.508870, −1.531271, 25.46091)ᵀ, with the difference that the third component x̂3(0) varies in the interval [24.5, 26.5]. The global minimum is attained at x*0(3) = 25.46091, as expected. However, if the system is initialized by an estimate in a vicinity, even not so far from x*3(0), there is no guarantee that the VM can approach the true initial condition. For the noisy model, the global minimum is not attained at x*0(3). As for the PEF, the function AV(x*, x̂) is quadratic with respect to the gain parameter for both situations of noise-free and noisy models, as seen in Figure 5: here, the sample cost function (4.1) is computed over the whole assimilation period by varying the third parameter θ3 in the gain (related to the third observed component of the system state).

Figure 3. Prediction errors resulting from three filters: nonadaptive filter (NAF), adaptive filter (AF), and EKF.

Figure 4. Time-averaged variance between the true trajectory and the model trajectory in the VM as a function of the perturbed third component of the initial state. The global minimum is attained at the true initial condition, but there is no guarantee for the VM to approach the true initial state (noise-free model). For the noisy model, the global minimum is not attained at the true initial state. The curve "noisy model" is scaled by the factor C = 1/15.

Figure 5. Cost function in the PEF as a function of the perturbed third gain parameter θ3. It is seen that in the PEF, the cost function is quadratic with respect to the gain parameter in both situations of noise-free and noisy models. The curve "noisy model" is scaled by the factor C = 1/50.

## 7. Assimilation in high-dimensional model

#### 7.1. MICOM model and assimilation problem

In this section, we show how the AF can be designed in a simple way to produce high-performance estimates of the ocean state in the high-dimensional ocean model MICOM. For details on the Miami Isopycnal Coordinate Ocean Model (MICOM) used here, see Ref. [24]. The model configuration is a domain situated in the North Atlantic from 30°N to 60°N and 80°W to 44°W; for the exact model domain and some main features of the ocean current (mean, variability of the SSH, velocity) produced by the model, see Ref. [24]. The grid spacing is about 0.2° in longitude and in latitude, requiring Nh = II · JJ = 25200 (II = 140, JJ = 180) horizontal grid points. The number of layers in the model is KK = 4. It is configured in a flat-bottom rectangular basin (1860 km · 2380 km · 5 km) driven by a periodic wind forcing. The model relies on one prognostic equation for each component of the horizontal velocity field and one equation for mass conservation per layer. We note that the state of the model is x := (h, u, v), where h = h(i, j, lr) is the thickness of the lrth layer and u = u(i, j, lr), v = v(i, j, lr) are the two velocity components. The layer stratification is made in isopycnal coordinates, that is, each layer is characterized by a constant potential density of water. The model is integrated from the state of rest during 20 years. Averaging the sequence of states over years 17 and 18 gives a so-called climatology. During years 19 and 20, every 10 days (10 ds), we calculate the sea surface height (SSH) from the layer thickness h, which will serve as a source for generating observations to be used in the assimilation experiments (in total, there are 72 observations).

#### 7.2. Different filters


The filter used for assimilating SSH observations is of the form

$$\hat{\mathbf{x}}\_{k+1} = F[\hat{\mathbf{x}}\_k] + \mathbf{K} \boldsymbol{\zeta}\_{k+1}, \quad k = 0, 1, \dots \tag{7.1}$$

where x̂k+1 is the filtered estimate for xk+1, xk+1 = [hk+1, uk+1, vk+1] is the system state at the (k + 1)th assimilation instant, F(·) represents the integration of the nonlinear MICOM model over 10 days, K is the filter gain, and ζk+1 is the innovation vector. The gain K is of the form (4.11), where the ECM M will be estimated from the MICOM model. In the experiment, to be close to realistic situations, only the SSH at the grid points i = 1, …, 140, j = 1, …, 180 is collected as observations. Thus, the observations are not available at all model grid points. The gain K is symbolically written as K = (Kh, Ku, Kv)ᵀ, with Ku, Kv representing the operators which produce the correction for the velocity (u, v) from the layer thickness correction Khζ using the geostrophy hypothesis. The filter is thus of reduced order, with the gain Kh to be estimated from S-PE samples.

#### 7.2.1. PEF: computation of ECM

In the experiment, two assimilation methods will be implemented. First, the PEF is designed. To do that, the data ECM Md (see Eqs. (6.5) and (7.4), below) is computed by generating an ensemble of PE samples (as done in the experiment with the Lorenz system; see the sampling procedure in Ref. [23] for more detail). As the number of elements of the ECM is of order 10¹⁰ (for only the layer thickness component h), it is impossible to simulate a sufficient number of PE samples so that Md would be a good estimate for the ECM. The matrix Md will be used only as data to estimate the parameters of a parametrized ECM as follows (see Ref. [18]):

Let M ∈ R^(nh × nh) be the ECM for the layer thickness h, that is, M = M(s, s′). One useful and efficient way to simplify the filter structure is to assume that the ECM M has a SeVHS, that is, that there exist two covariance matrices Mv and Mh such that

$$M(\mathbf{s}, \mathbf{s}') = M\_v(\mathbf{s}\_v, \mathbf{s}\_v') \otimes M\_h(\mathbf{s}\_h, \mathbf{s}\_h'), \quad \mathbf{s}\_v := l, \ \mathbf{s}\_h := (i, j), \tag{7.2}$$

where ⊗ denotes the Kronecker product between two matrices [25],

$$M\_v(\mathbf{s}\_v, \mathbf{s}\_v') \otimes M\_h(\mathbf{s}\_h, \mathbf{s}\_h') = M(i, j, l; i', j', l') = \begin{pmatrix} m\_v(1, 1)M\_h & m\_v(1, 2)M\_h & \dots & m\_v(1, n\_v)M\_h \\\\ m\_v(2, 1)M\_h & m\_v(2, 2)M\_h & \dots & m\_v(2, n\_v)M\_h \\\\ \dots & \dots & \dots & \dots \\\\ m\_v(n\_v, 1)M\_h & m\_v(n\_v, 2)M\_h & \dots & m\_v(n\_v, n\_v)M\_h \end{pmatrix} \tag{7.3}$$

The main advantage of the separability hypothesis is that the number of parameters to be estimated in the covariance matrix is reduced drastically. As a consequence, even an ensemble of PE samples of small size can serve as a large data set for estimating the unknown parameters. This results in a fast convergence of the estimation procedure. In addition, introducing the SeVHS hypothesis makes it possible to avoid the rank-deficiency problem in the estimation of the ECM. In fact, as only a small number of ScVs can be computed in very high-dimensional systems, approximation of the ECM M by Eq. (6.5) results in rank deficiency for M. With such an ECM, the resulting filter will probably produce worse results, not to mention the instability which may occur during the filtering process.
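A small numerical sketch may help to see the saving brought by the separability hypothesis: the horizontal covariance Mh is built from Eq. (7.6), a toy vertical covariance Mv is chosen, and the full ECM of Eqs. (7.2)–(7.3) is obtained as a Kronecker product. The grid sizes and the values of Mv are illustrative only.

```python
import numpy as np

def horizontal_ecm(coords, L_d=25.0):
    """Horizontal covariance M_h(y, y') = exp(-d(y, y')/L_d), Eq. (7.6)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d / L_d)

# tiny illustrative grid and vertical covariance (values are placeholders)
ii, jj = 6, 5
coords = np.array([(i, j) for i in range(ii) for j in range(jj)], dtype=float)
M_h = horizontal_ecm(coords)                    # (30, 30)
M_v = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.6],
                [0.3, 0.6, 1.0]])               # n_v = 3 layers
M = np.kron(M_v, M_h)                           # separable ECM of Eqs. (7.2)-(7.3)
print(M.shape, M_v.size + 1, "parameters instead of", M.size)
```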

Suppose we are given the ensemble of S-PE samples Sτ[L] = [δhτ^(1), …, δhτ^(L)], which are obtained at the time instant τ by applying the sampling procedure in Ref. [23] subject to L perturbations. For τ = 1, …, T, the ECM Md in Eq. (4.12) is estimated as

$$M\_d = \frac{1}{T} \sum\_{\tau=1}^{T} M\_{\tau}, \quad M\_{\tau} := \frac{1}{L} S\_{\tau}[L]\, S\_{\tau}^T[L] \tag{7.4}$$

For the problem Eqs. (4.3)–(4.4), let us define the vector of unknown parameters in the ECM M(s, s′) as


$$\Theta := (c\_{11}, \dots, c\_{1 n\_v}, c\_{21}, \dots, c\_{2 n\_v}, \dots, c\_{n\_v 1}, \dots, c\_{n\_v n\_v}, L\_d)^T, \quad c\_{kl} := m\_v(k, l). \tag{7.5}$$

where ckl := mv(k, l). As to the parameter Ld, it represents the correlation length of the horizontal ECM Mh,

$$M\_h(y, y') = \exp(-d(y, y')/L\_d),\tag{7.6}$$

where d(y, y′) is the distance between two horizontal points y := (i, j) and y′ := (i′, j′).

Considering Md as the data matrix, the optimization problem for determining the vector Θ reads

$$J[\Theta] = E[\Psi(M\_{\tau}, \Theta)] \to \min\_{\Theta}, \quad \Psi(M\_{\tau}, \Theta) := \|M\_{\tau} - M\_v(\mathbf{s}\_v, \mathbf{s}\_v') \otimes M\_h(\mathbf{s}\_h, \mathbf{s}\_h')\|\_F^2. \tag{7.7}$$

Note that the problem Eq. (7.7) is closely related to the Nearest Kronecker Product (NKP) problem [25]. In the experiment, the correlation length Ld is not estimated and is taken identical in the two filters PEF and CHF, Ld = 25.
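For a fixed Mh (i.e. a fixed Ld, as in the experiment), the minimization (7.7) decouples over the blocks of the data matrix, and each coefficient mv(k, l) is obtained by projecting the corresponding block onto Mh in the Frobenius sense, as in the nearest Kronecker product problem. The sketch below illustrates this; it is a simplified reading of the procedure, not the implementation used in the chapter.

```python
import numpy as np

def fit_vertical_ecm(M_d, M_h, n_v):
    """
    Least-squares fit of the vertical coefficients m_v(k, l) in Eq. (7.7)
    for a fixed horizontal covariance M_h (L_d kept fixed, as in the text).
    Each (k, l) block of M_d is projected onto M_h in the Frobenius sense.
    """
    n_h = M_h.shape[0]
    denom = np.sum(M_h * M_h)                    # ||M_h||_F^2
    M_v = np.empty((n_v, n_v))
    for k in range(n_v):
        for l in range(n_v):
            block = M_d[k*n_h:(k+1)*n_h, l*n_h:(l+1)*n_h]
            M_v[k, l] = np.sum(block * M_h) / denom
    return 0.5 * (M_v + M_v.T)                   # symmetrize

# quick self-check: M_v is recovered exactly when M_d is exactly separable
rng = np.random.default_rng(2)
M_h = np.exp(-np.abs(np.subtract.outer(np.arange(20.), np.arange(20.))) / 25.0)
A = rng.normal(size=(3, 3)); M_v_true = A @ A.T
M_d = np.kron(M_v_true, M_h)
print(np.allclose(fit_vertical_ecm(M_d, M_h, 3), M_v_true))
```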

#### 7.2.2. PEF: computation of gain


As to the computation of the gain, introduce the notation: at the instant k, let x(i, j, l) be the value of the system state at the grid point (i, j, l). Let $\vec{x} = (\vec{x}\_1^T, \vec{x}\_2^T, \dots, \vec{x}\_{n\_v}^T)^T$ be a vector representation for x, where $\vec{x}\_l$ is a vector whose components are the values of x at all the horizontal grid points (ordered in some way) at the lth vertical layer.

Consider the ECM Eq. (3.10) and the observation equation (1.2). Represent the observation matrix H in a block-matrix form

$$H = [H\_1, \dots, H\_{n\_v}] \tag{7.8}$$

which corresponds to the vector representation $\vec{x}$, that is,

$$H\vec{\mathbf{x}} = \sum\_{\nu=1}^{n\_v} H\_{\nu} \vec{\mathbf{x}}\_{\nu} \tag{7.9}$$

Compute the gain according to Eq. (2.4). We have

$$\begin{aligned} MH^T &= (M\_v \otimes M\_h) H^T = \begin{bmatrix} \Sigma\_1^T, \dots, \Sigma\_{n\_v}^T \end{bmatrix}^T, \\ \Sigma\_l &= M\_h \sum\_{k=1}^{n\_v} c\_{lk} H\_k^T = M\_h G\_{v,l}, \\ MH^T &= M\_d G\_v, \quad M\_d = \text{block diag}\,[M\_h, \dots, M\_h], \\ G\_v &= [G\_{v,1}^T, \dots, G\_{v,n\_v}^T]^T, \quad G\_{v,l} := \sum\_{k=1}^{n\_v} c\_{lk} H\_k^T, \\ \Sigma &:= HM H^T + R = \sum\_{k=1, l=1}^{n\_v} c\_{lk} H\_l M\_h H\_k^T + R, \end{aligned} \tag{7.10}$$

As proved in Ref. [18], in this case, the gain has the following form

$$K = M\_d K\_v, \quad K\_v = \begin{bmatrix} K\_{v,1}^T, \dots, K\_{v,n\_v}^T \end{bmatrix}^T = G\_v \Sigma^{-1}, \quad K\_{v,l} = G\_{v,l} \Sigma^{-1}, \ l = 1, \dots, n\_v, \tag{7.11}$$

where Gv,l and Σ are defined in Eq. (7.10).
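The practical point of Eqs. (7.8)–(7.11) is that the gain can be applied without ever forming the full ECM M: only Mh (nh × nh) and the innovation covariance Σ (p × p) are needed. The following sketch implements these formulas for a toy configuration; the block observation operators and dimensions are assumptions chosen for illustration, not those of the MICOM experiment.

```python
import numpy as np

def reduced_gain(M_v, M_h, H_blocks, R):
    """
    Gain of Eqs. (7.10)-(7.11) for a separable ECM M = M_v (x) M_h.
    H_blocks[k] is the block H_k of H = [H_1, ..., H_nv] (p x n_h each).
    Only n_h x n_h and p x p matrices are formed, never the full M.
    """
    n_v = M_v.shape[0]
    # G_{v,l} = sum_k c_lk H_k^T  (n_h x p)
    G = [sum(M_v[l, k] * H_blocks[k].T for k in range(n_v)) for l in range(n_v)]
    # Sigma = sum_{k,l} c_lk H_l M_h H_k^T + R  (p x p), Eq. (7.10)
    Sigma = sum(M_v[l, k] * H_blocks[l] @ M_h @ H_blocks[k].T
                for k in range(n_v) for l in range(n_v)) + R
    Sinv = np.linalg.inv(Sigma)
    K_blocks = [M_h @ G_l @ Sinv for G_l in G]   # block rows of K = M_d K_v, Eq. (7.11)
    return np.vstack(K_blocks)                   # full gain (n_v*n_h x p)

# tiny usage: n_v = 2 layers, n_h = 4 points, p = 2 observations of the first layer
n_h, p = 4, 2
M_h = np.exp(-np.abs(np.subtract.outer(np.arange(n_h), np.arange(n_h))) / 2.0)
M_v = np.array([[1.0, 0.5], [0.5, 1.0]])
H1 = np.eye(p, n_h); H2 = np.zeros((p, n_h))     # only the first layer is observed
K = reduced_gain(M_v, M_h, [H1, H2], R=2.0 * np.eye(p))
print(K.shape)
```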

#### 7.2.3. Cooper-Haines filter: CHF

To see the performance of the AF, we also implement a so-called CHF. The CHF [7] is obtained from (7.9), (7.10) under three hypotheses [24]: (H1) the analysis error of the system output is canceled in the case of noise-free observations; (H2) conservation of the linear potential vorticity (PV); and (H3) there is no correction for the velocity at the bottom layer. The AF in Ref. [24] is obtained by relaxing one or several of the hypotheses (H1)–(H3). From the point of view of filtering theory, the difference between the PEF and CHF lies in the way we estimate the elements of the ECM.

For the choice of the tuning parameters in the PEF, see Ref. [24].

#### 7.3. Numerical results

First, we run the model initialized by the climatology. This run differs from the one used for modeling the sequence of true ocean states only in that the initial state is replaced by the climatology. This run is denoted as "model". Figure 6 shows the (spatially) averaged variance between the SSH observations and that produced by the model. We see the error grows as time progresses, meaning instability of the numerical model with respect to the perturbed initial condition. This fact signifies that the VM will have difficulties in producing high-performance estimates if the assimilation window is long.

Figure 6. SSH prediction errors resulting from the "model" run and the CHF: the growth of the prediction error in the "model" run signifies instability of the numerical model with respect to the specification of the initial system state.

Next, the two filters, CHF and PEF, are implemented under the same initial condition as that used in the "model" experiment. It is seen from Figure 6 that, initialized by the same initial condition, the CHF is much more efficient than the model in reducing the estimation error. The performance comparison between the CHF and PEF is presented in Figure 7. Here, the (spatially averaged variance) SSH prediction errors resulting from the two filters CHF and PEF are displayed. The superiority of the PEF over the CHF is undoubted. It is clear from Figure 7 that the PEF is capable of producing better estimates, with a lower error level, over the whole assimilation period. On the other hand, while the estimation error in the CHF decreases continuously at the beginning of the assimilation and is more or less stabilized afterwards (during the interval k ∈ (10, 45)), it increases considerably at the end of the assimilation period. This means that the PEF is more efficient than the CHF and that the ECM in the PEF, constructed on the basis of the S-PE samples, has the effect of stabilizing its behavior.


Figure 7. SSH prediction errors resulting from the two filters: CHF and PEF. The PEF is capable of producing estimates with lower estimation errors and is stable over the whole assimilation period, whereas the CHF has difficulty maintaining the same performance at the end of the assimilation period. The PEF is much better than the CHF in providing estimates for the system states.

To see the effect of adaptation, Figure 8 displays the filtered errors for the u-velocity component estimates at the surface, produced by the PEF and the APEF, respectively. The APEF is an AF, namely an adaptive version of the PEF. Here, the tuning parameters are optimized by the SPSA method. From the computational point of view, the SPSA requires much less integration time and memory storage compared with the traditional AE method. At each assimilation instant, we have to make only two integrations of the MICOM for approximating the gradient vector. From Figure 8, one sees that the adaptation allows a significant reduction of the estimation errors produced by the PEF.

Figure 8. Filtered errors for the u-velocity component estimate, resulting from the PEF and the APEF (AF based on the PEF). Optimization is performed by SPSA. By tuning the parameters in the filter gain, one can improve considerably the performance of the PEF.
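For reference, a generic SPSA iteration is sketched below: at each step the cost is evaluated at two simultaneously perturbed parameter vectors, which is why only two model integrations per assimilation instant are needed regardless of the number of tuning parameters. The gain sequences used here are simplified, and the quadratic toy cost merely stands in for the innovation-based sample cost (4.1).

```python
import numpy as np

def spsa_minimize(J, theta, n_iter=50, a=0.1, c=0.1, seed=0):
    """
    Simultaneous Perturbation Stochastic Approximation (SPSA).
    Each iteration needs only two evaluations of the sample cost J,
    whatever the dimension of the tuning-parameter vector theta.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for k in range(1, n_iter + 1):
        ak, ck = a / k, c / k ** 0.25                        # simplified decaying gains
        delta = rng.choice([-1.0, 1.0], size=theta.shape)    # Bernoulli perturbation
        g_hat = (J(theta + ck * delta) - J(theta - ck * delta)) / (2.0 * ck * delta)
        theta -= ak * g_hat                                   # stochastic gradient step
    return theta

# toy usage: a noisy quadratic stands in for the innovation-based cost
J = lambda th: np.sum((th - np.array([1.1, 0.9])) ** 2) + 0.01 * np.random.randn()
print(spsa_minimize(J, theta=[0.0, 0.0]))
```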

## 8. Conclusions

In this chapter, a comparison study on the performance of the AF with other existing methods is presented in the context of its application to data assimilation problems in high-dimensional numerical models. As is seen, in comparison with the standard VM, the AF is much simpler to implement and produces better estimates. The advantages of the AF over other methods such as the EKF and CHF are also demonstrated. The principal reason for the high performance of the AF lies in the choice of the innovation representation for the initial input-output system and in the selection of pertinent gain parameters as control variables to minimize the MSE of the innovation process. Whereas in the VM the choice of the structure for the initial state is the most important thing to do (but that is insufficient to guarantee its high performance), in the AF the initial state has little impact on performance. This happens because the AF is selected as optimal in the class of stable filters and, as a consequence, the error in the initial estimate is attenuated during the assimilation process. On the contrary, in the VM, the error in the specification of the initial guess is amplified during assimilation if the numerical model is unstable.

In conclusion, it is important to emphasize that the AF approach presented in this chapter is consolidated by exploiting the following road map: (i) generate the data ECM from S-PE samples which grow in the directions of dominant Schur vectors; (ii) select a parametrized structure for the ECM under the SeVHS hypothesis; (iii) choose the tuning parameters in the gain by minimizing the distance between the data ECM and that having the SeVHS structure; and (iv) adjust the unknown parameters in the gain in order to minimize the PE error of the system output by applying the SPSA algorithm.

There is a wide variety of engineering problems to which the AF is applicable and which could be worthy of further study. Depending on the particular problem, other modifications would undoubtedly be helpful to improve the filter performance and to simplify its implementation. But the main features of the AF presented in this chapter remain the key points to follow in order to preserve a high performance of the AF.

## Author details

Hong Son Hoang\* and Remy Baraille

\*Address all correspondence to: hhoang@shom.fr

SHOM/HOM/REC, Toulouse, France

## References


[17] Hamill T.M.: Ensemble-based atmospheric data assimilation. In: Palmer T., editor. Predictability of Weather and Climate. Cambridge Press; 2006; pp. 124–156.

[18] Hoang H.S. and Baraille R.: A low cost filter design for state and parameter estimation in very high dimensional systems. In: Proceedings of the 19th IFAC Congress; August, 2014; pp. 3256–3261.

[19] Daley R.: The effect of serially correlated observation and model error on atmospheric data assimilation. Monthly Weather Rev. 1992; 120: 165–177.

[20] Baraille R.: Modélisation numérique avec le modèle HYCOM au SHOM. Available at: http://mathocean.math.cnrs.fr/presentations/Baraille.pdf

[21] Lorenz E.N.: Deterministic non-periodic flow. J Atmos Sci. 1963; 20: 130–141.

[22] Kivman G.A.: Sequential parameter estimation for stochastic systems. Nonlinear Process Geophys. 2003; 10: 253–259. doi:10.5194/npg-10-253-2003

[23] Hoang H.S. and Baraille R.: Prediction error sampling procedure based on dominant Schur decomposition. Application to state estimation in high dimensional oceanic model. J Appl Math Comput. 2011; 12: 3689–3709.

[24] Hoang H.S., Baraille R. and Talagrand O.: On an adaptive filter for altimetric data assimilation and its application to a primitive equation model MICOM. Tellus. 2005; 57A(2): 153–170.

[25] Golub G.H. and Van Loan C.F.: Matrix Computations. Johns Hopkins University Press; 1996.

#### **Applications of the H-Principle of Mathematical Modelling**

#### Agnar Höskuldsson

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66153

#### Abstract

Traditional statistical test procedures are briefly reviewed. It is pointed out that significance testing may not always be reliable. The author has formulated a modelling procedure, the H-principle, for how mathematical modelling should be carried out in the case of uncertain data. Here it is applied to linear regression. Using this procedure, the author has developed a common framework for carrying out linear regression. Six regression methods are analysed within this framework: two stepwise methods, principal component regression, ridge regression, PLS regression and an H-method. The same algorithm is used for all methods. It is shown how model validation and graphic analysis, which are popular in chemometrics, apply to all the methods. Validation of the methods is carried out by using numerical measures, cross-validation and test sets. Furthermore, the methods are tested by a blind test, where 40 samples have been excluded. It is shown how procedures in applied statistics and in chemometrics both apply to the present framework.

Keywords: linear regression, common framework for linear regression, stepwise regression, ridge regression, PLS regression, H-methods

## 1. Introduction

Regression analysis is the most studied subject within theoretical statistics. Numerous books have been published. Advanced program packages have been developed that make it easy for users to develop and study advanced models and methods.

Advanced program packages such as SAS and SPSS have been used by students since the 1970s, and the program packages have become very advanced. The user fills out a 'menu' similar to those at restaurants. The program then carries out the analysis as requested, giving users the possibility of carrying out highly advanced analyses of the data. Users rely on being able to interpret the output in the way they learned at school. For example, if a variable/factor is significant, its presence in the model improves the predictions derived from the modelling results.

Today measurement instruments are becoming more and more advanced, e.g. optical instruments are becoming popular in industry. They may give thousands of values each time a sample is measured. In applied research, the tendency is to include as much information as possible in order to cover possible alternatives of the situation. When adding interactions or non-linear terms, we also see data having hundreds or thousands of variables.

There are three basic challenges in applied data analysis, which are as follows: (1) Data in applied sciences and industry are typically generated by instruments that produce many variables. In these cases, the data have latent structure, which means that the data values are geometrically located in a low-dimensional space. However, mathematical models and methods usually assume that the X-data have full rank. In these cases, the models are often incorrect and may give imprecise results.

(2) Typically, scientists develop solutions that are optimal or unbiased. This is a natural approach, because the derived solutions have important properties when data satisfy the given model. Simulations based on the model confirm the optimality, but in practice the data often do not satisfy the assumptions of the model. Forcing an optimal solution on the data may give results that have bad or no prediction ability. A better solution may then be obtained by relaxing the optimality or unbiasedness of the solution.

(3) It is a tradition to base the results of regression analysis on testing the significance of the variables in the model. However, the results may not always be reliable. The influence of a variable can be so small that it may be of no importance; see Section 3.3. At professional organizations, this is considered to be a serious problem for researchers in the field.

The H-principle is a prescription for how a solution to a mathematical model should be generated when data are uncertain. It proposes that the determination of the solution should be carried out in steps, computing the improvement in the solution and the associated precision at each step. It is suggested that the solution at each step should be an optimal balance between the improvement and the associated precision. Determination of the solution continues as long as it is supported by the given (uncertain) data.

The author has developed a framework for linear regression, which is inspired by the H-principle. Using this framework, the same algorithm is applied for carrying out different types of regression analysis. Most regression methods based on linear algebra can be carried out within this framework. Here, we carry out the analysis of six different methods. Numerical and graphic analyses of the results are the same for all the regression methods.

In Section 2, we specify the notation used and briefly describe the data. In Section 3, we consider some basic issues in modelling data. The background for the H-principle is treated. In Section 4, we consider the latent variable regression model. In Section 5, we present the H-principle and show some examples of its usage. In Section 6, we present a common framework for linear regression. In Section 7, we discuss model validation. This is an important topic, but we only consider the essential aspects that are used in the analysis of the six methods. In Section 8, we present the results for six different regression methods: (1) stepwise (forward) regression maximising covariance, (2) stepwise (forward) regression maximising R², (3) principal component regression, (4) ridge regression, (5) PLS regression and (6) an H-method.


Section 9 briefly discusses the results presented. In Section 10, we mention application of the H-principle to multi-block and path modelling, non-linear modelling and extension to multilinear algebra. Section 11 presents conclusions.

## 2. Notation, data and scaling

Matrices and vectors are denoted by bold letters; matrices by upper case and vectors by lower case. In order to facilitate the reading of the equations, different types of indices are used for the steps and for the matrices/vectors. The letters a and b are related to the steps in the algorithm. The letters i, j and k are used for the indices within a matrix/vector. It is assumed that there are given instrumental data X, an N × K matrix, and response data Y, an N × M matrix. The regression data are denoted by (X, Y). In some cases, only one y-variable, y, is used. This is done to simplify the equations. It is assumed that data are centred, which means that average values are subtracted from each column of X and Y. This also makes it easier to read the equations. x_j is the jth column of X and x^i is the ith row of X = (x_ij). A latent variable τ is a linear combination of the original measured variables,

$$
\tau = a\_1 \mathbf{x}\_1 + a\_2 \mathbf{x}\_2 + \dots + a\_K \mathbf{x}\_K \tag{1}
$$

The data used here are from an optical instrument. The instrument gives 1200 values at each measurement. However, technical knowledge of the instrument suggests that only 40 values should be used for determining the substance in question, the y-values. Two hundred samples are measured. Forty samples are put aside for a blind test. Thus, the calibration analysis is based on 160 samples. This gives X as a 160 × 40 matrix and y as a 160-vector. These data are challenging and represent a common situation when working with optical instruments (FTIR, NIR, fluorescence etc.).

It is sometimes important to determine whether data should be scaled or not. Scaling can be obtained by multiplying X and Y by diagonal matrices. If C1 and C2 are diagonal matrices, the scaling of X is done by the transformation X ← (XC1) and of Y by Y ← (YC2). The linear least squares solution for (X, Y) is given by B = (XᵀX)⁻¹XᵀY. This solution can be obtained from the linear least squares solution for scaled data, B1, as follows,

$$\mathbf{B} = \mathbf{C}\_1[(\mathbf{X}\mathbf{C}\_1)^T(\mathbf{X}\mathbf{C}\_1)]^{-1}(\mathbf{X}\mathbf{C}\_1)^T(\mathbf{Y}\mathbf{C}\_2)\mathbf{C}\_2^{-1} = \mathbf{C}\_1\mathbf{B}\_1\mathbf{C}\_2^{-1} \tag{2}$$

This shows that when computing the linear least squares solution, we can work with the scaled data. The original solution is obtained by scaling 'back' as shown in the equation. We also use this property when we compute the approximate solution. The effect of scaling is better numerical precision. Scaling is necessary for the present data. If data are not scaled (e.g. to unit variance), numerical results beyond a dimension of around 15 may not be reliable. The reason is that Eqs. (20) and (23) are sensitive to small numerical values. Scaling is much debated among researchers. When scaling is used, one must ensure that all variables have values above the 'noise' level of the instrument. This is a difficult topic, which is not considered further here. If the original data follow a normal distribution, the scaled ones will not. However, for a large number of samples like that given here, the differences will be negligible.
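The identity (2) is easy to verify numerically. The sketch below does so on randomly generated centred data standing in for the instrument data described above; the scaling matrices C1 and C2 scale the columns to unit variance, which is one common choice.

```python
import numpy as np

# Numerical check of Eq. (2): least squares on scaled data, then scaling back.
rng = np.random.default_rng(3)
N, K, M = 160, 40, 1                               # shapes as in the data described above
X = rng.normal(size=(N, K)); Y = rng.normal(size=(N, M))
X -= X.mean(0); Y -= Y.mean(0)                     # centre the data

C1 = np.diag(1.0 / X.std(0))                       # scale X to unit variance
C2 = np.diag(1.0 / Y.std(0))                       # scale Y to unit variance
B  = np.linalg.solve(X.T @ X, X.T @ Y)             # B = (X'X)^{-1} X'Y
Xs, Ys = X @ C1, Y @ C2
B1 = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)         # solution for scaled data
print(np.allclose(B, C1 @ B1 @ np.linalg.inv(C2))) # B = C1 B1 C2^{-1}, Eq. (2)
```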

## 3. Linear regression

#### 3.1. Traditional regression model

It is assumed that there are given instrumental data X and response data y. A linear regression model is given by

$$y = \beta\_1 \mathbf{x}\_1 + \beta\_2 \mathbf{x}\_2 + \dots + \beta\_K \mathbf{x}\_K + \varepsilon \tag{3}$$

The x-variables are called independent variables and the y-variable is the dependent one. When the parameters β have been estimated as b, the estimated model is now

$$y = b\_1 \mathbf{x}\_1 + b\_2 \mathbf{x}\_2 + \dots + b\_K \mathbf{x}\_K \tag{4}$$

When a new sample x0 = (x10, x20, …, xK0) is given, it gives the estimated or predicted value y0 = b1x10 + … + bKxK0. There can be many estimates for the regression coefficients. Therefore, Greek letters are used for the theoretical parameters and Roman letters for the estimated values. It is common to use the linear least squares method for estimating the parameters. It is based on minimizing the residuals, (y − Xβ)ᵀ(y − Xβ) → minimum, with respect to β. The solution is given by

$$\mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}.\tag{5}$$

It is common to assume that the residuals in Eq. (3) are normally distributed. This is often written as y ~ N(Xβ, σ²). It means that the expected value of y is E(y) = Xβ and the variance is Var(y) = σ²I, where I is the identity matrix. The linear least squares procedure coincides with the maximum likelihood method in case the data follow a normal distribution. Assuming the normal distribution, the parameter estimate b has the variance given by

$$Var(b) = \sigma^2 (X^T X)^{-1}.\tag{6}$$

Here σ² is estimated by

$$
\sigma^2 \cong \sum\_{i=1}^{N} e\_i^2 / (N - K) \tag{7}
$$

where the residuals, e_i, are computed from e = y − Xb.
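For completeness, a short Python sketch of Eqs. (5)–(7) is given below; it computes the least squares coefficients, the residual variance estimate and Var(b) on simulated data (the chapter's instrument data are not used, and all names are illustrative).

```python
import numpy as np

def ols_fit(X, y):
    """Least squares estimate b (Eq. (5)), residuals e and s^2 (Eq. (7))."""
    N, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
    e = y - X @ b                             # residuals
    s2 = float(e @ e) / (N - K)               # estimate of sigma^2
    cov_b = s2 * np.linalg.inv(X.T @ X)       # Var(b) of Eq. (6), with sigma^2 -> s^2
    return b, s2, cov_b

# toy usage with simulated data
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.normal(size=100)
b, s2, cov_b = ols_fit(X, y)
print(b, s2)
```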

#### 3.2. Assumptions and properties


It is assumed that the matrix X<sup>T</sup>X has an inverse. Furthermore, it is assumed that the samples are independent of each other. In case the model (Eq. (3)) is correct and the data follow the normal distribution, the estimate (Eq. (5)) has some important properties compared with other possible estimates:

a. the estimate b is unbiased, E(b) = β

b. among unbiased estimates, b has the smallest possible variance

There is a general agreement that the variance matrix Var(b) should be as small as possible. The matrix (X<sup>T</sup>X)<sup>–1</sup>, the precision matrix, shows how precise the estimates b are. The properties (a) and (b) state that, assuming the linear model (Eq. (3)), the solution (Eq. (5)) is, from a theoretical point of view, the best possible one. It often occurs that the model (Eq. (3)) contains many variables. Interpretation of the estimation results (Eq. (4)) is an important part of the statistical analysis. Therefore, it is common to evaluate the parameters with the aid of a significance test. If a variable is not significant, it may be excluded. A parameter associated with a variable is commonly evaluated by a t-test. A t-test of a parameter b<sub>i</sub> is given by

$$t = \frac{b\_i}{\sqrt{Var(b\_i)}} = \frac{b\_i}{\sqrt{s^2 s^{ii}}} \tag{8}$$

Here s<sup>2</sup> is given by Eq. (7) and s<sup>ii</sup> is the ith diagonal element of (X<sup>T</sup>X)<sup>–1</sup>.

This is why statistical program packages such as SAS and SPSS compute the linear least squares solution as the initial solution in linear regression analysis. The significance of the parameters in Eq. (4) is evaluated using Eq. (8).

Problems with the least squares solution appear when the precision matrix becomes close to singular. If the computer program (SAS or SPSS) detects that the computation of b may be imprecise due to near-singularity of the precision matrix, the user is informed that the estimates b may be imprecise. However, this warning is based on a judgement of the situation and on the numerical precision of the computer. Practical problems arise long before numerical uncertainty in the estimates b becomes an issue.

Consider two examples. Suppose that X is N × 2 and that x<sup>1</sup> = c x<sup>2</sup> + δ for some constant c, where |δ| < 10<sup>–5</sup>. We may be able to compute (X<sup>T</sup>X)<sup>–1</sup>; however, inference on the model (Eq. (4)) may be uncertain or incorrect. In the second example, suppose that X is N × 40 but that its practical rank is 15. Assume that X<sup>T</sup>X = X<sub>1</sub><sup>T</sup>X<sub>1</sub> + X<sub>2</sub><sup>T</sup>X<sub>2</sub>, where X<sub>2</sub> = (x<sub>2,1</sub> x<sub>2,2</sub> … x<sub>2,40</sub>). If all x<sub>2,i</sub>, i = 1, …, 40, are small, say |x<sub>2,i</sub>| < 10<sup>–5</sup>, inference from Eq. (4) may be uncertain or incorrect. The first example may not be realistic, but the second one is: it is common that data in applied sciences and industry are of reduced practical rank. In these cases, the model (Eq. (3)) is incorrect and leads to uncertain or incorrect results.

#### 3.3. Stepwise linear regression

We shall consider the procedure of stepwise linear regression more closely. We do that by using the Cholesky factorization X<sup>T</sup>X = FF<sup>T</sup>, where F is lower triangular. Then we can write the columns of the data matrix X as

$$\begin{aligned} \mathbf{x}\_1 &= F\_{11}\mathbf{t}\_1\\ \mathbf{x}\_2 &= F\_{21}\mathbf{t}\_1 + F\_{22}\mathbf{t}\_2\\ &\dots\\ \mathbf{x}\_K &= F\_{K1}\mathbf{t}\_1 + F\_{K2}\mathbf{t}\_2 + \dots + F\_{KK}\mathbf{t}\_K \end{aligned} \tag{9}$$

The vectors (t<sub>i</sub>) are mutually orthogonal and of length 1, t<sub>i</sub><sup>T</sup>t<sub>j</sub> = δ<sub>ij</sub>. The significance of x<sub>K</sub>, when x<sub>1</sub>, x<sub>2</sub>, …, x<sub>K–1</sub> are given, is computed from Eq. (8) with

$$\mathbf{b}\_{K} = \frac{\mathbf{y}^{T}(\mathbf{F\_{KK}}\mathbf{t\_{K}})}{\mathbf{F\_{KK}^{2}}} = \frac{(\mathbf{y}^{T}\mathbf{t\_{K}})}{F\_{KK}} \text{ and } \mathbf{s^{KK}} = \mathbf{1}/F\_{KK}^{2} \tag{10}$$

This gives

$$t = \frac{(\mathbf{y}^T \mathbf{t}\_K)}{s} \tag{11}$$

This shows that the t-test for the significance of x<sub>K</sub> is independent of the size of F<sub>KK</sub>. When the selection of a new variable is carried out among many variables (e.g. 500), there is a considerable risk that F<sub>KK</sub> is so small that the marginal effect of x<sub>K</sub> is of no importance although it is declared statistically significant. The issue is that the user of program packages is not informed whether a significant variable actually improves the prediction of the response variable.
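The selection risk described above can be illustrated with a small simulation (simulated noise data; the numbers have nothing to do with the chapter's dataset): even when y is unrelated to all 500 candidate variables, the variable selected by maximal covariance typically obtains a t-value well above 2.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 100, 500
X = rng.normal(size=(N, K))
y = rng.normal(size=N)                 # y is pure noise: no variable is truly relevant

# Forward selection step: pick the variable with the largest |covariance| with y.
cov = X.T @ y
j = np.argmax(np.abs(cov))
x = X[:, j]

# t-value of the selected variable (Eqs. (5)-(8) with a single regressor).
b = (x @ y) / (x @ x)
e = y - b * x
s2 = e @ e / (N - 1)
t = b / np.sqrt(s2 / (x @ x))
print(t)                               # typically well above 2 despite y being independent of X
```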

#### 3.4. Variance of regression coefficients

It is instructive to study the variance of the regression coefficients in the linear least squares model, Eq. (6), more closely. It can be written as

$$Var(\mathbf{b}) \cong [\mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}] \times [(\mathbf{X}^T \mathbf{X})^{-1}]/(N - K)\tag{12}$$

An important objective of the modelling task is to keep the value of Eq. (12) as small as possible. Equation (12) is a product of two parts. For these two parts it can be shown that, when a new variable is added to the model,

i. the residual error, [y<sup>T</sup>y − y<sup>T</sup>X(X<sup>T</sup>X)<sup>–1</sup>X<sup>T</sup>y], always decreases

ii. the precision matrix, (X<sup>T</sup>X)<sup>–1</sup>, always increases

(In theory these measures can be unchanged, but in practice these changes always occur.) Note that the two terms, (i) and (ii), are equally important and appear in a symmetric way in Eq. (12). If the data are normally distributed, it can be shown that

a. the residual error, [y<sup>T</sup>y − y<sup>T</sup>X(X<sup>T</sup>X)<sup>–1</sup>X<sup>T</sup>y], and

b. the precision matrix, (X<sup>T</sup>X)<sup>–1</sup>

are stochastically independent. This means that knowledge of one of them does not give any information about the other. If, for instance, a certain fit has been obtained, it carries no information on (b). In order to know the quality of predictions, we must compute (b) and use Eq. (12). Knowing (a), the results concerning the precision matrix can be good or bad. Program packages only use the t-test, or some equivalent measure, to test the significance of the variables/components; there is no information on the precision matrix. We must compute the precision matrix together with the significance testing in order to find out how well the model is performing. A modelling procedure must include both (a) and (b) in order to secure small values of Eq. (12): it must handle both the decrease of (i) and the increase of (ii) during the model estimation.
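A short numerical illustration of the two factors of Eq. (12), again on simulated data: as variables are added, the residual error (i) decreases while the size of the precision matrix (ii) increases, so the product has to be monitored jointly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 100, 20
X = rng.normal(size=(N, K))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=N)

# Track the two factors of Eq. (12) as variables are added one by one.
for k in range(1, K + 1):
    Xk = X[:, :k]
    XtX_inv = np.linalg.inv(Xk.T @ Xk)
    rss = y @ y - y @ Xk @ XtX_inv @ Xk.T @ y   # (i) residual error, decreases
    prec = np.trace(XtX_inv)                    # (ii) size of the precision matrix, increases
    var_b = rss * prec / (N - k)                # trace of Eq. (12)
    print(f"k={k:2d}  residual={rss:8.2f}  trace((X'X)^-1)={prec:6.3f}  trace(Var(b))={var_b:7.4f}")
```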

## 4. Latent variable model


The measurement data in applied sciences and industry often represent a 'system', where variables depend on each other through chemical equilibria, action and reaction in physics, and physical and technical balances. Geometrically, the samples are located in a low-dimensional space. In these cases, the model (Eq. (3)) is incorrect (apart from simple, designed laboratory experiments) and the nice theory above may not be applicable. Why is this the case? When a new sample is available, we cannot automatically insert it into Eq. (4) and compute the y-value. We must ensure that the new sample is located geometrically along the samples used for the analysis, the rows of X. A correct model for this kind of data is

$$y = a\_1\tau\_1 + a\_2\tau\_2 + \dots + a\_A\tau\_A + \varepsilon.\tag{13}$$

Here (τ<sub>a</sub>) are the latent variables and (a<sub>a</sub>) are the regression coefficients on the latent variables. The regression task is both to determine the latent variables, the τ's, and to compute the associated regression coefficients. The resulting Eq. (13) is then converted into Eq. (4) using Eq. (26). The model (Eq. (13)) is called latent structure linear regression. It is sometimes difficult to give a 'good' interpretation of the variables when using latent variables. However, by using an appropriate modelling procedure, it is possible to obtain a fairly good interpretation of the variables (see Section 5.4).

## 5. The H-principle of mathematical modelling

#### 5.1. Background

In the 1920s, there were extensive discussions on the measurement aspects of quantum mechanics. W. Heisenberg pointed out that there are certain magnitudes that cannot be determined exactly at the same time. His famous uncertainty inequality states that there is a lower limit to how well conjugate magnitudes can be determined at the same time. The position and momentum of an elementary particle are an example. For that example, the inequality is

$$
\Delta(\text{position}) \times \Delta(\text{momentum}) \gtrsim \text{constant} \tag{14}
$$

The lower limit of the inequality is related to Planck's constant. These considerations are based on physical theory. In practice, it means that there are some restrictions on the outcome of an experiment. The results may depend on the instrument and the phenomenon being studied. The restrictions are detected by the application of the measurement instrument. The uncertainty inequality and the associated theory are a kind of guidance when carrying out experiments.

#### 5.1.1. The H-principle

When modelling data, there is an analogous situation. Instead of a measurement instrument we have a mathematical method. The conjugate magnitudes are ΔFit and ΔPrecision and they cannot be controlled at the same time. It is recommended to carry out the modelling in steps and at each step evaluate the situation as prescribed by the uncertainty inequality. At each step, it is assumed that there is given a weight vector. The recommendations of the H-principle of mathematical modelling are the following (expressed in the case of regression analysis):

1. Carry out determining the solution in steps. You specify how you want to look at the data at this step by formulating how the weights are computed.
2. At each step compute
	- a. expression for improvement in fit, Δ(Fit)
	- b. the associated prediction, Δ(Precision)
3. Compute the solution that minimises the product Δ(Fit) × Δ(Precision).
4. In case the computed solution improves the prediction abilities of the model, the solution is accepted. If the solution does not provide this improvement, the modelling stops.
5. The data are adjusted for what has been selected; restart at 1.

Note that the H-principle is a recommendation of how to proceed in determining the solution of a mathematical problem when data are uncertain. It is not like maximum likelihood, where a certain function is to be maximized. However, it is necessary to compute the improvement in the solution at each step and the associated precision. The optimal balance is described in Section 3, and the situation is evaluated at each step to find out if the data are following along.

#### 5.1.2. Application to linear regression

Let us consider more closely how this applies to linear regression. The task is to determine a weight vector w according to this principle. For the score vector t = Xw we have

	- a. Improvement in fit: |Y<sup>T</sup>t|<sup>2</sup>/(t<sup>T</sup>t)
	- b. Associated variance: Σ/(t<sup>T</sup>t)

If S = X<sup>T</sup>X, the adjustment (Eq. (20)) gives orthogonal score vectors, t's. Therefore, the variance, (b), is used for the precision. Treating Σ as a constant, the task is to maximize


$$
\left[\frac{|\mathbf{Y}^T\mathbf{t}|^2}{\mathbf{t}^T\mathbf{t}}\right] \times \frac{1}{\left[\frac{1}{\mathbf{t}^T\mathbf{t}}\right]} = \mathbf{w}^T\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{w} \tag{15}
$$

This is a maximization task because improvement in fit is negative. The maximization is carried out under the restriction that w is of length 1, |w|=1. By using the Lagrange multiplier method, it can be shown that the task is an eigenvalue task,

$$\mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X} \mathbf{w} = \lambda \mathbf{w} \tag{16}$$

In case there is only one y-variable, the eigenvalue task has a direct solution

$$w = \frac{X^T y}{|X^T y|} = \frac{\mathbf{C}}{|\mathbf{C}|} \tag{17}$$

These are the solutions used in PLS regression. Thus, we can state that PLS regression is consistent with the H-principle. H-methods take the PLS solution as a starting point and determine how the prediction aspect of the solution can be improved.
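The following sketch shows one weight/score step in the sense of Eq. (17), followed by the standard NIPALS-type deflation; it is only meant to make the PLS connection concrete and uses simulated data.

```python
import numpy as np

def pls_step(X, y):
    """One component in the sense of Eq. (17): weight, score, x-loading and y-loading."""
    C = X.T @ y                       # covariance vector
    w = C / np.linalg.norm(C)         # Eq. (17): weight vector of unit length
    t = X @ w                         # score vector
    p = X.T @ t / (t @ t)             # x-loading, used to deflate X
    q = y @ t / (t @ t)               # y-loading
    return w, t, p, q

rng = np.random.default_rng(4)
X = rng.normal(size=(160, 40))
y = X @ rng.normal(size=40) + rng.normal(size=160)
Xc, yc = X - X.mean(0), y - y.mean()

Xd, yd = Xc.copy(), yc.copy()
for a in range(3):                    # three components, NIPALS-style deflation
    w, t, p, q = pls_step(Xd, yd)
    Xd = Xd - np.outer(t, p)
    yd = yd - q * t
    print(f"component {a + 1}: cumulative R^2 = {1 - (yd @ yd) / (yc @ yc):.3f}")
```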

#### 5.2. Interpretation of the modelling results

The fit obtained by the score vector t is

$$\frac{\left(\mathbf{y}^{\mathsf{T}}\mathbf{t}\right)^{2}}{\left(\mathbf{t}^{\mathsf{T}}\mathbf{t}\right)} = \frac{\left(\mathbf{C}^{\mathsf{T}}\mathbf{C}\right)^{2}}{\mathbf{C}^{\mathsf{T}}\mathbf{S}\mathbf{C}} = \left[\frac{c\_{1}^{2}}{f} + \frac{c\_{2}^{2}}{f} + \dots + \frac{c\_{K}^{2}}{f}\right]^{2} \tag{18}$$

where $f = \sqrt{\mathbf{C}^T\mathbf{S}\mathbf{C}}$.


The regression coefficient can be written similarly as

$$\frac{(\mathbf{y}^T \mathbf{t})}{(\mathbf{t}^T \mathbf{t})} = \frac{(\mathbf{C}^T \mathbf{C})^{1.5}}{\mathbf{C}^T \mathbf{S} \mathbf{C}} = \left[ \frac{c\_1^2}{g} + \frac{c\_2^2}{g} + \dots + \frac{c\_K^2}{g} \right]^{1.5} \tag{19}$$

where $g = (\mathbf{C}^T\mathbf{S}\mathbf{C})^{1/1.5}$.

At each step, we can thus see how much each variable contributes to the fit and to the regression coefficient.
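A small sketch of how the per-variable terms c<sub>i</sub><sup>2</sup>/f of Eq. (18) can be computed for one component (simulated data; the function name is ours).

```python
import numpy as np

def fit_contributions(X, y):
    """Per-variable contributions c_i^2 / f to the fit of one component, Eq. (18)."""
    C = X.T @ y                       # covariance vector
    S = X.T @ X
    f = np.sqrt(C @ S @ C)            # f = sqrt(C^T S C)
    contrib = C**2 / f                # the terms inside the bracket of Eq. (18)
    fit = contrib.sum()**2            # equals (C^T C)^2 / (C^T S C)
    return contrib, fit

rng = np.random.default_rng(5)
X = rng.normal(size=(160, 6))
y = X @ np.array([2.0, 0.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(size=160)
contrib, fit = fit_contributions(X - X.mean(0), y - y.mean())
print(np.round(contrib, 3), round(fit, 3))
```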

## 6. A common framework for linear regression

#### 6.1. Background: views on the regression analysis task

There are many ways to carry out a regression analysis. Here we present a general framework for carrying out linear regression that includes most methods based on linear algebra. The basic idea is to separate the computations into two parts. The first part is concerned with how one prefers to look at the data. In practice, the emphasis of a regression analysis can differ: sometimes it is important to determine important variables; in other cases, it may be the derived predictions that matter. The other part is the numerical algorithm that computes the solution vector and the associated measures. In the first part, it is assumed that a weight matrix W = (w1, …, wK) is given. Each weight vector should be of length one, although this is not necessary. The weight vectors (wa) specify how one wants to look at the data. They may be determined by some optimization or significance criteria. They may also be determined by a standard regression method such as PLS regression. In Section 6.4, we show the choices in the case of the six regression methods considered. The role of the weight vectors is only to compute the loading vectors. The other part of the computations is a numerical algorithm, which is the same for all choices of W. There is no restriction on the weight vectors except that they may not give a zero loading vector, p<sup>a</sup> ≠ 0.

#### 6.2. Decomposition of data

The starting point is a variance matrix S and a covariance matrix C. S can be any positive semidefinite matrix, but C is assumed to be the covariance, C = X<sup>T</sup> Y. The algorithm is formulated for multiple y's.

Initially S<sup>0</sup> = S and C<sup>0</sup> = C and B = 0. For a = 1, …, K:

$$\text{Loading\\_vector}: \mathbf{p}\_a = \mathbf{S}\_{a-1} \mathbf{w}\_a \tag{20}$$

$$\text{Scaling constant}: d\_a = \frac{1}{\mathbf{w}\_a^T \mathbf{p}\_a} \tag{21}$$

$$\text{Loading vector}: \mathbf{q}\_a = \mathbf{C}\_{a-1}^T \mathbf{w}\_a \tag{22}$$

$$\text{Loading weight vector}: \mathfrak{v}\_a \tag{23}$$

The loading weight vectors are computed by, see reference [1],

$$\mathbf{v}\_1 = \mathbf{w}\_1, \mathbf{v}\_a = \mathbf{w}\_a - d\_1(\mathbf{p}\_1^T \mathbf{w}\_a)\mathbf{v}\_1 - \dots - d\_{a-1}(\mathbf{p}\_{a-1}^T \mathbf{w}\_a)\mathbf{v}\_{a-1}, a = \text{2, 3, ..., K} \tag{24}$$

S is adjusted by the loading vector p<sup>a</sup> and similarly for C,

$$\mathbf{S}\_{a} = \mathbf{S}\_{a-1} \mathbf{-} d\_{a} \mathbf{p}\_{a} \mathbf{p}\_{a}^{T} \tag{25}$$

$$\mathbf{C}\_{a} = \mathbf{C}\_{a-1} \mathbf{-} d\_{a} \mathbf{p}\_{a} \mathbf{q}\_{a}^{T} \tag{26}$$

The adjustment of S is a rank-one reduction. S also reduces in size by

$$tr(d\_a \mathbf{p}\_a \mathbf{p}\_a^T) = d\_a \mathbf{p}\_a^T \mathbf{p}\_a = \frac{\mathbf{w}\_a^T \mathbf{S}^2 \mathbf{w}\_a}{\mathbf{w}\_a^T \mathbf{S} \mathbf{w}\_a} > 0. \tag{27}$$

The loading weight matrix V satisfies V<sup>T</sup>P = D<sup>–1</sup>, see reference [1].
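To make the two-part structure concrete, here is a small sketch of the decomposition in Eqs. (20)–(27) in Python/NumPy. The data and the identity weight matrix (variables taken in their given order) are illustrative assumptions, not the chapter's data; the sketch also checks numerically the expansions of Section 6.3, namely that S is fully expanded by the loadings and that the accumulated B solves the least squares problem.

```python
import numpy as np

def decompose(S, C, W):
    """Decomposition algorithm of Eqs. (20)-(27) for a given weight matrix W.

    Returns loading vectors P, scaling constants d, y-loadings Q and
    loading weight vectors V, so that S = P diag(d) P^T and B = V diag(d) Q^T.
    """
    K = S.shape[0]
    Sa, Ca = S.copy(), C.copy()
    P, Q, V, d = [], [], [], []
    for a in range(K):
        w = W[:, a]
        p = Sa @ w                                   # Eq. (20)
        da = 1.0 / (w @ p)                           # Eq. (21)
        q = Ca.T @ w                                 # Eq. (22)
        v = w.copy()                                 # Eq. (24)
        for j in range(a):
            v -= d[j] * (P[j] @ w) * V[j]
        Sa = Sa - da * np.outer(p, p)                # Eq. (25)
        Ca = Ca - da * np.outer(p, q)                # Eq. (26)
        P.append(p); Q.append(q); V.append(v); d.append(da)
    return np.column_stack(P), np.array(d), np.column_stack(Q), np.column_stack(V)

rng = np.random.default_rng(6)
X = rng.normal(size=(160, 8))
Y = X @ rng.normal(size=(8, 2)) + rng.normal(size=(160, 2))
S, C = X.T @ X, X.T @ Y

W = np.eye(8)                                        # identity weights: illustrative choice only
P, d, Q, V = decompose(S, C, W)

D = np.diag(d)
print(np.allclose(S, P @ D @ P.T))                       # full expansion of S
print(np.allclose(V @ D @ Q.T, np.linalg.solve(S, C)))   # B equals S^-1 X^T Y
```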

#### 6.3. Expansion of matrices


At a = K we get S<sup>K</sup> = 0. Expanding S and C we get

$$\mathbf{S} = d\_1 \mathbf{p}\_1 \mathbf{p}\_1^T + \dots + d\_A \mathbf{p}\_A \mathbf{p}\_A^T + \dots + d\_K \mathbf{p}\_K \mathbf{p}\_K^T = \mathbf{P} \mathbf{D} \mathbf{P}^T \tag{28}$$

$$\mathbf{C} = \mathbf{X}^T \mathbf{Y} = d\_1 \mathbf{p}\_1 \mathbf{q}\_1^T + \dots + d\_A \mathbf{p}\_A \mathbf{q}\_A^T + \dots + d\_K \mathbf{p}\_K \mathbf{q}\_K^T = \mathbf{P} \mathbf{D} \mathbf{Q}^T \tag{29}$$

Inserting appropriate matrices we get

$$\mathbf{S}^{-1} = d\_1 \mathbf{v}\_1 \mathbf{v}\_1^T + \dots + d\_A \mathbf{v}\_A \mathbf{v}\_A^T + \dots + d\_K \mathbf{v}\_K \mathbf{v}\_K^T = \mathbf{V} \mathbf{D} \mathbf{V}^T \tag{30}$$

$$\mathbf{B} = \mathbf{S}^{-1}X^TY = d\_1\mathbf{v}\_1\mathbf{q}\_1^T + \dots + d\_A\mathbf{v}\_A\mathbf{q}\_A^T + \dots + d\_K\mathbf{v}\_K\mathbf{q}\_K^T = \mathbf{V}\mathbf{D}\mathbf{Q}^T\tag{31}$$

$$\widehat{\boldsymbol{Y}}^{T}\widehat{\boldsymbol{Y}} = d\_{1}\boldsymbol{\mathfrak{q}}\_{1}\boldsymbol{\mathfrak{q}}\_{1}^{T} + \dots + d\_{A}\boldsymbol{\mathfrak{q}}\_{A}\boldsymbol{\mathfrak{q}}\_{A}^{T} + \dots + d\_{K}\boldsymbol{\mathfrak{q}}\_{K}\boldsymbol{\mathfrak{q}}\_{K}^{T} = \boldsymbol{\mathfrak{Q}}\boldsymbol{\mathfrak{D}}\boldsymbol{\mathfrak{Q}}^{T} \tag{32}$$

If S = X<sup>T</sup> X, we can expand X and Y in a similar way. Compute a score matrix T by T = XV. This gives

$$X = d\_1 \mathfrak{t}\_1 \mathfrak{p}\_1^T + \dots + d\_A \mathfrak{t}\_A \mathfrak{p}\_A^T + \dots + d\_K \mathfrak{t}\_K \mathfrak{p}\_K^T = \mathbf{T} \mathbf{D} \mathbf{P}^T \tag{33}$$

$$\widehat{\mathbf{Y}} = d\_1 \mathbf{t}\_1 \mathbf{q}\_1^T + \dots + d\_A \mathbf{t}\_A \mathbf{q}\_A^T + \dots + d\_K \mathbf{t}\_K \mathbf{q}\_K^T = \mathbf{T} \mathbf{D} \mathbf{Q}^T \tag{34}$$

The score vectors are mutually orthogonal. This follows from T<sup>T</sup>T = D<sup>–1</sup>. Generally, only A terms of the expansions are used, because it is verified that the modelling task cannot be improved beyond A terms. For the proof of the geometric properties of the vectors in these expansions, see reference [1].

It can be recommended to use a test set, (Xt, Yt), when carrying out a regression analysis. Centring (and scaling if used) is done on the test set by using the corresponding values from (X, Y). The estimated y-values are computed as Ŷ<sup>t</sup> = X<sup>t</sup> B and the score vectors for the test set as T<sup>t</sup> = X<sup>t</sup> V. It is often useful to study the plots, showing columns of Y against columns of T (in stepwise regression, the plots are called added variable plots). Similarly, we can plot the score vectors of the test set, the columns of Tt, against columns of Yt.
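A short sketch of the test-set handling described above: centring (and scaling, if used) of the test set uses the values from the calibration set, and the estimated y-values are computed as Ŷ<sub>t</sub> = X<sub>t</sub>B. The regression step here is plain least squares, used only as a placeholder for whichever method produced B; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(7)
X, Y = rng.normal(size=(136, 8)), rng.normal(size=(136, 2))
Xt, Yt = rng.normal(size=(24, 8)), rng.normal(size=(24, 2))

# Centre (and scale) the test set with the values from the calibration set (X, Y).
mx, sx = X.mean(0), X.std(0)
my = Y.mean(0)
Xc, Xtc = (X - mx) / sx, (Xt - mx) / sx
Yc = Y - my

B = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)   # placeholder regression coefficient matrix
Yt_hat = Xtc @ B + my                       # estimated y-values for the test set
Tt = Xtc                                    # columns of the (scaled) test set used for score plots
print(Yt_hat.shape)
```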

#### 6.4. Choices of weight vectors

The weight vectors for the six regression methods are as follows:

#### (i) Stepwise regression, maximize covariance:

$$w\_a(i\_0) = 1 \text{ for } |C\_{i\_0}| = \max\_{i=1}^{K} |C\_i|, \text{ and } = 0 \text{ otherwise} \tag{35}$$

#### (ii) Stepwise regression, maximize R<sup>2</sup> :

$$w\_a(i\_0) = 1 \text{ for } \frac{(\mathbf{y}^T \mathbf{x}\_{i\_0})^2}{(\mathbf{x}\_{i\_0}^T \mathbf{x}\_{i\_0})} = \max\_{i=1}^K \frac{(\mathbf{y}^T \mathbf{x}\_i)^2}{(\mathbf{x}\_i^T \mathbf{x}\_i)}, \text{ and } = 0 \text{ otherwise} \tag{36}$$

(iii) and (iv) Eigenvector of S:

$$\text{For } \mathbf{S} = \mathbf{U}E\mathbf{U}^{\top}, \; \mathbf{w}\_a = \mathbf{u}\_a \tag{37}$$

(v) PLS regression:

$$\mathbf{w}\_{a} = \mathbf{C}\_{a} / |\mathbf{C}\_{a}| \tag{38}$$

where C<sup>a</sup> is the reduced covariance.

#### (vi) H-method:

The weight vector of PLS regression is sorted with the largest element first, w<sup>(s)</sup><sub>1</sub>, w<sup>(s)</sup><sub>2</sub>, …, w<sup>(s)</sup><sub>K</sub>. The columns of X are rearranged to match this sorting, X<sup>(s)</sup>.

$$\mathbf{w}\_{a,m} = \left( w\_1^{(s)}, w\_2^{(s)}, \dots, w\_m^{(s)}, 0, \dots, 0 \right) \tag{39}$$

The index m is chosen, which gives the best explained variation,

$$\mathbf{t}\_{i} = \mathbf{X}^{(s)} \mathbf{w}\_{a,i}, \quad \frac{\left(\mathbf{y}^{T} \mathbf{t}\_{m}\right)^{2}}{\left(\mathbf{t}\_{m}^{T} \mathbf{t}\_{m}\right)} = \max\_{i=1}^{K} \frac{\left(\mathbf{y}^{T} \mathbf{t}\_{i}\right)^{2}}{\left(\mathbf{t}\_{i}^{T} \mathbf{t}\_{i}\right)} \tag{40}$$

For further details of this method, see reference [2]. It has been applied to different bio-assay studies, where there can be many variables (3000 or more). Comparisons with several other methods have shown that this method is preferable to work with in bio-assay studies, see references [3, 4]. Several other methods to improve the PLS solution have been developed.
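To indicate how the weight choices plug into the algorithm of Section 6.2, the sketch below implements choices (i) and (v) for a single response; the deflation loop is the one of Eqs. (20)–(26), the data are simulated, and the function names are our own.

```python
import numpy as np

def weight_max_cov(S, C):
    """Eq. (35): unit weight on the variable with the largest |covariance|."""
    w = np.zeros(len(C))
    w[np.argmax(np.abs(C))] = 1.0
    return w

def weight_pls(S, C):
    """Eq. (38): PLS weight, the reduced covariance scaled to unit length."""
    return C / np.linalg.norm(C)

def run(X, y, choose_weight, A):
    """A steps of the framework: weights -> loadings -> deflation of S and C."""
    S, C = X.T @ X, X.T @ y
    b = np.zeros(X.shape[1])
    V, P, d = [], [], []
    for a in range(A):
        w = choose_weight(S, C)
        p = S @ w
        da = 1.0 / (w @ p)
        q = C @ w                       # scalar y-loading (single response)
        v = w - sum(d[j] * (P[j] @ w) * V[j] for j in range(a))
        b = b + da * q * v              # accumulate B = V D Q^T, Eq. (31)
        S = S - da * np.outer(p, p)
        C = C - da * q * p
        V.append(v); P.append(p); d.append(da)
    return b

rng = np.random.default_rng(8)
X = rng.normal(size=(160, 40))
y = X @ np.concatenate([np.array([1.0, -0.5, 0.25]), np.zeros(37)]) + rng.normal(size=160)
Xc, yc = X - X.mean(0), y - y.mean()
for name, choice in [("stepwise max-cov", weight_max_cov), ("PLS", weight_pls)]:
    b = run(Xc, yc, choice, A=5)
    resid = yc - Xc @ b
    print(name, "fit R^2 =", round(1 - resid @ resid / (yc @ yc), 3))
```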

## 7. Model validation

#### 7.1. Numerical measures

Mallows' Cp value and Akaike's information measure are commonly presented as results in a regression analysis. Mallows' Cp value is given by

$$\mathbf{C}\_p = \frac{|\mathbf{y} - \hat{\mathbf{y}}\_A|^2}{\sigma^2} + 2A \mathbf{-} \mathbf{N} \tag{41}$$

As σ<sup>2</sup> we use the residual variance, s<sub>A</sub><sup>2</sup>, at the maximal number of steps in the algorithm. Cp is an estimate of '(total mean squared error)/σ<sup>2</sup>'. The interpretation of Cp is

i. it should be as small as possible

ii. its value should be as close to A as possible

iii. deviations of Cp from A suggest bias
Akaike's information measure is given by


$$AIC\_A = N(\log(s\_A^2) + 1) + 2(A+1) \tag{42}$$

It is an information measure that states the discrepancy between the correct model and the one obtained at step A. The number of components, A, is chosen as the one that gives the smallest value of Eq. (42).

Both Cp and AIC have the property that they are not dependent on the given linear model, [5, 6]. Therefore, they can be used for all six methods considered here.

A t-value for the significance of a regression coefficient is given by

$$t\text{-value} = \frac{(\mathbf{y}^T \mathbf{t}\_A)}{\mathbf{s}\_A} \tag{43}$$

Here t<sup>A</sup> is the Ath score vector of unit length. The significance, p-value, of the t-value can be computed using the t-distribution. Although the assumptions of a t-test are not valid for any of the six methods, it is useful to look at the significance. One can show that theoretically this value should be larger than 2 in order to be significant.

When comparing methods, it is useful to look at the estimate for the variance, Var(b), of the regression coefficients. We compute

$$(\text{trace}(Var(\mathbf{b}\_A)))^{1/2} = \left(s\_A^2 \sum\_{a=1}^A d\_a(\mathbf{v}\_a^T \mathbf{v}\_a)\right)^{1/2} \tag{44}$$

We cannot use this measure to determine the dimension. However, it is useful in comparing different methods.
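A sketch of how Cp and AIC (Eqs. (41) and (42)) are evaluated per dimension; plain least squares on the first A variables is used purely as a stand-in for the sequence of models produced by any of the six methods, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(9)
N, K = 136, 40
X = rng.normal(size=(N, K))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=N)
Xc, yc = X - X.mean(0), y - y.mean()

# Residual variance at the maximal number of steps, used as sigma^2 in Eq. (41).
e_full = yc - Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
s2_max = e_full @ e_full / (N - K)

for A in range(1, 11):                          # models of increasing dimension
    XA = Xc[:, :A]
    eA = yc - XA @ np.linalg.solve(XA.T @ XA, XA.T @ yc)
    s2_A = eA @ eA / (N - A)
    Cp = (eA @ eA) / s2_max + 2 * A - N         # Eq. (41)
    AIC = N * (np.log(s2_A) + 1) + 2 * (A + 1)  # Eq. (42)
    print(f"A={A:2d}  Cp={Cp:7.2f}  AIC={AIC:8.2f}")
```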

#### 7.2. The covariance

Eq. (15) is equivalent to the singular value decomposition of the covariance C. Therefore, it is suggested that the modelling of data should continue as long as the covariance is not zero. The dimension of a model can be determined by finding when C = X<sup>T</sup>y ≅ 0 for the reduced matrices. One procedure is to study the individual terms, (x<sub>i</sub><sup>T</sup>y). Assume that the data can be described by a multivariate normal distribution with a covariance matrix Σ<sub>xy</sub>. Then, it is shown in reference [7] that the sample covariances (x<sub>i</sub><sup>T</sup>y)/(N – 3) are approximately normally distributed. If σ<sub>xi,y</sub> = 0, it is shown in reference [7] that approximate 95% limits for the residual covariance, (x<sub>i</sub><sup>T</sup>y)/(N – 3), are given by

$$\pm 1.96\sqrt{N}\sigma\_{\text{xi}}\sigma\_{\text{y}}/(\text{N-3}) \text{ $\equiv$ }\pm 1.96\sqrt{N}\text{s}\_{\text{xi}}\text{s}\_{\text{y}}/(\text{N-3})\tag{45}$$

Thus, when modelling stops, it is required that all residual covariances should be within these limits. If σ<sub>xi,y</sub> = 0, the distribution of the residual covariance quickly approaches the normal distribution by the central limit theorem. Therefore, it is a reliable measure for judging whether the covariances have become zero or close to zero.

Another approach is to study the total value, y<sup>T</sup>XX<sup>T</sup>y = Σ<sub>i</sub>(x<sub>i</sub><sup>T</sup>y)<sup>2</sup>. If the covariance matrix Σ<sub>xy</sub> is zero, Σ<sub>xy</sub> = 0, the mean, µ = E{y<sup>T</sup>XX<sup>T</sup>y}, and the variance, σ<sup>2</sup> = Var{y<sup>T</sup>XX<sup>T</sup>y}, can be computed [8]. If the covariance is not zero, Σ<sub>xy</sub> ≠ 0, it can be shown that E{y<sup>T</sup>XX<sup>T</sup>y} > µ. The upper 95% limit of a normal distribution N(µ, σ<sup>2</sup>) is µ + 1.65σ; this is used with the computed mean and variance. In the analysis, it is checked whether y<sup>T</sup>XX<sup>T</sup>y is below the upper 95% limit (a one-sided test)

$$\mathbf{y}^T \mathbf{X} \mathbf{X}^T \mathbf{y} < \text{trace}(\mathbf{X}^T \mathbf{X})(\mathbf{y}^T \mathbf{y})/N + 1.65 \sqrt{2\,\text{trace}(\mathbf{X}^T \mathbf{X} \mathbf{X}^T \mathbf{X})(\mathbf{y}^T \mathbf{y})^2/N^2} \tag{46}$$

When this inequality is satisfied, there is an indication that modelling should stop.

In the analysis in Section 8, we use the p-value,

$$p\text{-value} = P\{\mathbf{y}^T X \mathbf{X}^T \mathbf{y} > \mu | N(\mu, \sigma^2) \}\tag{47}$$

This analysis has been found useful for different types of regression analysis.
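The stopping check of Eq. (46) can be coded directly; in the sketch below (simulated data), a response unrelated to X typically falls below the 95% limit, whereas a related response does not.

```python
import numpy as np

def covariance_stop(X, y):
    """Stopping check of Eq. (46): is y' X X' y below the upper 95% limit under Sigma_xy = 0?"""
    N = len(y)
    total = y @ X @ X.T @ y                      # observed value
    XtX = X.T @ X
    mu = np.trace(XtX) * (y @ y) / N             # mean under zero covariance
    sigma = np.sqrt(2 * np.trace(XtX @ XtX) * (y @ y) ** 2 / N ** 2)
    return total, mu + 1.65 * sigma

rng = np.random.default_rng(10)
N = 160
X = rng.normal(size=(N, 40))
y_noise = rng.normal(size=N)                     # unrelated to X: modelling should stop
y_signal = X[:, 0] + 0.5 * rng.normal(size=N)    # related to X: continue modelling

for name, y in [("noise", y_noise), ("signal", y_signal)]:
    total, limit = covariance_stop(X - X.mean(0), y - y.mean())
    print(name, "stop" if total < limit else "continue")
```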

#### 7.3. Cross-validation and test sets

In stepwise regression, a search is carried out to find the next best variable. In PLS regression, a weight vector w is determined that gives the maximal size of the y-loading vector q. When search or optimization procedures are carried out, there is a considerable risk of overfitting. Evaluation of the results by standard statistical significance tests may show high significance, while other measures may show that the results are not significant. The uncertainties of predictions for new samples depend on the location of the samples in question: the further away from the centre (the average values), the larger the uncertainties are. On the other hand, it is important to have large values, both large x-samples and large y-values, in order to obtain stable estimates. In chemometrics, these special features of prediction are well known. A 10-fold cross-validation is often used. The samples are divided randomly into 10 groups; nine groups are used for calibration (modelling) and the results are applied to the 10th group. This is repeated for all groups, so that all y-values are computed by a model, y<sub>c</sub>, that uses 90% of the samples. Experience has shown that this procedure may not always function well. As an example, one can mention the case where the y-values have a very skewed distribution: there can be a big difference in the results from cross-validation depending on how well large y-values are represented in the groups. It may be necessary for each group to have a similar 'profile' to the total set of samples. For the present data, the y-values show a very skewed distribution (log(log(y-values)) follows a normal distribution), so it may be necessary that each group is representative of the whole dataset. This can be achieved by ordered cross-validation. Here, the samples are ordered in some way that reflects the variation or sizes in the data. In a 10-fold ordered cross-validation, the first group consists of samples number 1, 11, 21, etc. of the ordered samples; the second group of samples number 2, 12, 22, etc. In this way, each of the 10 groups is representative with respect to the chosen ordering. The cross-validation is then carried out in the usual way.

As a result of cross-validation, we compute

$$D\_a = 1 - |\mathbf{y} - \mathbf{y}\_c|^2 / |\mathbf{y}|^2 \text{ for } a = 1, 2, \dots, A \tag{48}$$

D<sub>a</sub> is computed for each dimension. In the analysis below, sorted y-values are used in the ordered cross-validation. A test set for the analysis is also selected in a similar way: samples are ordered according to the first PLS score vector, and 15% of the samples, or 160 × 0.15 = 24 samples, are used in the test set (X<sup>t</sup>, y<sup>t</sup>). Thus, the analysis in Section 8 is based on 136 samples. Equation (48) is also used when applying results to a test set (y = y<sup>t</sup> and y<sup>c</sup> = X<sup>t</sup>b).
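A sketch of the ordered cross-validation grouping described above (simulated, strongly skewed y-values): after sorting, samples 1, 11, 21, … form the first group, samples 2, 12, 22, … the second, and so on, so every group spans essentially the whole range of the response.

```python
import numpy as np

def ordered_cv_groups(order_values, n_groups=10):
    """Ordered cross-validation: sort samples, then deal them into groups 1, 11, 21, ... etc."""
    order = np.argsort(order_values)             # e.g. sort by y-values or by the first PLS score
    groups = np.empty(len(order_values), dtype=int)
    groups[order] = np.arange(len(order_values)) % n_groups
    return groups

rng = np.random.default_rng(11)
y = np.exp(np.exp(rng.normal(size=160)))         # strongly skewed response, as for the present data
groups = ordered_cv_groups(y)

# Each group now covers essentially the whole range of y, so every fold is 'representative'.
for g in range(3):
    print(g, round(y[groups == g].min(), 2), round(y[groups == g].max(), 2))
```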

## 8. Results for six different methods

#### 8.1. Preliminary remarks


We have chosen here data that are challenging to work with. They are representative of many situations that industry and research projects find themselves in when working with optical instruments. It is expected that a large dimension is needed to model the data. The aim of the present work is to show how commonly used measures in applied statistics and popular chemometric procedures fit within the present framework. Therefore, we have chosen data where there is not a big difference between the six methods considered here. We show that the different measures/procedures do not agree on what dimension should be used. Thus, the data analyst has to choose one set of measures, a choice that will typically be based on experience. It is part of the industry standard, see reference [9], for developing a calibration method. When the development work is finished, a blind test is carried out for 40 new samples. The data analyst should not know the 40 new reference values (y-values); the estimates for the 40 new samples should be evaluated by another person. This procedure is used here: the last 40 samples are put aside and the calibration analysis is based on the other samples. When the analysis has been completed for the six methods, we apply the results to the 40 samples. Optical instruments may give thousands of values for each measurement that is carried out (the one used here gives 1200 values). Experts often point out which part of the data should be used. Here it has been proposed to work with 40 variables, which all show significant correlation with the y-values. An upper limit for the dimension is chosen here to be 20. There is no numerical problem in computing the precision matrix: the ratio between the largest and the smallest eigenvalue of S is λ<sub>1</sub>/λ<sub>40</sub> = 9.0 × 10<sup>6</sup>. However, experience suggests that the dimension should be less than 20.

#### 8.2. Modelling results for six methods

In order to make reading the following tables easier, we use the following headings for the tables.


viii. ordered cross-validation, samples ordered by y-values, Eq. (48)

ix. test set results, samples ordered by values of the first PLS score vector (Eq. (48)).

The improvement in fit, size of score vectors, etc. are not shown in the tables below. Only dimensions from 10 to 20 are shown. Dimensions from 1 to 9 are all significant for all methods.

#### 8.2.1. Forward stepwise regression, maximal covariance

Here a variable is selected at each step that has maximal covariance |(x<sub>i</sub><sup>T</sup>y)| for the reduced data (Table 1).

The 15th variable selection has the smallest AIC value and highly significant t-value. Therefore, it seems reasonable to choose 15 variables. Cross-validation and test set have reached the high level at dimension 10 or 11. This indicates some overfitting by choosing 15 variables.

#### 8.2.2. Forward stepwise regression, maximal R<sup>2</sup>-value

Here a variable is selected at each step that has maximal size of (x<sub>i</sub><sup>T</sup>y)<sup>2</sup>/(x<sub>i</sub><sup>T</sup>x<sub>i</sub>) for the reduced data (Table 2).

It seems appropriate to choose 16 variables. The 17th is at the boundary of also being significant. Cp for 16 variables indicates some bias present. Cross-validation and test set have reached high values at step 10, which indicates some overfitting.

#### 8.2.3. Principal component regression

Here score vectors are selected that correspond to the associated eigenvalues (Table 3).

Score vector no 14 is not significant, and perhaps should be excluded. It seems appropriate to choose dimension 17 here. It is common to remove score vectors that are not significant.


| No. | (i) | (ii) | (iii) | (iv) | (v) | (vi) | (vii) | (viii) | (ix) |
|-----|------|---------|-------|--------|--------|-------|------|--------|--------|
| 10 | 90.5 | –1571.9 | 12.65 | 0.0000 | 0.0518 | 0.000 | 0.23 | 0.9577 | 0.9761 |
| 11 | 88.0 | –1572.2 | –1.78 | 0.0765 | 0.0186 | 0.033 | 0.24 | 0.9609 | 0.9747 |
| 12 | 81.4 | –1575.7 | 2.52 | 0.0127 | 0.0121 | 0.000 | 0.27 | 0.9640 | 0.9744 |
| 13 | 62.1 | –1590.0 | 4.10 | 0.0001 | 0.0111 | 0.000 | 0.31 | 0.9631 | 0.9760 |
| 14 | 29.9 | –1618.3 | –5.61 | 0.0000 | 0.0070 | 0.000 | 0.43 | 0.9626 | 0.9804 |
| 15 | 18.3 | –1629.1 | 3.64 | 0.0003 | 0.0053 | 0.002 | 0.46 | 0.9658 | 0.9786 |
| 16 | 19.1 | –1627.3 | –1.09 | 0.2772 | 0.0022 | 0.617 | 0.50 | 0.9662 | 0.9797 |
| 17 | 18.4 | –1627.1 | 1.63 | 0.1045 | 0.0017 | 0.664 | 0.61 | 0.9663 | 0.9804 |
| 18 | 20.1 | –1624.4 | –0.60 | 0.5525 | 0.0012 | 0.744 | 0.64 | 0.9657 | 0.9804 |
| 19 | 17.4 | –1626.5 | 2.18 | 0.0307 | 0.0013 | 0.462 | 0.78 | 0.9652 | 0.9813 |
| 20 | 19.0 | –1623.8 | –0.59 | 0.5539 | 0.0006 | 0.806 | 0.97 | 0.9656 | 0.9811 |

Table 1. Nine measures at stepwise selection of variables having maximal covariance.



Table 2. Nine measures at stepwise selection of variables having maximal increase in R<sup>2</sup> .


Table 3. Nine measures at principal component regression.

However, it may not always be a good practice. The t-test is not a valid test and the score vectors beyond dimension 20 are so small that they have no practical importance.

#### 8.2.4. Ridge regression


Here score vectors are selected that correspond to the associated eigenvalues of S = X<sup>T</sup> X + kI. The value of k is estimated by leave-one-out cross-validation, see reference [1]. The optimal value of k is k = 0.0002 (Table 4).

It seems also appropriate here to choose dimension 17. The value of k is based on the full model. However, there is no practical difference between dimension 17 and a full model.

#### 8.2.5. PLS regression

When working with PLS regression, each step is analysed more closely. The score vectors associated with the test set are computed as T<sup>t</sup> = X<sup>t</sup>V. The correlation coefficient between the response values of the test set, Y<sup>t</sup>, and the 12th score vector is –0.058 and the associated p-value is 0.197. Therefore, we can conclude that the 12th score vector does not contribute to the modelling task. We can thus conclude that the dimension should be 11 (Table 5).

#### 8.2.6. H-method, maximal R<sup>2</sup> value

Here, it can be recommended to use 13 components. When 13 have been selected, there is no covariance left (see steps (v) and (vi)) (Table 6).


| No. | (i) | (ii) | (iii) | (iv) | (v) | (vi) | (vii) | (viii) | (ix) |
|-----|-------|---------|--------|--------|--------|-------|------|--------|--------|
| 10 | 782.7 | –1250.6 | –2.39 | 0.0176 | 0.0586 | 0.000 | 0.15 | 0.7984 | 0.9176 |
| 11 | 410.6 | –1347.7 | –11.03 | 0.0000 | 0.0555 | 0.000 | 0.16 | 0.8722 | 0.9520 |
| 12 | 382.1 | –1355.4 | 3.22 | 0.0015 | 0.0322 | 0.000 | 0.20 | 0.8783 | 0.9497 |
| 13 | 151.7 | –1461.5 | –11.59 | 0.0000 | 0.0302 | 0.000 | 0.19 | 0.9334 | 0.9681 |
| 14 | 153.3 | –1458.6 | –0.48 | 0.6345 | 0.0108 | 0.000 | 0.23 | 0.9292 | 0.9673 |
| 15 | 85.8 | –1504.0 | –7.10 | 0.0000 | 0.0108 | 0.000 | 0.28 | 0.9421 | 0.9717 |
| 16 | 57.9 | –1525.8 | 4.93 | 0.0000 | 0.0071 | 0.000 | 0.32 | 0.9508 | 0.9731 |
| 17 | 23.7 | –1557.7 | –5.90 | 0.0000 | 0.0053 | 0.000 | 0.35 | 0.9566 | 0.9734 |
| 18 | 25.6 | –1554.8 | 0.34 | 0.7345 | 0.0023 | 0.000 | 0.43 | 0.9563 | 0.9727 |
| 19 | 27.5 | –1551.7 | 0.23 | 0.8168 | 0.0023 | 0.000 | 0.53 | 0.9553 | 0.9728 |
| 20 | 19.0 | –1560.0 | 3.24 | 0.0014 | 0.0023 | 0.000 | 0.65 | 0.9579 | 0.9744 |

Table 4. Nine measures at ridge regression.


| No. | (i) | (ii) | (iii) | (iv) | (v) | (vi) | (vii) | (viii) | (ix) |
|-----|-------|---------|------|-------|--------|-------|------|--------|--------|
| 10 | 207.0 | –1514.5 | 7.14 | 0.000 | 0.0296 | 0.000 | 0.11 | 0.9426 | 0.9738 |
| 11 | 137.0 | –1553.0 | 6.59 | 0.000 | 0.0116 | 0.000 | 0.14 | 0.9498 | 0.9735 |
| 12 | 107.8 | –1570.9 | 4.55 | 0.000 | 0.0100 | 0.000 | 0.17 | 0.9568 | 0.9725 |
| 13 | 67.3 | –1600.5 | 5.75 | 0.000 | 0.0101 | 0.000 | 0.20 | 0.9587 | 0.9744 |
| 14 | 58.0 | –1607.1 | 3.02 | 0.003 | 0.0024 | 0.563 | 0.26 | 0.9628 | 0.9754 |
| 15 | 39.7 | –1622.5 | 4.22 | 0.000 | 0.0040 | 0.000 | 0.35 | 0.9656 | 0.9762 |
| 16 | 26.6 | –1634.4 | 3.78 | 0.000 | 0.0024 | 0.088 | 0.43 | 0.9654 | 0.9775 |
| 17 | 22.4 | –1637.7 | 2.43 | 0.016 | 0.0023 | 0.013 | 0.47 | 0.9671 | 0.9804 |
| 18 | 23.5 | –1635.6 | 0.96 | 0.340 | 0.0005 | 0.849 | 0.52 | 0.9673 | 0.9804 |
| 19 | 21.1 | –1637.3 | 2.08 | 0.039 | 0.0004 | 0.822 | 0.85 | 0.9679 | 0.9826 |
| 20 | 19.0 | –1638.7 | 2.02 | 0.045 | 0.0006 | 0.641 | 1.12 | 0.9674 | 0.9816 |

Table 5. Nine measures at PLS regression.



Table 6. Nine measures at H-method determining maximal R<sup>2</sup> value along the covariance.


Table 7. Comparison of results from six methods.


#### 8.3. Evaluation of modelling results

When the modelling has been carried out, industry standards recommend applying the model to 40 new samples, (X<sup>n</sup>, y<sup>n</sup>). The estimated values are ŷ<sup>n</sup> = X<sup>n</sup>b, where X<sup>n</sup> are the new X-values and b is the vector of regression coefficients from the method in question. Results are shown in Table 7.

From Table 7, it can be seen that PLS regression is slightly better than the other methods. The table also shows the estimate of the standard deviations of the regression coefficients; it shows that PLS regression has a much smaller value than the other methods. In conclusion, PLS regression can be recommended for determining y-values for future samples.

Note that the values in the first row of Table 7 are smaller than those obtained for cross-validation and test sets during the calibration analysis. This indicates that the new 40 samples deviate in some way from the 160 samples that were used in the analysis. This is not explored further here.

#### 8.4. Confidence interval for regression coefficients

Approximate confidence intervals for regression coefficients can be obtained by the procedure presented in reference [10]. Let ei be the ith residual obtained by a regression method. Define

$$h\_i = \mathbf{x}^i \mathbf{S}\_A^{-1} \mathbf{x}^{iT} \text{ and } r\_i = \frac{\mathcal{e}\_i}{1 - h\_i}, \text{ } i = 1, 2, \dots, N. \tag{49}$$

A new set of residuals, ẽ<sub>i</sub>, is defined by randomly sampling from the modified residuals r<sub>1</sub>, …, r<sub>N</sub>; from these a new response vector, y* = Xb + ẽ, and hence a new set of regression coefficients are obtained. This can be repeated, say, 200 times to get a confidence interval for the regression coefficients.
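A sketch of this residual bootstrap under stated assumptions: ordinary least squares is used in place of the rank-A solution S<sub>A</sub> of the chosen method, and the data and names are illustrative.

```python
import numpy as np

def bootstrap_ci(X, y, n_boot=200, alpha=0.05, seed=None):
    """Residual bootstrap in the spirit of Eq. (49): approximate CIs for the coefficients."""
    rng = np.random.default_rng(seed)
    N, K = X.shape
    S_inv = np.linalg.inv(X.T @ X)
    b = S_inv @ X.T @ y
    e = y - X @ b
    h = np.einsum('ij,jk,ik->i', X, S_inv, X)      # leverages h_i = x_i S^-1 x_i^T
    r = e / (1 - h)                                # modified residuals of Eq. (49)
    bs = np.empty((n_boot, K))
    for m in range(n_boot):
        e_star = rng.choice(r, size=N, replace=True)  # resampled residuals
        y_star = X @ b + e_star                       # new response vector
        bs[m] = S_inv @ X.T @ y_star                  # new regression coefficients
    lo, hi = np.percentile(bs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return b, lo, hi

rng = np.random.default_rng(12)
X = rng.normal(size=(136, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.8, 0.2]) + rng.normal(scale=0.3, size=136)
b, lo, hi = bootstrap_ci(X, y, seed=12)
print(np.round(b, 3), np.round(lo, 3), np.round(hi, 3))
```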

## 9. Discussion

Interpretation of Tables 1–6 reflects personal experience. Others may select the dimension in a different way. The main issue is that the dimension for all the selected methods is less than 20. If a full model including all 40 variables is estimated, we are overfitting by a dimension of over 20: the regression coefficients become large and inference from the estimation is not reliable.

The advantage of working with latent variables is that the first ones collect a large amount of the variation. This is well known from principal component analysis. Interpretation of the variables cannot be done directly in latent variable models. However, using Eqs. (17) and (18) we can set up equations that show how individual variables contribute to the fit and the regression coefficients.

## 10. Applications of the H-principle

The H-principle has been extended to many areas of applied mathematics. It has in general been successful due to the presence of latent structure in data. The implementation of the H-principle is different from area to area. Furthermore, some 'household' administration may be needed in order to keep the number of variables low (cf. Mallows' theory). In many cases, it has opened up new mathematics.

In reference [11], it has been extended to multi-block and path modelling, giving new methods for modelling organized data blocks. The importance of these methods lies in the fact that regression analysis is extended so that regression models are computed between data blocks, and the methods of regression analysis can then be used to evaluate relationships between data blocks. Complicated assumptions like those in structural equation modelling are not needed.

Linear latent structure regression can be viewed as determining a low-dimensional hyperplane in a high-dimensional space. In reference [12], it is extended to finding low-dimensional second, third and higher order surfaces in latent variables. Deviations from linearity often appear as curvature for low and high sample values, which can be handled by these surfaces. In reference [13], it has been applied to non-linear estimation that may give good low-rank solutions, where full-rank regularized solutions do not give convergence.

In reference [14], it is extended to multi-linear algebra, where there are many indices in data, e.g. X = (xijk) and y = (yij). The basic issue in multi-linear algebra is defining the inverse. This is solved by defining directional inverses for each dimension. It makes it possible to extend methods of ordinary matrix analysis to multi-linear algebra. These methods have been successfully applied to multi-linear data and to growth models.

The H-principle has been applied to several areas of applied statistics. Here we briefly mention a few.

In time series analysis and dynamic systems, the objective of modelling is both to describe the data and to obtain good forecasts. Thus, the requirement on a latent variable is that it both describes X and gives good forecasts. Traditional models focus only on the description of X. By requiring that the latent variables also give good forecasts, better models are obtained.

In pattern recognition and classification, the objective of modelling is both to obtain a good description of each group of data (detected or given a priori) and to achieve a low error rate of classification. Different ways of implementing the H-principle in these areas have been developed. Applications show that these methods are superior to those based on statistical theory (e.g. linear and quadratic discriminant analysis based on the normal distribution, principal component analysis of each group of data).

## 11. Conclusion


We have presented a short review of standard regression analysis. Theoretically, it has important properties, which make it a standard approach in popular program packages. However, we show that the results obtained may not always be reliable when there is a latent structure in data. This is a serious problem in industry and applied sciences, because data typically have latent structure.

The H-principle is formulated in close analogy to the Heisenberg uncertainty inequality. It suggests that in the case of uncertain data, the computation of the mathematical solution should be carried out in steps, where at each step an optimal balance between the fit and the associated precision is obtained. A general framework is presented for linear regression. Any set of weight vectors that does not give loading vectors of zero size can be used. Using the framework, six different regression methods are carried out. It is shown that the methods give low-rank solutions. The same type of numerical and graphic analysis can be carried out for any type of regression analysis within this framework. Traditional analysis in applied statistics, and the graphic analysis popular in chemometrics, can be carried out for each method that uses the framework.

The algorithm can be viewed as an approximation to the full-rank solution. Modelling stops if further steps are not supported by data. Dimension measures, cross-validation and test sets are used to identify when the steps are supported by data.

## Acknowledgements

The cooperation with Clinical Biochemistry, Holbæk Hospital, Denmark, in the Sime project is highly appreciated. The author appreciates the use of the data from the project in the present article.

## Author details

Agnar Höskuldsson

Address all correspondence to: ah@agnarh.dk

Centre for Advanced Data Analysis, Denmark

## References


[1] Höskuldsson A. A common framework for linear regression. Chemometrics and Intelligent Laboratory Systems 146 (2015) 250–262. DOI: 10.1016/j.chemolab.2015.05.022

[2] Reinikainen SP, Höskuldsson A. COVPROC method: strategy in modeling dynamic systems. Journal of Chemometrics 17 (2003) 130–139. DOI: 10.1002/cem.770

[3] McLeod G, et al. A comparison of variate pre-selection methods for use in partial least squares regression: a case study on NIR spectroscopy applied to monitoring beer fermentation. Journal of Food Engineering 90 (2009) 300–307. DOI: 10.1016/j.jfoodeng.2008.06.037

[4] Tapp HS, et al. Evaluation of multiple variate methods from a biological perspective: a nutrigenomics case study. Genes Nutrition 7 (2012) 387–397. DOI: 10.1007/s12263-012-0288-4

[5] https://en.wikipedia.org/wiki/Mallows's\_Cp

[6] https://en.wikipedia.org/wiki/Akaike\_information\_criterion

[7] Siotani M, Hayakawa T, Fujikoshi Y. Modern Multivariate Analysis: A Graduate Course and Handbook. American Science Press: Columbus, Ohio, 1985.

[8] Höskuldsson A. Prediction Methods in Science and Technology, Vol. 1. Thor Publishing: Copenhagen, 1996. ISBN 87-985941-0-9.

[9] Clinical and Laboratory Standards Institute, http://shop.clsi.org/chemistry-documents/

[10] Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press: Cambridge, New York, 1997.

[11] Höskuldsson A. Modelling procedures for directed network of data blocks. Chemometrics and Intelligent Laboratory Systems 97 (2009) 3–10. DOI: 10.1016/j.chemolab.2008.09.002

[12] Höskuldsson A. The Heisenberg modelling procedure and applications to nonlinear modelling. Chemometrics and Intelligent Laboratory Systems 44 (1998) 15–30. DOI: 10.1016/S0169-7439(98)00111-7

[13] Höskuldsson A. H-methods in applied sciences. Journal of Chemometrics 22 (2008) 150–177. DOI: 10.1002/cem.1131

[14] Höskuldsson A. Data analysis, matrix decompositions and generalized inverse. SIAM Journal on Scientific Computing 15 (1994) 239–262. DOI: 10.1137/0915018

#### **Signal Optimal Smoothing by Means of Spectral Analysis**

Guido Travaglini

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66150

## **Abstract**

This chapter introduces two new empirical methods for obtaining optimal smoothing of noise-ridden stationary and nonstationary, linear and nonlinear signals. Both methods utilize an application of the spectral representation theorem (SRT) for signal decomposition that exploits the dynamic properties of optimal control. The methods, named SRT1 and SRT2, produce a low-resolution and a high-resolution filter, which may be utilized for optimal long- and short-run tracking as well as for forecasting. Monte Carlo simulation applied to three broad classes of signals enables comparing the dual SRT methods with a similarly optimized version of the well-known and reputed empirical Hilbert-Huang transform (HHT). The results point to a more satisfactory performance of the SRT methods, especially the second, in terms of low and high resolution as compared to the HHT for each of the three signal classes, in many cases also for nonlinear and stationary/nonstationary signals. Finally, all three methods undergo statistical experimenting on eight select real-time data sets, which include climatic, seismological, economic and solar time series.

**Keywords:** signal analysis, white and colored noise, Monte Carlo simulation, time series properties, smoothing techniques

## **1. Introduction**

The literature on time series smoothing and denoising techniques is vast and encompasses different disciplines, such as chemometrics, econometrics, seismology, signal analysis and many more. Among the most renowned and widely utilized methods are the Savitzky-Golay and Hodrick-Prescott filters, the Hilbert-Huang transform (HHT), wavelet analysis, as well as the ample class of kernel filters [1–6].


All of these techniques share an ample spectrum of degrees of resolution loss of the original raw signal, namely, a large family of filters addressed at separating from noise the stochastic or broken trend underlying the observed signal. The researcher is thus enabled to select the desired denoising frequency visually and/or by means of some prior information, like threshold selection, rescaling and data-dependent smoothing weights. Automatic selection procedures are available in the context of Savitzky-Golay filtering developed in the field of chemometrics [7] and quite a few in the field of image processing (e.g. [8]). Such procedures are utilized also for the stoppage criteria embedded in the HHT but are virtually absent in the field of econometrics [3, 8–11].

In general, however, the capability of balancing high- with low-resolution characteristics in a world of different real-time data (RTD) patterns is oftentimes questionable. In fact, in spite of their efficacy, all of these models share a common shortcoming by letting the researcher arbitrarily—and thus casually—select the degree of resolution, thereby risking over- or underfitting of the slow mode with respect to the original signal. The immediate consequence is suboptimal denoising, namely, a signal extraction yielding a smoother that is statistically inconsistent with the original peak/trough pattern or that too closely replicates the original signal. For this purpose, automated and optimizing parameter selection must be utilized, based on the minimum squared distance between the actual and the smoothed series [12].

The goal of this chapter exactly addresses this problem, which may be defined as the search for statistical dynamic efficiency of the estimated smoother. This search, in practice, ends up with extracting from the available data set the optimal smoother, namely, the stochastic or broken trend, which minimizes the noise variance among many second-best solutions. Section 2 introduces time series decomposition and the statistical taxonomy of stochastic noise. Sections 3 and 4 are devoted to an introduction and detailed description, respectively, of the HHT and of the SRT models for signal smoothing. Section 5 produces comparative efficiency results of the two new techniques with respect to HHT by using Monte Carlo simulations and reports descriptive statistics of some select climatic, seismological, economic and solar real-time series. Section 6 concludes, while the Appendix produces the sources of the real-time signals utilized.

## **2. Signal decomposition and statistical taxonomy, linearity and stationarity testing**

Any observed signal, by means of additive decomposition [3, 13], may be modeled as the sum of three different random variables of unknown distribution such that

$$Y_t = y_t^* + s_t + y_t \tag{1}$$

where $Y_t$ is the time sequence of the real-valued observations for the discrete-time notation $t \in [1, T]$, $T < \infty$. Moreover, $y_t^* = \mathrm{E}(Y_t \mid \Omega_{t-j})$, where $\Omega_{t-j} = \{Y_{t-j}\}_j^J$ is the information set available at $t - j$, $j \in [1, J]$, for $J \le T$ a maximum preselected lag [14, 15]. The first term of Eq. (1) is the slow-mode or growth component in the form of a broken trend [16] or of a smoother [3]. The second component $s_t$ is the periodical seasonal cycle, if any, and the last component $y_t$ is the fast-mode component, where $y_t \sim \mathrm{IID}(0, \sigma_y^2)$.

The signal may be linear or nonlinear, as well as stationary or nonstationary, depending on its components and on their distributional properties. Briefly, nonlinearity is a feature of a signal characterized by the presence of linear as well as quadratic and/or cubic variables and/or multiplicative variables, rendering the signal typically nonparametric and thus unsuitable for standard statistical testing such as variance analysis, prediction and stationarity. In addition, a linear or nonlinear signal is stationary (nonstationary) if its best linear trend fit is significantly flat (rising/falling over time). Nonlinearity and nonstationarity of the signal [17] can be tested by means of appropriate procedures [18–21]. The resulting number of regime switches, when applicable, is determined by computing single or multiple intercept and/or trend time breaks [22–24]. For details on the corresponding critical-value statistics, the reader is redirected to the respective authors.
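As an illustration only, the simplest part of this screening (trend significance and stationarity) can be sketched with standard tools; the snippet below uses the generic `scipy`/`statsmodels` augmented Dickey-Fuller test as a stand-in for the specific procedures cited in [18–24], and the function name `screen_signal` is ours.

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.stattools import adfuller

def screen_signal(y, alpha=0.05):
    """Rough screening of a signal: trend significance and ADF stationarity.

    A stand-in for the linearity/stationarity and break tests cited in the
    text; it covers only the two simplest checks."""
    t = np.arange(len(y))

    # Best linear trend fit: is the slope significantly different from zero?
    slope, _, _, p_slope, _ = stats.linregress(t, y)
    trended = p_slope < alpha

    # Augmented Dickey-Fuller test: small p-value -> reject a unit root.
    adf_stat, p_adf, *_ = adfuller(y, autolag="AIC")
    stationary = p_adf < alpha

    return {"slope": slope, "trended": trended,
            "adf_stat": adf_stat, "stationary": stationary}
```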

All components of the observed signal are stochastic and unobservable and must be retrieved from the raw data set by means of appropriate computational methods, some of which were invoked in the Introduction. In particular, the slow-mode term requires the use of some filtering procedure to possibly attain an optimal low-resolution time-varying trend, in fact the smoother (Section 3). In addition, if $s_t$ is a high- or low-resolution phenomenon, it may be entrenched into either $y_t$ or $y_t^*$ and thus difficult to disentangle from either component.

Any observed signal in Eq. (1) is shown to pertain to one of the following three statistical taxonomic classes: (1) a random Gaussian white noise (RGS), (2) a colored (pink/red) random noise (RWS) and (3) a valid‐threshold regime‐switching Gaussian mixture constructed as a casually ordered sequence of two or more of the previous signals (RGWS).

Formally, the first kind of signal (RGS) may be expressed as follows:

$$Y\_t = c + \varepsilon\_t + \delta t \tag{2}$$

where the constant term, which corresponds to the mean of the noise, and the time-slope coefficient are, respectively, $c, \delta \in (-\infty, \infty)$, and $\{\varepsilon_t\}_{t=1}^{T}$ is a real-valued univariate white-noise data sequence such that $\varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2)$ and $\sigma_\varepsilon^2 \ll \infty$ is a constant. In Eq. (2), if $\delta = 0$, the noise is trend-stationary and the process is a pure RGS with constant mean and variance and expected zero autocovariance. Instead, if significantly $\delta \ne 0$, the noise is trend-nonstationary and exhibits a rising/falling trend line with white noise superimposed. This result may be substituted into Eq. (1) such that the slow mode, and with it the entire signal, would be significantly trended.

The second kind of signal (RWS) is

All of these techniques share an ample spectrum of degrees of resolution loss of the original raw signal, namely, a large family of filters addressed at separating from noise the stochas‐ tic or broken trend underlying the observed signal. The researcher is thus enabled to select the desired denoising frequency visually and/or by means of some prior information, like threshold selection, rescaling and data‐dependent smoothing weights. Automatic selection procedures are available in the context of Savitzky‐Golay filtering developed in the field of chemometrics [7] and quite a few in the field of image processing (e.g. [8]). Such procedures are utilized also for the stoppage criteria embedded in the HHT but are virtually absent in the

In general, however, the capability of balancing high‐ with low‐resolution characteristics in a world of different real‐time data (RTD) patterns is oftentimes questionable. In fact, in spite of their efficacy, all of these models contribute a common shortcoming by letting the researcher arbitrarily—and thus casually—select the degree of resolution, thereby risking over/under fitting of the slow mode with respect to the original signal. The immediate consequence is suboptimal denoising, namely, a signal extraction yielding a smoother that is statistically inconsistent with the original peak/trough pattern or that too closely replicates the original signal. For this purpose, automated and optimizing parameter selection must be utilized, based on the minimum squared distance between the actual and the smoothed series [12].

The goal of this chapter exactly addresses this problem, which may be defined as the search for statistical dynamic efficiency of the estimated smoother. This search, in practice, ends up with extracting from the available data set the optimal smoother, namely, the stochastic or broken trend, which minimizes the noise variance among many second‐best solutions. Section 2 introduces time series decomposition and the statistical taxonomy of stochastic noise. Sections 3 and 4 are devoted to an introduction and detailed description, respectively, of the HHT and of the SRT models for signal smoothing. Section 5 produces comparative efficiency results of the two new techniques with respect to HHT by using Monte Carlo simu‐ lations and reports descriptive statistics of some select climatic, seismological, economic and solar real‐time series. Section 6 concludes, while the Appendix produces the sources of the

**2. Signal decomposition and statistical taxonomy, linearity and** 

of three different random variables of unknown distribution such that

*Yt* = *yt*

*t* ∈ [1, *T*], *T* < ∞. Moreover, *yt*

Any observed signal, by means of additive decomposition [3, 13], may be modeled as the sum

available at *t* − *j*, *j* ∈ [1, *J*] for *J* ≤ *T* a maximum preselect lag [14, 15]. The first term of Eq. (1)

is the time sequence of the real‐valued observations for the discrete‐time notation

\* <sup>=</sup> <sup>E</sup>(*Yt*<sup>|</sup> *<sup>Ω</sup><sup>t</sup>*−*<sup>j</sup>*) where *Ω<sup>t</sup>*−*<sup>j</sup>* <sup>=</sup> {*Yt*−*<sup>j</sup>*}*<sup>j</sup>*

\* + *st* + *yt* (1)

*J*

is the information set

field of econometrics [3, 8–11].

76 Advances in Statistical Methodologies and Their Application to Real Problems

real‐time signals utilized.

**stationarity testing**

where *Yt*

$$Y\_t = c + \sum\_{l=1}^{T} \varepsilon\_l + \delta t \tag{3}$$

where the coefficients share the same characteristics as those in Eq. (2), but even if $\delta = 0$ the noise is trend-nonstationary and both the variance and the autocovariance are not constant over time. In consequence, the process, which is an additive white noise, is entirely time dependent in both mean and variance, and this affects the pattern of the signal and its components over time.

The third kind of signal (RGWS) is represented by the Markov regime‐switching model [25– 31], where the raw signal may undergo significant structural changes across its lifetime due to variability in its underlying probability transition matrix. Therefore, the signal, which is a combination of Eqs. (2) and (3), may experience some quiet and some turbulent states of nature, like stock‐market fluctuations and seismic signals. The obvious consequence is that the signal may fall short of standard parametric normality or stationarity statistical testing. Moreover, a threshold upper and lower limit must be imposed on regime‐switching dates to avoid spurious computation of clusters too close to each endpoint. The threshold is set in the present context to be 15–85% of the total number of observations with consequential "valid" dates comprised within the reduced‐size sample [24].

The RGWS noise class is represented by the following sequence

$$Y_{t,\, i \in m} = \begin{cases} c_1 + f_1\left(\varepsilon_{t(1),1}\right) + \delta_1\, t(1) \\ \quad\vdots \\ c_m + f_m\left(\varepsilon_{t(m),m}\right) + \delta_m\, t(m) \end{cases} \tag{4}$$

where $m \ge 2$ are the valid state indexes and, for $i \in [1, m]$, the time index $t(i)$ is state-specific, the expected value $\mathrm{E}(|\delta_i|) \ge 0$, whereas $f_i(\,.\,)$ are the nonadditive or additive white-noise functions, respectively, from Eqs. (2) and (3). From Eq. (4), the typical $i$th state of the signal is

$$Y_{t,i} = c_i + f_i\left(\varepsilon_{t(i),i}\right) + \delta_i\, t(i), \quad \forall i \in m \tag{5}$$

where, given an *m*‐sized vector of exponents represented by a sequence of positive integers 1 ≤ *α*(*i*) << ∞, the fast modes of the signal are

$$\begin{cases} f_i\left(\varepsilon_{t(i),i}\right) = \left(\varepsilon_{t(i)}\right)^{\alpha(i)}, & \text{for some } 1 \le i < m \\ f_i\left(\varepsilon_{t(i),i}\right) = \left(\sum_{t(i)=1}^{T(i)} \varepsilon_{t(i)}\right)^{\alpha(i)}, & \text{for the remaining } 1 \le i < m \end{cases} \tag{6}$$

The entire $m$-sized sequence of the state functions $f_i(\varepsilon_{t(i),i})$ may turn out to be highly nonlinear if at least one $\alpha(i) > 1$. As a result, the entire signal would give rise in many cases to quirky graphics characterized by enhanced peaks and/or troughs, possibly interrupted by sizably flatter states. Multiple regime switching is expected to be very frequent in such a case, much less so if $\alpha(i) \equiv 1$, where Eq. (4) would show a more moderate pattern and fewer state gyrations, even zero.
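For experimentation, the three taxonomic classes are easy to simulate. The sketch below generates one RGS, one RWS and a two-state RGWS along the lines of Eqs. (2)–(6); the particular constants (drift, slope, break point, exponents) are arbitrary illustrative choices, not values used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 300
t = np.arange(T)
eps = rng.standard_normal(T)          # white noise epsilon_t ~ IID(0, 1)

# RGS, Eq. (2): white noise around a (possibly trended) mean.
c, delta = 0.5, 0.0
rgs = c + eps + delta * t

# RWS, Eq. (3): cumulated (colored) noise plus an optional trend.
rws = c + np.cumsum(eps) + delta * t

# RGWS, Eqs. (4)-(6): two regimes, the second with a cumulated and
# exponentiated fast mode (alpha > 1), producing a 'quirky' pattern.
split = 150                            # arbitrary break point
alpha1, alpha2 = 1, 2
regime1 = 0.0 + eps[:split] ** alpha1
regime2 = 1.0 + np.cumsum(eps[split:]) ** alpha2 + 0.05 * np.arange(T - split)
rgws = np.concatenate([regime1, regime2])
```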

Any of the three taxonomies enables empirical estimation of Eq. (1) by means of any of the smoothing procedures introduced in Section 1. Such procedures will, in different manners, find the smoother $y_t^*$ and by default the component $y_t$. In the IID context applied to the noise $\varepsilon_t$ of Eqs. (2)–(4), the random slow and fast components of Eq. (1) are both discrete stochastic processes with zero or nonzero mean and with finite variance and autocovariance. Incidentally, if the slow component is significantly trended (not trended), its autocovariance process is persistent (tapers off slowly over time), whereas the autocovariance of the second component rapidly tapers off over time (is white noise).

Whereas by simple logic the extraction process of the smoother $y_t^*$ is carried out over the entire timespan of the signal, in the presence of significant regime changes in Eqs. (4) and (5) this procedure might not produce an optimal smoother. Therefore, extraction of the smoother may be applied on a subsample cluster basis rather than by full-sample computation. Each subsample corresponds to a state (Eq. (4)) and its own smoother is estimated separately from the others. The sequence so obtained is then tested by minimum percentage root mean squared error (PRMSE). In quite a few real-time data cases (Section 5), the subsample estimation technique is found to be a better performer.

As a matter of empirical fact, simulation outcomes of Monte Carlo iterations with 1000 draws indicate that around 60% of the RGWS signals with *α*(*i*) ≡ 1 in Eq. (5) are nonlinear, while on average, roughly 40% are stationary and almost half of them contain one or more valid regime switches. Nonlinearity and stationarity are very common in the RGWS signals with *α*(*i*) > 1 in Eq. (5), as they are found in 70% of the corresponding simulations. Finally, on average, less than 1% (10%) of the RGS (RWS) are nonstationary (stationary), whereas in most cases (over 90%), both signals are linear.

## **3. The Hilbert‐Huang transform (HHT)**


The Hilbert-Huang transform (HHT) is an empirical procedure designed to correct for noise-ridden signal smoothing, purported to work for both nonlinear and nonstationary time series [4–6, 32, 33]. The HHT is based on the so-called ensemble empirical mode decomposition (EEMD), which consists, for any arbitrary number of steps $2 \le Q \le T$, of producing at each step an intrinsic mode function (IMF). Each IMF is computed by consecutive "siftings" of the data, a procedure that involves finding the means of the high-resolution envelopes constructed by cubic splining of the extreme values of the available data, possibly after correcting for endpoints [34]. The $Q$-step IMF-sifting process for a given signal $Y_t$ is represented by the following one-lag adaptive sequence

$$\begin{aligned} h_{t,0} &= Y_t \\ h_{t,1} &= h_{t,0} - \mathrm{E}\left(h_{t,0}\right) \\ h_{t,2} &= h_{t,1} - \mathrm{E}\left(h_{t,1}\right) \\ &\;\;\vdots \\ h_{t,Q} &= h_{t,Q-1} - \mathrm{E}\left(h_{t,Q-1}\right) \end{aligned} \tag{7}$$

where, given the upper and lower envelopes $s_{t,p}^{U}$, $s_{t,p}^{L}$, $p \in [0, Q-1]$, $\mathrm{E}(h_{t,p})$ is the mean envelope obtained at the end of each sifting process. More generally, from the second line onward, Eq. (7) may be written as follows:

$$h_{t,q} = h_{t,q-1} - \mathrm{E}\left(h_{t,q-1}\right), \quad q \in [1, Q] \tag{8}$$

which represents the family of all the sifted IMFs starting from the highest frequency. Subsequently, the matrix $HS_{T,Q}:(T, Q)$ of the HHT smoothers is obtained. This is the matrix of the EEMDs. Its rows are expressed as $\tilde{Y}_{t,q} = h_{t-q} - h_{t-q-1}$, from which the noise estimates are

$$u_{t,q} = \tilde{Y}_{t,q} - \tilde{Y}_{t,q-1}, \quad u_{t,q} \sim \mathrm{IID}(0, \sigma_u^2). \tag{9}$$
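A single sifting step of Eq. (8), in which the mean of the upper and lower cubic-spline envelopes of the local extrema is subtracted from the current component, might be sketched as follows. Endpoint handling is deliberately naive here, whereas the chapter's HHT relies on a dedicated endpoint correction [34]; the helper name `sift_once` is ours.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(h):
    """One sifting iteration: h_q = h_{q-1} - E(h_{q-1}), Eq. (8).

    E(h) is the mean of the cubic-spline envelopes through the local maxima
    and minima; the signal endpoints are appended to both envelopes as a
    crude guard against extrapolation."""
    t = np.arange(len(h))
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None                              # not enough extrema to envelope

    up_idx = np.unique(np.concatenate(([0], maxima, [len(h) - 1])))
    lo_idx = np.unique(np.concatenate(([0], minima, [len(h) - 1])))
    upper = CubicSpline(up_idx, h[up_idx])(t)    # s_t^U
    lower = CubicSpline(lo_idx, h[lo_idx])(t)    # s_t^L
    mean_envelope = 0.5 * (upper + lower)        # E(h_{t, q-1})
    return h - mean_envelope
```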

In an HHT environment, the optimal high‐resolution smoother among the candidates of *HST*,*<sup>Q</sup>* may be detected by utilizing the stoppage criterion for sifting, or else by similar means closely akin to the percentage root mean squared error (PRMSE), a much‐used performance index for goodness‐of‐fit purposes. The PRMSEs may be computed as follows:

$$S_q = \left( T^{-1} \frac{\sum_{t=1}^{T} u_{t,q}^2}{\sum_{t=1}^{T} \tilde{y}_{t,q-1}^2} \right)^{0.5}, \quad \forall q \in Q \tag{10}$$

which produce an inverse signal-to-noise ratio screeplot of length $Q$ [35, 36]. The procedure for detecting the screeplot global minimizer, denoted as $Q^*$, requires

$$P_q = \frac{\left\| S_{q-2} - S_{q-1} \right\|}{\left\| S_{q-1} - S_q \right\|} \tag{11}$$

where $\|\cdot\|$ is the Euclidean norm of the enclosed argument and $Q^* = \arg\max_{1 \le q \le Q}(P_q)$ [37]. It immediately follows that the measured PRMSE, given $Q^*$, is represented by the following formula

$$\hat{S}_{Q^*} = \left( T^{-1} \frac{\sum_{t=1}^{T} u_{t,Q^*}^2}{\sum_{t=1}^{T} \tilde{y}_{t,Q^*}^2} \right)^{0.5} \tag{12}$$

which is the flagship of the efficiency indicators that shall be utilized on confrontational grounds with respect to the two SRTs discussed in Section 4.
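Given a matrix of candidate smoothers, the PRMSE screeplot of Eq. (10) and the ratio criterion of Eq. (11) can be sketched as follows; the assumption that `HS` holds the candidates column-wise is an implementation choice, not something prescribed by the chapter.

```python
import numpy as np

def prmse_screeplot(HS):
    """PRMSE S_q of each smoother candidate, Eqs. (9)-(10).

    HS is assumed to hold the candidate smoothers column-wise, shape (T, Q);
    the noise u_q is the difference of consecutive candidates, as in Eq. (9)."""
    T = HS.shape[0]
    u = np.diff(HS, axis=1)                                # Eq. (9)
    return np.sqrt((u ** 2).sum(axis=0) /
                   (T * (HS[:, :-1] ** 2).sum(axis=0)))    # Eq. (10)

def screeplot_minimizer(S):
    """Global minimizer Q* of the screeplot via the ratio P_q of Eq. (11)."""
    P = np.abs(S[:-2] - S[1:-1]) / np.abs(S[1:-1] - S[2:])
    return int(np.argmax(P)) + 2        # position of the selected candidate in S
```

The PRMSE at the selected position then plays the role of Eq. (12) in the comparisons of Section 5.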

#### **4. The spectral representation transform (SRT)**

By virtue of the spectral representation theorem, the signal $Y_t$ as in Eq. (1) may be approximated by De Moivre's formula, as follows:

$$Y_t \simeq \cos(\omega t) - r \sin(\omega t) \tag{13}$$

which depicts a continuous harmonic waveform function, where $r = \sqrt{-1}$ and $\omega$ is a given arbitrary and constant frequency. It follows that the signal $Y_t$ may be defined by the following periodic function


$$\hat{Y}_{t,k} = \mu + \sum_{k=1}^{K} \left[ \phi_k \sin\left(\omega_k (t-1)\right) + \varphi_k \cos\left(\omega_k (t-1)\right) \right] \tag{14}$$

where $\mu$ is the mean of the signal, $k \in [2 \le K \le T)$ for $K$ a maximum lag integer, $\{\varphi_k\}_{k=1}^{K}$, $\{\phi_k\}_{k=1}^{K}$ are real-valued random coefficient sequences, both IID with finite mean and variance, and $\{\omega_k\}_{k=1}^{K}$ is the frequency sequence such that

$$\omega_k = \begin{cases} T^{-1}\, 2\pi k, & \text{if } \Delta Y_t = f\left(Y_{t-1}\right), \quad \lim_{k \to T}\left(\omega_k\right) = 2\pi \\ T^{-1}\, \pi k, & \text{if } \Delta Y_t = f\left(Y_{t-1}^{a}\right), \quad \lim_{k \to T}\left(\omega_k\right) = \pi \end{cases} \tag{15}$$

where $\Delta y_t = f(\,.\,)$ expresses the existence of linearity or nonlinearity of the process, depending on the exponent attached to the lagged signal level [20, 21]. Needless to say, $K$ generally corresponds to the maximal EMD level contemplated in the HHT method.

The fitted signal of Eq. (14), which is the smoother estimable by ordinary least squares [25], is actually the SRT of the original signal with time‐varying amplitude and lag *k*. If we let the time series of the prediction error be

$$e_{t,k} = Y_t - \hat{Y}_{t,k}, \quad \forall k \in K \tag{16}$$

where $e_{t,k} \sim \mathrm{IID}(0, \sigma_{e,k}^2)$. After producing 1000 Monte Carlo normal simulations of Eq. (14) with $T = 300$, the central limit theorem (CLT), as $k \to T$, is found not to hold asymptotically for RGWS, as is obvious in the presence of regime switches. Otherwise, for RGS and RWS, we have $\lim_{k \to T}(Y_t = \hat{Y}_{t,k})$.

Similar to the technique utilized for identifying the optimal smoother in the PRMSE sense exhibited in Section 3, also here a matrix of smoother candidates may be obtained depending on the $k$th lag chosen. The matrix is defined as $SS_{T,K}:(T, K)$, from which the optimal smoother can be selected by applying an optimal lag stopping criterion similar in kind to Eq. (16). However, we utilize here a performance index for model selection different from the PRMSE, which is based on the dynamics of the Hamiltonian optimal control problem [38–40].

If we let $e_{t,k}$ be defined as in Eq. (16), then its dynamics are captured by the first-order autoregressive AR(1) process $g_{t,k} = e_{t,k} - e_{t-1,k}$, while the dynamics of the smoother are expressed as the AR(1) process $h_{t,k} = \hat{Y}_{t,k} - \hat{Y}_{t-1,k}$. We expect both processes to be normally distributed with zero mean and finite variance.

Hence, the Hamiltonian problem may be expressed as follows:

$$H_k = \frac{1}{2} \sum_{t=2}^{T} \left\{ e_{t,k}^2 + g_{t,k}^2 + h_{t,k}^2 \right\}, \quad \forall k \in K \tag{17}$$

where the first element within the curly braces is of obvious reading and represents a resolution index that picks up the high frequencies of the problem. The second and the third elements capture the AR(1) dynamics of the processes involved in the Hamiltonian problem, namely, those of the prediction error and those of the signal itself. Eq. (17) is a cubic spline smoother that partly resembles the Hodrick-Prescott filter in the inclusion of both a cyclical and a trend part [3]. Moreover, Eq. (17) forms the basis for a screeplot [36] similar in kind to that of Eq. (11), whereby the optimal lag $K^*$ is obtained after letting

$$V_k = \frac{\left\| H_{k-2} - H_{k-1} \right\|}{\left\| H_{k-1} - H_k \right\|} \tag{18}$$

such that $K^* = \arg\max_{1 \le k \le K}(V_k)$, wherefrom $\hat{Y}_{t,K^*}^{*}$ is the optimal low-resolution Hamiltonian-based smoother, which is named SRT1.
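A compact sketch of the SRT1 construction follows: for each lag $k$, Eq. (14) is fitted by ordinary least squares on sine and cosine regressors, the Hamiltonian of Eq. (17) is evaluated, and $K^*$ is chosen by the ratio criterion of Eq. (18). The frequency rule is simplified to the linear branch of Eq. (15), $\omega_k = 2\pi k / T$, and no subsample clustering is attempted, so this is an illustration rather than the full procedure.

```python
import numpy as np

def srt1(y, K):
    """SRT1: spectral OLS fit (Eq. 14) with Hamiltonian lag selection (Eqs. 17-18).

    A sketch only: it assumes K >= 4 and uses the linear frequency branch."""
    T = len(y)
    t = np.arange(T)
    H = np.full(K + 1, np.nan)        # Hamiltonian values H_k, k = 2..K
    fits = {}
    for k in range(2, K + 1):
        omega = 2 * np.pi * np.arange(1, k + 1) / T
        X = np.column_stack([np.ones(T)]
                            + [np.sin(w * t) for w in omega]
                            + [np.cos(w * t) for w in omega])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit of Eq. (14)
        y_hat = X @ beta
        e = y - y_hat                                  # prediction error, Eq. (16)
        g = np.diff(e)                                 # error dynamics
        h = np.diff(y_hat)                             # smoother dynamics
        H[k] = 0.5 * (np.sum(e[1:] ** 2) + np.sum(g ** 2) + np.sum(h ** 2))  # Eq. (17)
        fits[k] = y_hat

    # Eq. (18): V_k = ||H_{k-2} - H_{k-1}|| / ||H_{k-1} - H_k||, K* = argmax V_k.
    ks = np.arange(4, K + 1)
    V = np.abs(H[ks - 2] - H[ks - 1]) / np.abs(H[ks - 1] - H[ks])
    k_star = int(ks[np.argmax(V)])
    return fits[k_star], k_star
```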

From this smoother the dual high-resolution SRT smoother, named SRT2, may be easily constructed by means of envelope augmentation, a procedure knowingly embedded into the HHT sifting process (Section 3). Let the upper and the lower envelopes of the actual signal be defined as the unique cubic splines $s_t^U$, $s_t^L$ constructed by using the extrema of the given data. The upshot is the signal $\hat{Y}_{t,K^*}^{**} = \frac{1}{2}(s_t^U + s_t^L) \in [s_t^U, s_t^L]$, such that the second smoother is $\tilde{Y}_{t,K^*}^{*} = \frac{1}{2}\left(\hat{Y}_{t,K^*}^{*} + \hat{Y}_{t,K^*}^{**}\right)$, which is the mean of the two smoothers. The error time series, for the optimal given lag $K^*$, are expressed in a fashion similar to Eq. (9), as follows

$$\begin{cases} \varepsilon_{t,K^*} = Y_t - \hat{Y}_{t,K^*}^{*}, & \varepsilon_{t,K^*} \sim \mathrm{IID}(0, \sigma_\varepsilon^2) \\ \eta_{t,K^*} = Y_t - \tilde{Y}_{t,K^*}^{*}, & \eta_{t,K^*} \sim \mathrm{IID}(0, \sigma_\eta^2) \end{cases} \tag{19}$$

where the first error is associated with SRT1 and the second is associated with SRT2.

From Eq. (19), the PRMSEs of both models may be found by the same means as those employed to obtain Eq. (12), namely

$$\hat{W}_{K^*} = \left( T^{-1} \frac{\sum_{t=1}^{T} \varepsilon_{t,K^*}^2}{\sum_{t=1}^{T} \hat{Y}_{t,K^*}^2} \right)^{0.5}, \quad \tilde{W}_{K^*} = \left( T^{-1} \frac{\sum_{t=1}^{T} \eta_{t,K^*}^2}{\sum_{t=1}^{T} \tilde{Y}_{t,K^*}^2} \right)^{0.5} \tag{20}$$

where the first (second) index is associated with SRT1 (SRT2).
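Continuing the sketch above, SRT2 averages the SRT1 smoother with the mid-envelope of the raw signal, and the PRMSEs of Eq. (20) compare the two variants; the upper and lower envelope arrays are assumed to come from a cubic-spline construction like the one sketched in Section 3.

```python
import numpy as np

def srt2(srt1_smoother, upper_envelope, lower_envelope):
    """SRT2: average the SRT1 smoother with the mid-envelope of the raw signal."""
    mid_envelope = 0.5 * (upper_envelope + lower_envelope)   # Y^**_{t,K*}
    return 0.5 * (srt1_smoother + mid_envelope)              # Y~*_{t,K*}

def prmse(y, smoother):
    """PRMSE as in Eqs. (12) and (20)."""
    T = len(y)
    resid = y - smoother                                      # errors of Eq. (19)
    return np.sqrt(np.sum(resid ** 2) / (T * np.sum(smoother ** 2)))
```

Comparing `prmse(y, srt1_smoother)` with `prmse(y, srt2(...))` then reproduces, in spirit, the $\hat{W}$/$\tilde{W}$ comparison of Eq. (20).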

#### **5. Smoother analysis applied to artificial and real‐time signals**

**Figure 1** depicts six signals pertaining to the three taxonomic classes (Section 2). The first three signals are 300-observation artificial RGS, RWS and RGWS, drawn from standard normal distributions. The other three are real-world signals, of which the first two are recent Japanese earthquake seismographic waveforms collected from a specific web directory [41]. The last signal represents the Qualcomm NASDAQ-traded stock close price on a weekly basis from the year 1995 to date.


**Figure 1.** Artificial and real‐time random signals, all smoothers and linear trends. Vertical axes represent signal magnitudes, horizontal axes represent time in calendar years (1, 2, 3 and 6) or in seconds (4, 5).

In particular, with regard to the first two real-time signals (**Figure 1**, panels 4 and 5), the reported seismographs concern two earthquakes, which occurred, respectively, in Makurazaki (Kagoshima prefecture) on November 13, 2015 at 20:51 UTC, with a moment magnitude (*Mw*) of 6.7 and no damage or victims, and in the Tohoku region on March 11, 2011 at 5:46 UTC, with *Mw* of 9.0, an ensuing tsunami and a reported death toll of 15,893. The waveform data from the respective seismogram stations include 108 and 47,200 observations, of which only the last 3000 were retained for experimentation.

Horizontal (vertical) measurements in **Figure 1** represent time lengths (magnitudes). All of the three smoothers (HHT, SRT1 and SRT2) are included together with a linear best-fit trend. The first two signals (RGS and RWS) are linear while the third (RGWS) is nonlinear (**Figure 1**, panels 1–3). Among them, only RGS is stationary and none exhibits a valid time break. Of the three real-time signals (Makurazaki and Tohoku earthquakes, Qualcomm), all are nonlinear and one is nonstationary as well (Tohoku earthquake).


| Signal/coefficient | RGS (A): HHT | SRT1 | SRT2 | RWS (B): HHT | SRT1 | SRT2 | RGWS (C): HHT | SRT1 | SRT2 |
|---|---|---|---|---|---|---|---|---|---|
| 1. PRMSE | 0.924 | 0.973 | 0.875 | 0.224 | 0.228 | 0.164 | 0.288 | 0.355 | 0.221 |
| 2. Error variance | 0.854 | 0.952 | 0.771 | 1.633 | 1.844 | 0.911 | 4.459 | 5.533 | 1.973 |
| 3. Smoother variance | 0.924 | 0.978 | 0.881 | 7.290 | 8.088 | 5.554 | 15.482 | 15.586 | 8.928 |
| 4. Corrected ADF test statistic for stationarity of error variance | ‐3.138 | 1.472 | ‐6.851 | ‐4.088 | ‐3.243 | ‐4.067 | ‐1.831 | ‐1.565 | ‐3.104 |

Columns A–C refer to the RGS, RWS and RGWS Monte Carlo simulations, 300 observations each.

**Table 1.** Monte Carlo simulation mean values of select coefficients for the three signal classes.

Moreover, all of these signals exhibit one single valid time break each, respectively located at observation 57, 1123 and 261. Finally, from unreported results, SRT2 prevails as the least-PRMSE smoother for the first four signals, while the quirky behavior of the last two signals (Tohoku and Qualcomm) finds HHT as the most efficient smoother.

**Table 1**, columns A–C, exhibits the efficiency performance and other coefficients of the three smoothers (HHT, SRT1 and SRT2) for the three signal classes (RGS, RWS and RGWS). The results shown are the mean values obtained from empirical Monte Carlo signal simulations with 1000 draws each and length of 300 observations. The PRMSE coefficients reported in line 1 have received attention in Eqs. (12) and (20), whereas the coefficients reported in line 2 show the simulated mean error variance appearing in their numerators. The variance of the smoothed trends is exhibited in line 3. In all cases, simple eyeballing points to SRT2 as the most efficient, that is, the variance minimizing smoothing method (**Table 1**, third column of columns A–C).

Stationarity of the error variances in Eqs. (9) and (19) is crucial for establishing the goodness-of-fit of the estimated smoothers of all signal classes. Here, stationarity is tested by means of a novel technique, which corrects the conventional augmented Dickey-Fuller test statistic [19] after accounting for overtime changes in subsample variances, cycles and growing/falling linear trends in the signal [12]. This technique exhibits a similar nonparametric distribution as the ADF test statistic, such that the critical test statistics for stationarity of the untransformed signal are ‐2.79 and ‐2.51 for *p*-values of 1 and 5%, respectively. The corrected ADF test statistics of the error variances are reported in **Table 1**, line 4. In many, if not in most cases, stationarity emerges, yet the error variances generated by SRT2 are significantly stationary for all signal classes, whereas the other methods fail in the particular case of RGWS.

In order to further compare the HHT and the SRT performances, select real‐time signals are being put to empirical testing for the sake of optimal smoothing analysis. The entire data set of the real‐time data (RTD) includes eight signals of different nature: climatic, economic, seismic and solar. The RTD proposed are mostly high resolution and long term, ranging from a minimum of 315 to a maximum of 1915 observations of which one is yearly and all the others are monthly. The last access to the web for all available data was November 30, 2015.

The full source list and related time intervals are contained in the RTD Appendix, while the synthetic acronymed list of the select signals is the following: AMO (Atlantic Multidecadal Oscillation), GISS (Global Land-Ocean Temperature Index), Yamalia (Yamal Peninsula Temperature Reconstructions), S&P500 (Standard & Poor's 500 Composite Index), SPI (U.S. Standardized Precipitation Index), Banda (Banda Aceh earthquake, 2004), NASDAQ (close values of the NASDAQ Stock Index) and finally SSN (Sunspot Numbers).

Among these RTDs, worthy of more detail than provided in the Appendix is the earthquake of Banda Aceh, Indonesia, which occurred on December 26, 2004 at 00:58 UTC with *Mw* of 9.2 and which caused, especially due to an extraordinarily violent tsunami wave, a death toll of an estimated 250,000. The data utilized straddle the foreshock and the main shock tremors, namely, the observations comprised between 12,000 and 15,000 out of a total of recorded waves tallying 58,320 observations.

The relevant descriptive statistics and visual performances of the RTDs are exhibited, respectively, in **Table 2** and in **Figure 2a** and **b**. From **Table 2**, a very diverse pattern emerges in terms of mean and standard deviation with estimated volatilities ranging from 1.0 (NASDAQ) to 101.0 (Banda). Moreover, skewness appears mild everywhere, barring two cases where it is relatively high (S&P500 and NASDAQ), whereas kurtosis hovers for all RTDs around its critical value of three. All but two of the RTDs (AMO and SSN) are nonstationary and all exhibit zero valid regime switches, exclusion made for Banda and NASDAQ. Finally, they are all nonlinear except for AMO and SPI, as subsumed from the Harvey linearity test statistic whose critical value is close to 3.0. Optimal smoothing of the RTDs is achieved in the last two cases (Banda and NASDAQ) by means of subsample cluster analysis whereas for the other RTDs full-sample computation is more efficient (Section 2). The SRT2 smoother is found to exhibit the smallest PRMSEs in 75% of the cases proposed, whereas only three of them prefer the HHT method (S&P500, NASDAQ and SSN).

For each RTD, the actual signal, its optimal smoother (HHT or SRT2) and its linear trend are exhibited in **Figure 2a** and **b**. As in **Figure 1**, horizontal (vertical) measurements represent time lengths (magnitudes).


**Table 2.** Descriptive and select test statistics of the real‐time dataset (RTD).


**Figure 2.** Real-time select random signals: (a) vertical axes represent signal magnitudes, horizontal axes represent time in calendar years; (b) vertical axes represent signal magnitudes, horizontal axes represent time in years (5, 7 and 8) or seconds (6).

Among the three HHT optimal smoothers found above, two are related to quirky signals, S&P500 and NASDAQ (**Figure 2a**, panel 4 and **Figure 2b**, panel 7). They exhibit, however, different frequencies, in spite of being estimated by the same subsample cluster technique. In fact the former (latter) is a low- (high-) resolution signal. Both signals exhibit in any case broken trends characterized by a long period of quiet followed by wild gyrations, which reflect the highly varying moods of both the stock market and of the Federal Reserve Board. The third signal associated with optimal smoothing attained through the HHT is the time series of sunspots (SSN). Full-sample estimation of the smoother was preferred, while broken trends clearly emerge from visual inspection (**Figure 2b**, panel 8). Significant regime switches are found to occur at the observations 1810 and 1902, on the occasion of the Dalton and of the Modern Minimum, respectively [12].

Among the other five RTDs whose optimal smoother is of the SRT2 brand, only one (Banda) requires subsample cluster computation. The optimal smoother exhibits a dramatic regime switch in the passage from the foreshocks to the main shocks and somewhat later (**Figure 2b**, panel 6). The two major break dates are placed at observations 1833 and 1905 of 3000 observations. All of the RTD optimal smoothers, including Banda, exhibit high resolution and, at least visually, manifest considerable accuracy in tracking the original signal. This is particularly true for AMO, GISS, Yamalia and SPI, which are sizably noise‐ridden (**Figure 2a**, panels 1–3 and 5).

## **6. Conclusions**

This chapter has introduced and described in detail two new dual empirical methods for obtaining optimal smoothing of random signals pertaining to three broad taxonomic classes. Both methods utilize an application of the spectral representation theorem for signal decomposition that exploits the dynamic properties of optimal control. The two methods, named SRT1 and SRT2, produce a low‐ and a high‐resolution filter, respectively, which may be utilized as optimal long‐ and short‐run tracking as well as forecasting devices. The methods are proven by Monte Carlo simulation to be more efficient than the empirical Hilbert‐Huang transform (HHT) for all of the taxonomic classes. The methods are also comparatively tested by using random artificial signals and a set of real‐time signals, in particular eight select data sets including climatic, seismological and economic time series. HHT is proven to be more efficient in a few cases of quirky, multiple regime‐switch signals, like the Standard & Poor's 500 and the NASDAQ indexes.

## **Real‐Time Data**

v4 1880‐07/2015. Available from: http://global‐land‐ocean‐temperature‐index.blogspot.com/p/blog‐page\_14.html [Accessed: 2015‐08‐20].

**3.** Yamalia Tree Ring Summer Temperature Reconstructions, yearly observations, 1650‐2005. Available from: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/treering/reconstructions/asia/russia/yamalia2013temp2000yr.txt [Accessed: 2016‐09‐10].

**4.** S&P500: Standard and Poor's Index, monthly observations adjusted close 1950:01‐2015:09. Source: http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices [Accessed: 2016‐09‐11].

**5.** SPI, U.S. Standardized Precipitation Index, monthly, 1897:01‐2014:03. Available from: http://www.drought.gov/drought/content/products‐current‐drought‐and‐monitoring‐drought‐indicators/standardized‐precipitation‐index [Accessed: 2016‐09‐11].

**6.** Banda, Banda Aceh Earthquake, Indonesia, 12‐26‐2004, observations 12,000‐14,999. Source: [41]. Available from: http://earthquake.usgs.gov/earthquakes/dyfi/events/us/2004‐slav/us/ [Accessed: 2016‐09‐10].

**7.** NASDAQ Composite index, monthly close, 1971:02‐2015:03. Available from: http://finance.yahoo.com/q/hp?s=%5EIXIC&a=01&b=5&c=1971&d=02&e=14&f=2015&g=m [Accessed: 2016‐07‐11].

**8.** SSN: Sun Spot Number Revisited Series by Clette et al., 2014, yearly 1700–2014. Available from: http://www.sidc.be/silso/datafiles [Accessed: 2016‐09‐11].

## **Author details**

Guido Travaglini

Address all correspondence to: jay\_of\_may@yahoo.com

University of Rome 1, Rome, Italy

## **References**

[1] Savitzky A. and Golay M.J.E. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry. 1964; **36**: 1627–1639. DOI: 10.1021/ac60214a047

[2] Daubechies I. Ten Lectures on Wavelets. SIAM, Society for Industrial and Applied Mathematics, Philadelphia, PA. 1992. p. xix + 357. DOI: 10.1137/1.9781611970104

[3] Hodrick R. and Prescott E.C. Postwar U.S. business cycles: an empirical investigation. Journal of Money, Credit and Banking. 1997; **29**: 1–16.

[4] Huang N.E., Shen Z., Long S.R., Wu M.C., Shih H.H., Zheng Q., Yen N.‐C., Tung C.C. and Liu H.H. The empirical mode decomposition and the Hilbert spectrum for nonlinear and nonstationary time series analysis. Proceedings of the Royal Society of London. 1998; **454A**: 903–993. DOI: 10.1098/rspa.1998.0193

[18] Harvey D.I., Leybourne S.J. and Xiao B. A powerful test for linearity when the order of integration is unknown. Studies in Nonlinear Dynamics & Econometrics. 2008; **12**: Article 2. DOI: 10.2202/1558‐3708.1582

[19] Said E. and Dickey D.A. Testing for unit roots in autoregressive moving average models of unknown order. Biometrika. 1984; **71**: 599–607. DOI: 10.1093/biomet/71.3.599

[20] Kapetanios G., Shin Y. and Snell A. Testing for a unit root in the nonlinear STAR framework. Journal of Econometrics. 2003; **112**: 359–379. DOI: 10.1016/S0304‐4076(02)00202‐6

[21] Kapetanios G. and Shin Y. Testing the null hypothesis of nonstationary long memory against the alternative hypothesis of a nonlinear ergodic model. Econometric Reviews. 2011; **30**: 620–645. DOI: 10.1080/07474938.2011.553568

[22] Perron P. The great crash, the oil price shock and the unit root hypothesis. Econometrica. 1989; **57**: 1361–1401. DOI: 10.2307/1913712

[23] Bai J. and Perron P. Computation and analysis of multiple structural change models. Journal of Applied Econometrics. 2003; **18**: 1–22. DOI: 10.1002/jae.659

[24] Kim D. and Perron P. Unit root tests allowing for a break in the trend function at an unknown time under both the null and alternative hypotheses. Journal of Econometrics. 2009; **148**: 1–13. DOI: 10.1016/j.jeconom.2008.08.019

[25] Hamilton J.D. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica. 1989; **57**: 357–384. DOI: 10.2307/1912559

[26] Hamilton J.D. Time Series Analysis. Princeton University Press, Princeton, NJ. 1994. pp. xiv + 799. DOI: 10.1017/S0266466600009440

[27] Hamilton J.D. Regime switching models. The New Palgrave Dictionary of Economics, 2nd ed., Durlauf S.N. and Blume L.E. editors. Palgrave Macmillan, London. 2008. DOI: 10.1057/9780230226203.1411

[28] Kanas A. Purchasing power parity and Markov regime switching. Journal of Money, Credit and Banking. 2006; **38**: 1669–1687. DOI: 10.1353/mcb.2006.0083

[29] Mizrach B. Nonlinear time series analysis. The New Palgrave Dictionary of Economics, 2nd ed., Durlauf S.N. and Blume L.E. editors. Palgrave Macmillan, London, UK. 2008. pp. 169–177.

[30] Mizrach B. Nonlinear mean reversion in EMS exchange rates. Brussels Economic Review. 2010; **53**: 187–198.

[31] Lee H.‐T. and Yoon G. Does purchasing power parity hold sometimes? Regime switching in real exchange rates. Applied Economics. 2013; **45**: 2279–2294. DOI: 10.1080/00036846.2012.661399

[32] Wu Z., Huang N.E., Long S.R. and Peng C.‐K. On the trend, detrending and variability of nonlinear and nonstationary time series. PNAS. 2007; **104**: 14889–14894. DOI: 10.1073/pnas.0701020104

[33] Wu Z. and Huang N.E. Ensemble empirical mode decomposition: a noise‐assisted data analysis method. Advances in Adaptive Data Analysis. 2009; **1**: 1–41. DOI: 10.1142/S1793536909000047


#### **Information‐Theoretic Clustering and Algorithms**

#### Toshio Uchiyama

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66588

## Abstract

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. Many clustering methods have been proposed over the past several decades. Since clustering results depend on criteria and algorithms, appropriate selection of them is an essential problem. Recently, large sets of users' behavior logs and text documents have become common. These are often presented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering (ITC), which is appropriate and useful for analyzing such high-dimensional data, from both the theoretical and the experimental side. Theoretically, the criterion, generative models, and novel algorithms are presented. Experimentally, the effectiveness and usefulness of ITC are shown for text analysis as an important example.

Keywords: information-theoretic clustering, competitive learning, Kullback-Leibler divergence, Jensen-Shannon divergence, clustering algorithm, text analysis

## 1. Introduction

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. It is a fundamental procedure to analyze data [1, 2].

Clustering is unsupervised and different from supervised classification. In supervised classification, we have a set of labeled data (belonging to predefined classes), train a classifier using the labeled data (training set), and judge which class a new object belongs to with the classifier. In the case of clustering, we find meaningful clusters without using any labeled data and group a given collection of unlabeled data into them. Clustering can also help us to find meaningful classes (labels) for supervised classification. Since it is more difficult to prepare a training set for larger data sets, unsupervised analysis of data such as clustering has recently become more important.

For example, the user-item matrix in Table 1 shows which items each user bought. When considering the data as a set of feature vectors for users, we can find many types of users' behavior by clustering.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


|       | item1 | item2 | item3 | item4 | item5 |
|-------|-------|-------|-------|-------|-------|
| User1 | 3 | 0 | 0 | 5 | 0 |
| User2 | 0 | 1 | 2 | 0 | 0 |
| User3 | 1 | 0 | 0 | 0 | 2 |
| User4 | 0 | 0 | 1 | 0 | 0 |

Table 1. Consumption behavior of users.


|       | Document1 | Document2 | Document3 | Document4 | Document5 |
|-------|-----------|-----------|-----------|-----------|-----------|
| Word1 | 3 | 0 | 0 | 5 | 0 |
| Word2 | 0 | 1 | 2 | 0 | 0 |
| Word3 | 1 | 0 | 0 | 0 | 2 |
| Word4 | 0 | 0 | 1 | 0 | 0 |

Table 2. Word frequencies in documents (bag-of-words feature representation).

It is also possible to analyze the data as a set of feature vectors for items. From the word-document matrix in Table 2, both document clusters and word clusters could be extracted.

Many clustering methods have been proposed over the past several decades. Those include the k-means algorithm [3], competitive learning [4], spherical clustering [5], spectral clustering [6], and maximum margin clustering [7]. Since clustering results depend on criteria and algorithms, appropriate selection of them is an essential problem. Large sets of users' behavior logs and text documents (as shown in Tables 1 and 2) have recently become common. These are often presented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering [8] and algorithms that are appropriate and useful for analyzing such high-dimensional data.

Information-theoretic clustering (ITC) uses the Kullback-Leibler divergence and the Jensen-Shannon divergence to determine its criterion, while the k-means algorithm uses the sum of squared error as its criterion. This chapter explains ITC by contrasting these two clustering techniques (criteria and algorithms), because there are a number of interesting similarities between them. There exists a difficulty in algorithms for ITC. We explain it in detail and propose novel algorithms to overcome it.

Experimental results for text data sets are presented to show the effectiveness and usefulness of ITC and the novel algorithms for it. In the experiments, maximum margin clustering and spherical clustering are used for comparison. We also provide evidence to support the effectiveness of ITC by detailed analysis of clustering results.

## 2. The sum-of-squared-error criterion and algorithms

Given a set of $M$-dimensional input vectors $\mathcal{X} = \{\mathbf{x}^i \mid \mathbf{x}^i \in \mathbb{R}^M;\, i = 1, \ldots, N\}$, where $N$ is the number of vectors, clustering is the task of assigning each input vector $\mathbf{x}^i$ a cluster label $k\,(k = 1, \ldots, K)$ so as to partition the vectors into $K$ clusters $\mathcal{C} = \{C^1, \ldots, C^K\}$.

Figure 1. Input vectors and the mean vector in C<sup>k</sup> .


The sum-of-squared-error criterion [9] is a simple and widely used criterion for clustering.

Let $\boldsymbol{\mu}^k$ be the mean of the input vectors $\mathbf{x}^i$ that belong to the cluster $C^k$ (see Figure 1). Then, the error in $C^k$ is the sum of the squared lengths $\|\mathbf{x}^i - \boldsymbol{\mu}^k\|^2$ of the differential (= "error") vectors, and the sum-of-squared-error criterion over all clusters (the within-cluster sum of squares) is defined by

$$J\_W = \sum\_{k=1}^{K} \sum\_{\mathbf{x}^i \in \mathcal{C}^k} \|\mathbf{x}^i - \boldsymbol{\mu}^k\|^2. \tag{1}$$

JW is the objective function (criterion) to be minimized in clustering based on this criterion.

Also, we define the between-cluster sum of squares $J_B$ and the total sum of squares $J_T$ as

$$J_B = \sum_{k=1}^{K} N_k \|\boldsymbol{\mu}^k - \boldsymbol{\mu}\|^2, \; J_T = \sum_{i=1}^{N} \|\mathbf{x}^i - \boldsymbol{\mu}\|^2,\tag{2}$$

respectively, where $N_k$ is the number of input vectors $\mathbf{x}^i$ in $C^k$ (i.e., $N = \sum_{k=1}^{K} N_k$) and

$$\boldsymbol{\mu}^{k} = \frac{1}{N\_{k}} \sum\_{\mathbf{x}^{i} \in \mathcal{C}^{k}} \mathbf{x}^{i}, \; \boldsymbol{\mu} = \frac{1}{N} \sum\_{i=1}^{N} \mathbf{x}^{i}. \tag{3}$$

It follows from these definitions that the total sum of squares is the sum of the within-cluster sum of squares and the between-cluster sum of squares:

$$J\_T = J\_W + J\_B.\tag{4}$$

Since the mean $\boldsymbol{\mu}$ of all input vectors is derived from $\mathcal{X} = \{\mathbf{x}^1, \ldots, \mathbf{x}^N\}$ [see Eq. (3)], $J_T$ does not depend on the clusters $\mathcal{C}$ [see Eq. (2)] and is constant for the given input vectors $\mathcal{X}$. Therefore, minimization of $J_W$ is equivalent to maximization of $J_B$. In this sense, clustering based on minimizing the criterion $J_W$ works to find clusters that are well separated from each other.
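As a quick numerical check of the decomposition in Eq. (4), the following sketch (an illustrative addition, not part of the original chapter; it assumes NumPy and randomly generated toy data) computes $J_W$, $J_B$, and $J_T$ for an arbitrary cluster assignment and confirms the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # N = 200 input vectors, M = 5
labels = rng.integers(0, 3, size=200)      # an arbitrary assignment into K = 3 clusters

mu = X.mean(axis=0)                        # overall mean, Eq. (3)
J_W = J_B = 0.0
for k in range(3):
    Xk = X[labels == k]
    mu_k = Xk.mean(axis=0)                 # cluster mean, Eq. (3)
    J_W += np.sum(np.linalg.norm(Xk - mu_k, axis=1) ** 2)   # within-cluster SSE, Eq. (1)
    J_B += len(Xk) * np.linalg.norm(mu_k - mu) ** 2         # between-cluster term, Eq. (2)
J_T = np.sum(np.linalg.norm(X - mu, axis=1) ** 2)           # total sum of squares, Eq. (2)

assert np.isclose(J_T, J_W + J_B)          # Eq. (4)
```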

#### 2.1. Generative model

In the background of clustering based on the objective function (criterion) $J_W$, there exists an assumption of a Gaussian distribution for the input vectors [10].

Suppose that there are clusters $C^k\,(k = 1, \ldots, K)$, each of which generates input vectors according to the conditional probability density function:

$$p(\mathbf{x}^i|\mathbf{x}^i \in \mathbb{C}^k) = \frac{1}{(2\pi\sigma\_k^2)^{M/2}} \exp\left(-\frac{\|\mathbf{x}^i - \boldsymbol{\mu}^k\|^2}{2\sigma\_k^2}\right),\tag{5}$$

where $\sigma_k$ is the standard deviation of the cluster $C^k$ and $M$ is the number of dimensions of $\mathbf{x}^i$. In the following, we assume that $\sigma_k$ takes a constant value $\sigma$ for all clusters $C^k\,(k = 1, \ldots, K)$. Considering the independence of each generation, the joint probability density function for the input vectors $\mathcal{X}$ becomes

$$p(\mathcal{X}|\mathcal{C}) = \prod_{k=1}^{K} \prod_{\mathbf{x}^i \in \mathcal{C}^k} \frac{1}{\left(2\pi\sigma^2\right)^{M/2}} \exp\left(-\frac{\|\mathbf{x}^i - \boldsymbol{\mu}^k\|^2}{2\sigma^2}\right),\tag{6}$$

where $\mathcal{C}$ indicates the cluster information that specifies which input vector $\mathbf{x}^i$ belongs to which cluster $C^k$. Taking the logarithm of Eq. (6) yields

$$\ln p(\mathcal{X}|\mathcal{C}) = -\frac{NM}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{\mathbf{x}^i \in \mathcal{C}^k} \|\mathbf{x}^i - \boldsymbol{\mu}^k\|^2. \tag{7}$$

Since σ is constant, the maximization of Eq. (7) is equivalent to the minimization of

$$\sum\_{k=1}^{K} \sum\_{\mathbf{x}^{i} \in \mathcal{C}^{k}} \|\mathbf{x}^{i} - \boldsymbol{\mu}^{k}\|^{2}. \tag{8}$$

which is nothing more or less than the objective function (criterion) JW . Therefore, under the assumption of Gaussian distribution about input vectors, clustering based on Eq. (8) works to find the most probable solution C.
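To make the equivalence concrete, the short sketch below (illustrative only, not from the original chapter; it assumes NumPy, a fixed $\sigma = 1$, and a two-blob toy data set) evaluates the log-likelihood of Eq. (7) for two candidate clusterings and confirms that the one with the smaller $J_W$ has the larger likelihood.

```python
import numpy as np

def sse_and_loglik(X, labels, K, sigma=1.0):
    """Return (J_W, ln p(X|C)) under the isotropic Gaussian model of Eqs. (5)-(7)."""
    N, M = X.shape
    J_W = sum(np.sum(np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0),
                                    axis=1) ** 2) for k in range(K))
    loglik = -N * M / 2 * np.log(2 * np.pi * sigma**2) - J_W / (2 * sigma**2)  # Eq. (7)
    return J_W, loglik

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3])])
good = np.repeat([0, 1], 50)                 # assignment matching the generating clusters
bad = rng.integers(0, 2, size=100)           # a random assignment

(jw_g, ll_g), (jw_b, ll_b) = sse_and_loglik(X, good, 2), sse_and_loglik(X, bad, 2)
assert jw_g < jw_b and ll_g > ll_b           # smaller J_W corresponds to larger likelihood
```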

#### 2.2. Algorithms

#### 2.2.1. k-means algorithm

k-means [3, 11] is a well-known algorithm for clustering based on the sum-of-squared-error criterion. The main idea of this algorithm is as follows. In the objective function $J_W$ (1), the error for a vector $\mathbf{x}$ is calculated by $\|\mathbf{x} - \boldsymbol{\mu}^k\|^2$, where $\boldsymbol{\mu}^k$ is the mean of the cluster $C^k$ to which $\mathbf{x}$ belongs. If $\|\mathbf{x} - \boldsymbol{\mu}^t\|^2 < \|\mathbf{x} - \boldsymbol{\mu}^k\|^2$, changing the cluster from $C^k$ to $C^t$ can reduce the objective function $J_W$.

We introduce weight vectors $\mathbf{w}^k\,(k = 1, \ldots, K)$ (collectively $\mathcal{W}$) that represent the clusters $C^k$ in order to implement the idea mentioned above. The weight vector $\mathbf{w}^k$ plays the role of the mean vector $\boldsymbol{\mu}^k$ and of the prototype vector of cluster $C^k$. As illustrated in Figure 2, the idea of k-means is the alternating repetition of two steps, "(a) Update weights" (calculating the mean $\boldsymbol{\mu}^k$ as the weight vector $\mathbf{w}^k$) and "(b) Update clusters" (allocating each input vector $\mathbf{x}^i$ to a cluster $C^k$ on the basis of the minimum distance from the weight vectors $\mathbf{w}^k$). Note that Figure 2b is a Voronoi tessellation determined by the weight vectors $\mathbf{w}^k$, which are usually called prototype vectors in this context.

Figure 2. Two steps in k-means algorithm. (a) Update weights. (b) Update clusters.

Figure 3a is a flow chart of the k-means algorithm to which the processes of initialization and termination are added. As a matter of fact, clustering is closely related to vector quantization. Vector quantization means mapping input vectors to a codebook, which is a set of weight vectors (prototype vectors). When using the quantization error $E_Q$:

$$E_Q = \sum_{i=1}^{N} \min_k \|\mathbf{x}^i - \mathbf{w}^k\|^2,\tag{9}$$

the clusters $\mathcal{C}$ determined by a local optimal solution $\mathcal{W}$ of vector quantization are a local optimal solution of the clustering problem [12]. In this sense, clustering can be replaced by vector quantization and vice versa. We can write a flow chart for vector quantization as Figure 3b, but we also find this chart (b) to be the k-means algorithm. Furthermore, the LBG algorithm [13], which is well known for vector quantization, is based on the approach of Lloyd [3] (one of the original papers on the k-means algorithm). These facts show a close relationship between clustering and vector quantization.

Initialization is important, because the k-means algorithm converges to a local optimal solution which depends on the initial condition (a set of weights or clusters). If we initialize the weights $\mathcal{W}$ by randomly selecting them from the input vectors, it may converge to a very bad local optimal solution with high probability. Random labeling, which randomly assigns cluster labels $\mathcal{C}$ to the input vectors, may lead to better solutions than random selection of weights. The random labeling initialization can also be used for charts (b) and (c) in Figure 3 by replacing the "Initialize weights" step with "Initialize clusters" and "Update weights" steps. For directly initializing weights, the splitting algorithm [13] and k-means++ [14] are known.
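As an illustration of the two alternating steps in Figure 2 and of the random-labeling initialization just mentioned, a minimal k-means sketch is given below. It is not the author's code; it assumes NumPy and a data matrix `X` of shape (N, M).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate '(a) Update weights' (cluster means) and '(b) Update clusters' (nearest mean)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))            # random labeling initialization
    for _ in range(n_iter):
        # (a) Update weights: the mean of each cluster becomes its prototype vector.
        W = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(len(X))] for k in range(K)])
        # (b) Update clusters: assign each vector to the nearest prototype.
        d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # converged to a local optimum
            break
        labels = new_labels
    return labels, W
```

For example, `labels, W = kmeans(X, K=3)` partitions `X` into three clusters under the sum-of-squared-error criterion.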

#### 2.2.2. Competitive learning

Suppose that there are clusters Ck

tional probability density function:

Taking the logarithm of Eq. (6) yields

find the most probable solution C.

vector x is calculated by ∥x−μ<sup>k</sup>

We introduce weight vector w<sup>k</sup>

∥2

usually called prototype vector in this context.

2.2. Algorithms

∥x−μ<sup>t</sup>

cluster Ck

wk

2.2.1. k-means algorithm

∥<sup>2</sup> < ∥x−μ<sup>k</sup>

becomes

<sup>p</sup>ðx<sup>i</sup> jxi ∈C<sup>k</sup>

96 Advances in Statistical Methodologies and Their Application to Real Problems

Þ ¼ <sup>1</sup> <sup>ð</sup>2πσ<sup>2</sup> k Þ

followings, we assume that σ<sup>k</sup> is constant value σ for all clusters C<sup>k</sup>

K k¼1 ∏ xi ∈C<sup>k</sup>

NM

pðXjCÞ ¼ ∏

lnpðXjCÞ ¼ −

where σ<sup>k</sup> is a standard deviation of the cluster C<sup>k</sup> and M is the number of dimension of x<sup>i</sup>

independence of each generation, joint probability density function for the input vectors X

1 ð2πσ<sup>2</sup>Þ

where C indicate cluster information that specifies which input vector x<sup>i</sup> belongs to cluster Ck

which is nothing more or less than the objective function (criterion) JW . Therefore, under the assumption of Gaussian distribution about input vectors, clustering based on Eq. (8) works to

k-means [3, 11] is well-known algorithm for clustering based on the sum-of-squared-error criterion. Main idea of this algorithm is as follows. In the objective function JW (1), error for

mentioned above. The weight vector w<sup>k</sup> involves mean vector μ<sup>k</sup> and prototype vector of

(allocating input vector x<sup>i</sup> to a cluster C<sup>k</sup> on the basis of minimum length from weight vectors

). Note that Figure 2b is a Voronoi tessellation determined by weight vectors w<sup>k</sup>

"(a) Update weights" (calculating mean μ<sup>k</sup> as weight vector w<sup>k</sup>

. As illustrated in Figure 2, the idea of k-means is alternative repetition of two steps

∥<sup>2</sup> where μ<sup>k</sup> is the mean of cluster C<sup>k</sup> to which x belongs. If

<sup>ð</sup><sup>k</sup> <sup>¼</sup> <sup>1</sup>;…;K<sup>Þ</sup> (W) that represent cluster Ck to implement the idea

, changing the cluster from Ck to Ct can reduce the objective function JW.

<sup>2</sup> logð2πσ<sup>2</sup>

Since σ is constant, the maximization of Eq. (7) is equivalent to the minimization of

∑ K k¼1 ∑ xi ∈C<sup>k</sup> ∥xi −μ<sup>k</sup> ∥2

<sup>M</sup>=<sup>2</sup> exp −

<sup>M</sup>=<sup>2</sup> exp −

<sup>Þ</sup><sup>−</sup> <sup>1</sup> <sup>2</sup>σ<sup>2</sup> <sup>∑</sup> K k¼1 ∑ xi ∈C<sup>k</sup> ∥x<sup>i</sup> −μ<sup>k</sup> ∥2

ðk ¼ 1;…;KÞ, which generates input vectors by the condi-

, (5)

ðk ¼ 1;…;KÞ. Considering

, (6)

: (7)

) and "(b) Update clusters"

, which are

: (8)

. In

.

∥x<sup>i</sup> −μ<sup>k</sup>∥<sup>2</sup> 2σ<sup>2</sup> k

∥x<sup>i</sup> −μ<sup>k</sup> ∥2 2σ<sup>2</sup> 

Competitive learning [4, 11] is a learning method for vector quantization and is also utilized for clustering. While the k-means algorithm updates all weights $\mathcal{W}$ by batch processing, competitive learning updates one weight $\mathbf{w}$ at a time to reduce a part of the quantization error $E_Q$ (see Figure 3c) as follows:

1. Select one input vector $\mathbf{x}$ randomly from $\mathcal{X}$.

2. Decide a winner $\mathbf{w}^c$ from $\mathcal{W}$ by

$$c = \arg\min_{k} \|\mathbf{x} - \mathbf{w}^k\|^2 \quad \text{(if there are several candidates, choose the smallest } k). \tag{10}$$

3. Update the winner's weight $\mathbf{w}^c$ as

$$\mathbf{w}^{c} \leftarrow (1 - \gamma)\mathbf{w}^{c} + \gamma \mathbf{x}, \tag{11}$$

where $\gamma$ is a given learning rate (e.g., 0.01–0.1).

Figure 3. Flow charts of algorithms based on the sum-of-squared-error criterion. (a) k-means1, (b) k-means2, and (c) competitive learning.

Though the winner-take-all update in Step 3 (Figure 4), which reduces the partial error $\|\mathbf{x} - \mathbf{w}^c\|^2$ in the steepest direction, does not always reduce the total quantization error $E_Q$, repetition of the update can reduce $E_Q$ on the basis of the stochastic gradient descent method [15, 16]. As a termination condition, a maximum number of iterations $N_r$ (the number of maximum repetitions) can be used. After termination, the step of deciding the clusters $\mathcal{C}$ as in Figure 2b is required for clustering purposes.

Figure 4. Update of the winner's weight in competitive learning.

Contrary to what one might naturally expect, competitive learning outperforms k-means without any contrivance in most cases. Furthermore, information obtained in the learning process allows us to improve its performance. The splitting rule [12] utilizes the number of times each weight $\mathbf{w}$ wins to estimate the density around it. As Figure 5a and b show, a higher density of input vectors around a weight vector makes $\mathbf{w}^a$ win more frequently than $\mathbf{w}^b$.

Figure 5. Density of input vectors around a weight vector.
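The winner-take-all loop of Eqs. (10)-(11) can be sketched as follows (again an illustration under this chapter's definitions, not the author's implementation; it assumes NumPy, weights initialized from randomly chosen inputs, and a fixed number of updates as the termination condition).

```python
import numpy as np

def competitive_learning(X, K, gamma=0.05, n_updates=100_000, seed=0):
    """Winner-take-all updates of Eqs. (10)-(11), followed by a final cluster assignment."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=K, replace=False)].astype(float)   # initial weights
    for _ in range(n_updates):
        x = X[rng.integers(len(X))]                    # 1. pick one input vector at random
        c = np.argmin(np.sum((W - x) ** 2, axis=1))    # 2. winner, Eq. (10); ties -> smallest k
        W[c] = (1 - gamma) * W[c] + gamma * x          # 3. move the winner toward x, Eq. (11)
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return d.argmin(axis=1), W                         # clusters decided as in Figure 2b
```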


The splitting rule in competitive learning [12] aims to overcome the problem of a discrepancy between the distribution of the input vectors $\mathcal{X}$ and that of the weight vectors $\mathcal{W}$. The discrepancy causes a few weight vectors $\mathbf{w}$ to monopolize $\mathcal{X}$ and leads to a solution of very poor quality, but it is impossible to figure out the distribution of the input vectors beforehand. Accordingly, the splitting rule redistributes weight vectors $\mathbf{w}$ during the learning process.


## 3. Information-theoretic clustering and algorithms

Information-theoretic clustering (ITC) [8] is closely related to work on distributional clustering [17–19] and uses the Kullback-Leibler divergence and the Jensen-Shannon divergence to determine its criterion. Though there exists a difficulty in its algorithms, ITC is effective for high-dimensional count data (e.g., text data), and its definition and properties are similar to those of the sum-of-squared-error criterion. The main contributions of this chapter are to present a technique to overcome the difficulty and to show the effectiveness of ITC.

Let $\mathcal{X} = \{\mathbf{x}^i \mid \mathbf{x}^i \in \mathbb{R}_{+}^{M};\, i = 1, \ldots, N\}$ be a set of $M$-dimensional input vectors ($N$ denotes the number of input vectors), where the elements of the vectors $\mathbf{x}$ are nonnegative real numbers. We define the $l_1$-norm of an input vector $t^i$ $(= \sum_m |x_m^i|)$, normalized input vectors $\mathbf{p}^i = \mathbf{x}^i / t^i$, and an input probability distribution $P^i$ whose $m$th random variable takes the $m$th element of $\mathbf{p}^i$ $(= p_m^i)$. Let $\mathcal{P} = \{P^1, \ldots, P^N\}$ be a set of input distributions (input data).

Suppose that we assign each distribution $P^i$ a cluster label $k\,(k = 1, \ldots, K)$ to partition the distributions into $K$ clusters $\mathcal{C} = \{C^1, \ldots, C^K\}$.

Let $\overline{P}^k$ be the distribution given by the mean of the input data $P^i$ which belong to the cluster $C^k$ (see Figure 6). Then, the generalized Jensen-Shannon (JS) divergence to be minimized in $C^k$ is defined by

$$D_{\mathrm{JS}}(\{P^{i} \mid P^{i}\in C^{k}\})=\sum_{P^{i}\in C^{k}}\pi^{i}D_{\mathrm{KL}}(P^{i}\,\|\,\overline{P}^{k}),\ \ \overline{P}^{k}=\sum_{P^{i}\in C^{k}}\pi^{i}P^{i},\tag{12}$$

where $N_k$ is the number of distributions $P^i$ in cluster $C^k$ (i.e., $N = \sum_{k=1}^{K} N_k$), $D_{\mathrm{KL}}(P^i \,\|\, \overline{P}^k)$ is the Kullback-Leibler (KL) divergence to the mean distribution $\overline{P}^k$ from $P^i$, and $\pi^i$ is the probability of $P^i$ ($\sum_{P^i \in C^k} \pi^i = 1$). Here $\pi^i = 1/N_k$. Then, we define the within-cluster JS divergence $JS_W$, which considers all clusters $C^k\,(k = 1, \ldots, K)$, as

$$JS_W = \sum_{k=1}^{K} \frac{N_k}{N} D_{\mathrm{JS}}(\{P^i \mid P^i \in C^k\}) \tag{13}$$

$$=\frac{1}{N}\sum_{k=1}^{K}\sum_{P^i \in C^k} D_{\mathrm{KL}}(P^i \,\|\, \overline{P}^k) \tag{14}$$

$$= \frac{1}{N} \sum_{k=1}^{K} \sum_{P^i \in C^k} \sum_{m=1}^{M} p_m^i \log \frac{p_m^i}{\overline{p}_m^k} = \frac{1}{N} \sum_{k=1}^{K} \sum_{P^i \in C^k} \sum_{m=1}^{M} \left( p_m^i \log p_m^i - p_m^i \log \overline{p}_m^k \right). \tag{15}$$

The within-cluster JS divergence $JS_W$ is the objective function (criterion) of information-theoretic clustering (ITC) to be minimized [8]. We also define the between-cluster JS divergence $JS_B$ and the total JS divergence $JS_T$ as

$$JS_B = D_{\mathrm{JS}}(\{\overline{P}^k \mid k = 1, \ldots, K\}) = \sum_{k=1}^{K} \pi^k D_{\mathrm{KL}}(\overline{P}^k \,\|\, \overline{P}),\ \pi^k = N_k / N, \tag{16}$$

$$JS_T = D_{\mathrm{JS}}(\{P^{i} \mid i = 1, \ldots, N\}) = \sum_{i=1}^{N} \pi^{i} D_{\mathrm{KL}}(P^{i} \,\|\, \overline{P}),\ \pi^{i} = 1/N,\tag{17}$$

where $\overline{P} = \sum_{i=1}^{N}\pi^{i} P^{i} = \frac{1}{N}\sum_{i=1}^{N} P^{i}$ is the distribution given by the mean of all input data. It follows from these definitions that the total JS divergence is the sum of the within-cluster JS divergence and the between-cluster JS divergence [8]:

$$JS_T = JS_W + JS_B.\tag{18}$$

Since $JS_T$ is constant for the given input distributions $\mathcal{P}$, minimization of $JS_W$ is equivalent to maximization of $JS_B$. In this sense, clustering based on minimizing the criterion $JS_W$ works to find clusters that are well separated from each other.

The definition and properties of ITC shown so far are similar to those of the sum-of-squared-error criterion. This correspondence helps us to understand ITC.
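In parallel with the check of Eq. (4), the sketch below (illustrative, assuming NumPy and random toy count data; not from the original chapter) computes $JS_W$, $JS_B$, and $JS_T$ for an arbitrary partition and confirms the decomposition of Eq. (18). The helper `kl` uses the convention $0 \log 0 = 0$.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q), with the convention 0 * log 0 = 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 20)).astype(float)   # toy count data (60 "documents")
X[X.sum(axis=1) == 0, 0] = 1                          # guard against all-zero rows
P = X / X.sum(axis=1, keepdims=True)                  # normalized input distributions P^i
labels = rng.integers(0, 4, size=60)                  # an arbitrary partition into K = 4
N, K = len(P), 4

P_bar = P.mean(axis=0)                                                    # mean of all inputs
JS_W = sum(kl(P[i], P[labels == labels[i]].mean(axis=0)) for i in range(N)) / N   # Eq. (14)
JS_B = sum((labels == k).sum() / N * kl(P[labels == k].mean(axis=0), P_bar)
           for k in range(K) if np.any(labels == k))                      # Eq. (16)
JS_T = sum(kl(P[i], P_bar) for i in range(N)) / N                         # Eq. (17)

assert np.isclose(JS_T, JS_W + JS_B)                                      # Eq. (18)
```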

#### 3.1. Generative model


In the background of information-theoretic clustering (ITC), there also exists the bag-of-words assumption [20], which disregards the order of words in a document. (Since ITC is not limited to document clustering, "word" is just an example of a feature.) It means that the features in the data are conditionally independent and identically distributed, where the condition is a given probability distribution for an input vector. Based on this assumption, we describe a generative probabilistic model related to ITC and make clear the relationship between the model and the objective function (criterion) $JS_W$.

Let an input vector $\mathbf{x} = \{x_1, \ldots, x_m, \ldots, x_M\}$ represent the numbers of observations of the $m$th features. Suppose that there are clusters $C^k\,(k = 1, \ldots, K)$ which generate $t^i$ features for a data item (= input vector) with the probability distribution $\overline{P}^k = \{\overline{p}_1^k, \ldots, \overline{p}_M^k\}$, and that the conditional probability of a set of observations about the features in an input vector $\mathbf{x}^i$ is expressed by the multinomial distribution

$$p(\mathbf{x}^i \mid \mathbf{x}^i \in C^k) = A^i \prod_{m=1}^M \left(\overline{p}_m^k\right)^{x_m^i}, \ A^i = \frac{t^i!}{x_1^i!\, x_2^i! \cdots x_M^i!}, \ t^i = \sum_{m=1}^M |x_m^i|, \tag{19}$$

where $A^i$ is the number of combinations of the observations. Assuming independence of each generation, the joint probability function for the input vectors $\mathcal{X} = \{\mathbf{x}^1, \ldots, \mathbf{x}^N\}$ becomes

$$p(\mathcal{X}|\mathcal{C}) = \prod\_{k=1}^{K} \prod\_{\mathbf{x}^{i} \in \mathcal{C}^{k}} A^{i} \prod\_{m=1}^{M} \left(\overline{p}\_{m}^{k}\right)^{\mathbf{x}\_{m}^{i}},\tag{20}$$

where C indicates cluster information that specifies which input vector x<sup>i</sup> belongs to cluster C<sup>k</sup> . Taking the logarithm of Eq. (20) yields

$$\ln p(\mathcal{X}|\mathcal{C}) = \sum_{i=1}^{N} \log A^{i} + \sum_{k=1}^{K} \sum_{\mathbf{x}^{i} \in C^{k}} \sum_{m=1}^{M} x_{m}^{i} \log \overline{p}_{m}^{k} \tag{21}$$

$$=\sum\_{i=1}^{N}\log A^{i}+\sum\_{k=1}^{K}\sum\_{p^{i}\in\mathcal{C}^{k}}\sum\_{m=1}^{M}t^{i}\cdot p^{i}\_{m}\log\overline{p}^{k}\_{m}.\tag{22}$$

This is a generative probabilistic model related to ITC. If we assume that $t^i$ takes a constant value $t$ for all input vectors, maximization of the probability (22) as well as minimization of the objective function $JS_W$ (15) both reduce to the minimization of

$$\frac{1}{N} \sum\_{k=1}^{K} \sum\_{\substack{p^i \in \mathcal{C}^k \ m=1}} \sum\_{m=1}^{M} -p\_m^i \log \overline{p}\_m^k = \frac{1}{N} \sum\_{k=1}^{K} N\_k \sum\_{m=1}^{M} -\overline{p}\_m^k \log \overline{p}\_m^k,\tag{23}$$

for the given input distributions $\mathcal{P}$. Here, the relationship $\sum_{P^i \in C^k} p_m^i = N_k \overline{p}_m^k$ is used. Since $t^i$ may not take a constant value $t$, the generative model (22) is not an equivalent model of ITC but a related model.

Figure 6. Input distributions and the mean in Ck .

This difference comes from the fact that the model treats each observation of a feature equally, while ITC treats each data item (input vector) equally. Though the additional assumption $t^i = t$ is required, ITC works to find the most probable solution $\mathcal{C}$ in the generative probabilistic model. Furthermore, as Eq. (23) shows, the criterion is also based on the minimization of the entropy within clusters. Entropy (specifically, Shannon entropy) is the expected value of the information contained in each message, which is an input distribution here. The smaller the entropy becomes, the more compactly a model can explain the observations (input distributions). In this sense, the objective function $JS_W$ (15) represents the goodness of the generative model. The relationship (including the difference) between the probabilistic model and the objective function $JS_W$ is meaningful for improving the model and the objective function in the future.

The choice of an appropriate model for data is important when analyzing them. For example, large sets of text documents contain many kinds of words and are presented as high-dimensional vectors. Taking the extreme diversity of documents' topics into account, the feature vectors of documents are distributed almost uniformly in the vector space. As known from "the curse of dimensionality" [10], most of the volume of a sphere in high-dimensional space is concentrated near the surface, and it is not appropriate to choose a model based on a Gaussian distribution, which concentrates values around the mean. In contrast, ITC, which is based on the multinomial distribution, is a reasonable and useful tool to analyze such high-dimensional count data, because the generative model of ITC is consistent with them.

We introduce weight distributions $Q^k\,(k = 1, \ldots, K)$ (collectively $\mathcal{Q}$) that represent the clusters $C^k$ and that play the role of the mean distribution $\overline{P}^k$ and of the prototype distribution of cluster $C^k$, in a manner similar to that of the sum-of-squared-error (SSE) criterion (see Section 2.2.1). Figure 7 shows the relationships between parameters in the generative models. Parameters are generated or estimated from other parameters to maximize the probability of the generative model. For example, clustering is the task of finding the most probable clusters $\mathcal{C}$ for given input vectors $\mathcal{X}$ or input distributions $\mathcal{P}$. In Figure 7b, constructing a classifier is the task of finding $\mathcal{Q}$ for given $\mathcal{P}$ and $\mathcal{C}$ (classes in this context) in the training process. Then, it estimates $\mathcal{C}$ for unknown $\mathcal{P}$ using the trained $\mathcal{Q}$. The classifier using the multinomial distribution is known as the multinomial Naive Bayes classifier [21]. As this shows, ITC and the Naive Bayes classifier have a close relationship [18].

#### 3.2. Algorithms

There exists difficulty (Appendix A) in algorithms for ITC. We show a novel idea to overcome it.

Figure 7. Relationships between model parameters. (a) Clustering based on SSE criterion. (b) Information-theoretic clustering

#### 3.2.1. Competitive learning


When competitive learning decides a winner for an input distribution $P$, it easily faces the difficulty of calculating the KL divergence from $P$ to the weight distributions $\mathcal{Q}$ (see Appendix A). To overcome this difficulty, we present the idea of changing the order of the steps in competitive learning (CL). As shown in Figure 8b, CL updates all weights (= weight distributions) before deciding the winner by

$$
\mathbb{Q}^k \leftarrow (1 - \gamma)\mathbb{Q}^k + \gamma P,\tag{24}
$$

where $\gamma$ is a learning rate. Since the updated weight distributions $Q^k\,(k = 1, \ldots, K)$ include all the words (features) of the input distribution $P$, it is possible to calculate the KL divergence $D_{\mathrm{KL}}(P \,\|\, Q^k)$ for all $k$. In the following steps, CL decides a winner $Q^c$ from $\mathcal{Q}$ by

$$c = \arg\min\_{k} D\_{\text{KL}}(P \| Q^{k}) \text{ (If there are several candidates, choose the smallest k),} \qquad (25)$$

and activates the winner's update, discarding the others. These steps satisfy the CL requirement that the value of the objective function $JS_W$ be partially reduced in the steepest direction with the given learning rate $\gamma$. Here, neither approximation nor distortion is added to the criterion of ITC. Note that the updates of the weight distributions $Q^k$ before activation are provisional (see Figure 8b).

Related work that avoids the difficulty of calculating the KL divergence introduced the skew divergence [22]. The skew divergence is defined as

$$s\_a(P, Q) = D\_{\rm KL}(P \| \alpha Q + (1 - \alpha)P),\tag{26}$$

where $\alpha\,(0 \le \alpha \le 1)$ is the mixture ratio of the distributions. The skew divergence is exactly the KL divergence at $\alpha = 1$. When $\alpha = 1 - \gamma$, Eq. (26) becomes similar to Eq. (24). Then, we can rewrite the steps of CL above using the skew divergence as


Figure 8. Flow charts of competitive learning. (a) Competitive learning for SSE. (b) Competitive learning for ITC.

2. Decide a winner $Q^c$ from $\mathcal{Q}$ by

$$c = \arg\min_{k} s_{\alpha}(P, Q^k) \quad \text{(if there are several candidates, choose the smallest } k), \tag{27}$$

3. Update the winner's weight distribution $Q^c$ as

$$Q^\epsilon \leftarrow (1 - \gamma) Q^\epsilon + \gamma P,\tag{28}$$

where $\gamma$ is a learning rate, usually equal to $1 - \alpha$ ($\alpha$ being the mixture ratio for $s_\alpha$).

Hence, we call this novel algorithm for ITC "competitive learning using skew divergence" (sdCL). In addition, the splitting rule in competitive learning [12] can also be applied to this algorithm.
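A compact sketch of sdCL under the definitions above might look as follows (an illustrative NumPy implementation, not the author's code; weight initialization from randomly chosen inputs, a fixed update budget as the termination condition, and the omission of the splitting rule are simplifying assumptions). The rows of `P` are the normalized input distributions $P^i$.

```python
import numpy as np

def skew_divergence(p, q, alpha):
    """s_alpha(P, Q) = D_KL(P || alpha*Q + (1 - alpha)*P), Eq. (26)."""
    mix = alpha * q + (1 - alpha) * p
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / mix[mask]))

def sdcl(P, K, gamma=0.01, n_updates=100_000, seed=0):
    """Competitive learning for ITC using the skew divergence (alpha = 1 - gamma)."""
    rng = np.random.default_rng(seed)
    Q = P[rng.choice(len(P), size=K, replace=False)].copy()   # initial weight distributions
    for _ in range(n_updates):
        p = P[rng.integers(len(P))]                           # pick one input distribution
        # Winner selection via the skew divergence, Eq. (27); this equals computing
        # D_KL(p || Q^k) after the provisional update of Eq. (24).
        c = np.argmin([skew_divergence(p, Q[k], 1 - gamma) for k in range(K)])
        Q[c] = (1 - gamma) * Q[c] + gamma * p                 # activate only the winner, Eq. (28)
    # Decide the clusters with the trained weight distributions.
    labels = np.array([np.argmin([skew_divergence(p, Q[k], 1 - gamma) for k in range(K)])
                       for p in P])
    return labels, Q
```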

#### 3.2.2. k-means type algorithm

Dhillon et al. [8] proposed the information-theoretic divisive algorithm, which is a k-means type algorithm with a divisive mechanism and uses the KL divergence.<sup>1</sup> However, the difficulty of using the KL divergence directly still remains.

<sup>1</sup> The algorithm was proposed for feature/word clustering and applied to text classification. Since the algorithm uses the document class (labeled data), it cannot be applied to the general clustering problem.

In such a situation, we propose to use the skew divergence instead of the KL divergence in a k-means type algorithm as follows.

2. Update each weight distribution $Q^k$ as


$$Q^k = \frac{1}{N\_k} \sum\_{P^i \in \mathcal{C}^k} P^i. \tag{29}$$

3. Update the cluster $c$ of each input distribution $P^i$ by

$$c = \arg\min_{k} s_{\alpha}(P^i, Q^{k}) \quad \text{(if there are several candidates, choose the smallest } k), \tag{30}$$

where the mixture ratio $\alpha\,(0 \le \alpha \le 1)$ for the skew divergence $s_\alpha$ is, for example, 0.99.

4. Repeat 2 and 3 until the change ratio of the objective function $JS_W$ is less than a small value (e.g., $10^{-8}$).

The algorithm itself works well and obtains valuable clustering results. Further, if $\alpha$ is close to 1, the skew divergence $s_\alpha$ becomes a good approximation of the KL divergence. Therefore, restarting the learning after termination with $\alpha$ closer to 1, such as $0.999, 0.9999, \ldots$, may lead to better clustering results.
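The procedure above can be sketched as follows (illustrative NumPy code under the stated steps, not the author's implementation; the random-labeling initialization, the iteration cap, the empty-cluster guard, and the omission of the restart schedule with α moved closer to 1 are assumptions; the `kl` and `skew_divergence` helpers are repeated so the sketch is self-contained).

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q), with the convention 0 * log 0 = 0."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def skew_divergence(p, q, alpha):
    """s_alpha(P, Q) = D_KL(P || alpha*Q + (1 - alpha)*P), Eq. (26)."""
    return kl(p, alpha * q + (1 - alpha) * p)

def sdkm(P, K, alpha=0.99, max_iter=200, tol=1e-8, seed=0):
    """k-means type algorithm for ITC using the skew divergence, Eqs. (29)-(30)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(P))               # random labeling initialization
    prev = np.inf
    for _ in range(max_iter):
        # 2. Each weight distribution is the mean of the distributions in its cluster, Eq. (29).
        Q = np.array([P[labels == k].mean(axis=0) if np.any(labels == k)
                      else P[rng.integers(len(P))] for k in range(K)])
        # 3. Reassign every input distribution to its closest weight distribution, Eq. (30).
        labels = np.array([np.argmin([skew_divergence(p, q, alpha) for q in Q]) for p in P])
        # 4. Terminate when the relative change of the objective JS_W is below the tolerance.
        js_w = np.mean([kl(P[i], P[labels == labels[i]].mean(axis=0)) for i in range(len(P))])
        if abs(prev - js_w) <= tol * max(js_w, 1e-12):
            break
        prev = js_w
    return labels, Q
```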

#### 3.2.3. Other algorithms


Slonim and Tishby [23] proposed an agglomerative hierarchical clustering algorithm, which is a hard clustering version of the Information Bottleneck algorithm of Tishby et al. [24]. It is similar to the algorithm of Baker and McCallum [18] and merges just two clusters at every step based on the JS divergence of their distributions. A merit of the agglomerative algorithms is that they are not affected by the difficulty of calculating the KL divergence, because they just use the JS divergence. However, a merge of clusters at each step optimizes a local criterion but not a global criterion, as Dhillon et al. [8] pointed out. Therefore, the clustering results may not be as good as the results obtained by nonhierarchical algorithms (e.g., k-means and competitive learning) in the sense of optimizing the objective function of ITC. Additionally, hierarchical algorithms are computationally expensive when the number of inputs is large.

Note that many studies [8, 18, 23] aimed at improving the accuracy of text classification using feature/word clustering based on ITC or distributional clustering. If clustering is just a step toward a final goal, feature clustering is meaningful. However, features which characterize clusters should not be merged when we aim to find clusters (topics) from a set of documents. Actually, finding topics using clustering is the aim of this chapter.

## 4. Evaluation of clustering

Since clustering results depend on methods (criteria and algorithms), appropriate selection of them is important. So far, we have introduced two criteria for clustering. These are called internal criteria; they depend on their own models and are not enough for evaluation. If the criterion for clustering is common, we can compare clustering results by the objective function of the criterion. Under a certain model, which is in other words an assumption, a more probable result can be regarded as a better result. However, it is not guaranteed that the model or the assumption is reasonable at all times. Moreover, good clustering results under a certain criterion can be bad results under different criteria. A view from the outside is required.

This section introduces external criteria, namely Purity, the Rand index (RI), and Normalized mutual information (NMI) [25], to evaluate clustering quality and to find better clustering methods. These criteria compare clusters with a set of classes, which are produced on the basis of human judgment. Here, each input data item belongs to one of the classes $A^j\,(j = 1, \ldots, J)$ and to one of the clusters $C^k\,(k = 1, \ldots, K)$. Let $T(C^k, A^j)$ be the number of data items that belong to both $C^k$ and $A^j$.

Purity is measured by counting the number of input data from the most frequent class in each cluster. Purity can be computed as

$$\text{purity} = \frac{1}{N} \sum\_{k=1}^{K} \max\_{j} T(\mathbf{C}^{k}, A^{j}), \tag{31}$$

where N is the total number of input data. Purity is close to 1, when each cluster has one dominant class.

The Rand index (RI) checks all of the $N(N-1)/2$ pairs of input data and is defined by

$$\text{RI} = \frac{\mathbf{a} + \mathbf{b}}{\mathbf{a} + \mathbf{b} + \mathbf{c} + \mathbf{d}},\tag{32}$$

where a, b, c, and d are the numbers of pairs in the following conditions:

• "a," where the cluster number (suffix) is the same and the class number is the same

• "b," where the cluster numbers are different and the class numbers are different

• "c," where the cluster number is the same and the class numbers are different

• "d," where the cluster numbers are different and the class number is the same


The Rand index (RI) measures the percentage of agreements a+b in clusters and classes. Normalized mutual information (NMI) is defined as

$$\text{NMI} = \frac{I(\mathbb{C}; A)}{\left(H(\mathbb{C}) + H(A)\right)/2},\tag{33}$$

where $I(C; A)$ is the mutual information and $H(\cdot)$ is the entropy, with


$$\begin{split} I(C; A) &= \sum_{k=1}^{K} \sum_{j=1}^{J} P(C^{k}, A^{j}) \log \frac{P(C^{k}, A^{j})}{P(C^{k})P(A^{j})} \\ &= \sum_{k=1}^{K} \sum_{j=1}^{J} \frac{T(C^{k}, A^{j})}{N} \log \frac{T(C^{k}, A^{j})\,N}{T(C^{k})\,T(A^{j})}, \end{split} \tag{34}$$

$$H(C) = \sum_{k=1}^{K} -P(C^k) \log P(C^k) = \sum_{k=1}^{K} -\frac{T(C^k)}{N} \log \frac{T(C^k)}{N},\tag{35}$$

$$H(A) = \sum_{j=1}^{J} -P(A^j)\log P(A^j) = \sum_{j=1}^{J} -\frac{T(A^j)}{N}\log\frac{T(A^j)}{N},\tag{36}$$

where <sup>P</sup>ðCk <sup>Þ</sup>, <sup>P</sup>ðAj <sup>Þ</sup>, and <sup>P</sup>ðC<sup>k</sup> ;Aj <sup>Þ</sup> are the probability of data being in cluster <sup>C</sup><sup>k</sup> , class A<sup>j</sup> , and in the intersection of C<sup>k</sup> and A<sup>j</sup> , respectively. Mutual information IðC; AÞ measures the mutual dependence between clusters C and classes A. It quantifies the amount of information obtained for classes through knowing about clusters. Hence, high NMI shows some kind of goodness about clustering in information theory.

## 5. Experiments


This section provides experimental results that show the effectiveness and usefulness of ITC and the proposed algorithm (sdCL: competitive learning using skew divergence). Experiments consist of two parts, experiment1 and experiment2.

In experiment1, we applied sdCL to the same data sets as used in the paper of Wang et al. [26] and compared the performance of sdCL with that of the other clustering algorithms evaluated there. The algorithms evaluated in [26] are as follows.


As shown above, maximum margin clustering (MMC) [7] and related works have attracted much attention. These works extend the idea of the support vector machine (SVM) [30] to the unsupervised scenario. The experimental results obtained by the MMC technique are often better than those of conventional clustering methods. Among these, CPMMC and CPM3C (cutting plane multiclass maximum margin clustering) [26] are known as successful methods. The experimental results will show that the proposed algorithm sdCL outperforms CPM3C in text data clustering.

In experiment2, we focus on text data clustering and compare the performance of three algorithms: sdCL, sdCLS (sdCL with the splitting rule, see Sections 2.2.2 and 3.2.1), and spherical competitive learning (spCL). We also provide evidence supporting the effectiveness of ITC through a detailed analysis of the clustering results.

spCL is an algorithm for spherical clustering, like the spherical k-means algorithm [5], which was proposed for clustering high-dimensional and sparse data such as text data. The objective function to be maximized for spherical clustering is the cosine similarity between the input vectors and the mean vector of the cluster to which they belong. To implement spCL, we turn the input and weight vectors (x, w) into unit vectors and decide the winner $\mathbf{w}^c$ by

$$c = \arg\max_{k}\, \cos(\mathbf{x}, \mathbf{w}^{k}) \quad (\text{if there are several candidates, choose the smallest } k), \tag{37}$$

and update the winner's weight w<sup>c</sup> as

$$
\mathbf{w}^{c} \leftarrow \frac{(1 - \gamma)\mathbf{w}^{c} + \gamma \mathbf{x}}{\|(1 - \gamma)\mathbf{w}^{c} + \gamma \mathbf{x}\|}. \tag{38}
$$

For all competitive learning algorithms, the learning rate $\gamma = 0.01$, the maximum number of repetitions for updating weights $N_r = 1{,}000{,}000$ (termination condition), and the threshold of counts for the splitting rule $\theta = 1000$ are used. After competitive learning (sdCL, sdCLS, or spCL) is terminated, we apply a k-means-type algorithm as post-processing to remove fluctuations. Specifically, sdKM (the k-means-type algorithm using skew divergence shown in Section 3.2.2) with $\alpha = 0.999, 0.9999, 0.99999$ is applied consecutively after sdCL and sdCLS. In each learning procedure, including post-processing, the operation is repeated 50 times with different initial random seeds for a given set of parameters.
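As an illustration of the spherical competitive learning just described, the sketch below (Python with NumPy; the function and variable names are ours) implements the winner selection of Eq. (37) and the update of Eq. (38) in a simplified, single-pass form; it omits the splitting rule and the k-means-type post-processing used in the experiments.

```python
import numpy as np

def spcl(X, K, gamma=0.01, n_updates=100_000, seed=0):
    """Spherical competitive learning (spCL) sketch.

    X : (N, M) array of input vectors (e.g., term-frequency vectors);
        rows are normalized to unit vectors.  K : number of clusters.
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)            # unit input vectors
    W = X[rng.choice(len(X), size=K, replace=False)].copy()     # initialize weights from the data

    for _ in range(n_updates):
        x = X[rng.integers(len(X))]              # present one input at random
        sims = W @ x                             # cosine similarities (all vectors are unit vectors)
        c = int(np.argmax(sims))                 # winner; argmax takes the smallest index on ties (Eq. (37))
        w = (1.0 - gamma) * W[c] + gamma * x     # move the winner toward the input (Eq. (38))
        W[c] = w / np.linalg.norm(w)             # renormalize onto the unit sphere

    labels = np.argmax(X @ W.T, axis=1)          # assign each input to its most similar weight vector
    return W, labels
```

In the experiments reported here, γ = 0.01 and up to $N_r$ = 1,000,000 updates are used, followed by the k-means-type post-processing described above.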

#### 5.1. Data sets

We mainly use the same data sets as used in the paper of Wang et al. [26]. When applying algorithms for ITC, we use probability distributions $P_i$ ($i = 1, \dots, N$) (denoted collectively by P) derived from the original data.

1. UCI data. From the UCI repository,<sup>2</sup> we use ionosphere, digits, letter, and satellite under the same settings as the paper [26]. The digits data (8 × 8 matrices) are generated from bitmaps of handwritten digits. The pairs (3 vs. 8, 1 vs. 7, 2 vs. 7, and 8 vs. 9) are used because they are difficult to differentiate. For the letter and satellite data sets, their first two classes are used. Since the ionosphere data contain negative values and cannot be transformed into probability distributions, we do not apply ITC to them.

<sup>2</sup> http://archive.ics.uci.edu/ml/[Accessed: 2016-10-25].

2. Text data. Four text data sets are used: 20Newsgroups (http://qwone.com/∼jason/20Newsgroups/ [Accessed: 2016-10-25]), WebKB,<sup>3</sup> Cora [31], and RCV1 (Reuters Corpus Volume 1) [32]. In experiment1, we follow the settings of the paper [26]. For the 20Newsgroups data set, the topic "rec," which contains four topics {autos, motorcycles, baseball, hockey}, is used. From these four topics, two two-class data sets {Text-1: autos vs. motorcycles, Text-2: baseball vs. hockey} are extracted. From the WebKB data sets, the four-universities data set (Cornell, Texas, Washington, and Wisconsin), which has seven classes (student, faculty, staff, department, course, project, and other), is used. Note that the topic of the "other" class is ambiguous and may contain various topics (e.g., faculty), because it is a collection of pages that were not deemed the "main page" representing an instance of the other six classes, as pointed out on the web page of the data set. The Cora data set (Cora research paper classification) [31] is a set of information about research papers classified into a topic hierarchy. From this data set, papers in the subfields {data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS), programming language (PL)} are used. We select papers that contain a title and an abstract. The RCV1 data set contains more than 800 thousand documents to which topic categories are assigned. The documents in the training set with the four highest topic codes (CCAT, ECAT, GCAT, and MCAT) in the topic code hierarchy are used. Multi-labeled instances are removed.

In experiment2, we use all of the 20Newsgroups and RCV1 data sets. For the RCV1 data set, we obtain 53 classes (categories) by mapping the data set to the second level of the RCV1 topic hierarchy and removing multi-labeled instances. For the WebKB data set, we remove the "other" class due to its ambiguity, use the other six classes, and do not use the university information.

For all text data, we remove stop words using the stop list of [32], as well as any documents that become empty. In experiment1, we follow the settings of the paper [26], but the properties of the data sets are slightly different (see Table 3). For the Cora data sets, the differences in data sizes are large. However, they should retain the same (or at least similar) characteristics (e.g., distributions of words and topics), because they are extracted from the same source.

3. Digits data. USPS (16 × 16) and MNIST (28 × 28) are data sets of handwritten digit images.<sup>4</sup> For the USPS data set, the digit images 1, 2, 3, and 4 are used. For MNIST and the digits data from the UCI repository, all 45 pairs of the digits 0–9 are used as two-class problems.

The properties of those data sets are listed in Table 3.

#### 5.2. Results of experiment1


The clustering results are shown in Tables 4–7, where the values (except for sdCL) are the same as in the paper of Wang et al. [26] (accuracy in that paper is equivalent to purity by its definition). In two-class problems, CPMMC outperforms the other algorithms in terms of purity and Rand index (RI) in most cases. The proposed algorithm sdCL shows stable performance, except for ionosphere, to which sdCL cannot be applied. In multiclass problems, sdCL outperforms the other algorithms for the text data (Cora, 20Newsgroups-4, and Reuters-RCV1-4). The results show that ITC and the proposed algorithm sdCL are effective for text data sets. Note that CPM3C shows better results than sdCL for the WebKB data. However, the topic of the "other" class in WebKB is ambiguous (see Section 5.1). Its occupation ratios are large {0.710, 0.689, 0.777, 0.739} and almost the same as the purity values of CPM3C and sdCL. This means that, in terms of purity, these algorithms failed to find meaningful clusters. Therefore, the WebKB data are not appropriate for evaluation without removing the "other" class.

<sup>3</sup> http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/ [Accessed: 2016-10-25].

<sup>4</sup> http://www.kernel-machines.org/data [Accessed: 2016-10-25].

#### 5.3. Results of experiment2

In experiment2, we focus on text data clustering. Table 8 shows that the proposed algorithms for ITC (sdCL and sdCLS) outperform spCL in purity, RI, and NMI. Considering that spCL is an algorithm for spherical clustering [5], which was proposed to analyze high-dimensional data such as text documents, the criterion of information-theoretic clustering is worth using for this purpose.


Bold fonts indicate the maximum purities for a given data set.

Table 4. Purity comparisons for two-class problems.


Bold fonts indicate the maximum Rand indices for a given data set.

Table 5. Rand Index (RI) comparisons for two-class problems.

| Data | Size (N) | Feature (M) | Class (K) |
| --- | --- | --- | --- |
| Ionosphere | 351 | 34 | 2 |
| Letter | 1555 | 16 | 2 |
| Digits | 1555 | 64 | 2 |
| Satellite | 2236 | 36 | 2 |
| Text-1 | 1981 | 16,259 | 2 |
| Text-2 | 1987 | 15,955 | 2 |
| 20Newsgroups-4 | 3967 | 24,506 | 4 |
| 20Newsgroups-20 | 18,772 | 60,698 | 20 |
| Cora-DS | 2397 | 5745 | 9 |
| Cora-HA | 913 | 3340 | 7 |
| Cora-ML | 3569 | 6809 | 7 |
| Cora-OS | 2084 | 5029 | 4 |
| Cora-PL | 3026 | 6069 | 9 |
| WebKB-Cornell | 835 | 5574 | 7 |
| WebKB-Texas | 808 | 4482 | 7 |
| WebKB-Washington | 1191 | 7779 | 7 |
| WebKB-Wisconsin | 1218 | 8270 | 7 |
| WebKB6 | 4219 | 14,142 | 6 |
| Reuters-RCV1-4 | 19,806 | 44,214 | 4 |
| Reuters-RCV1-53 | 534,135 | 216,704 | 53 |
| MNIST | 70,000 | 784 | 2 |
| USPS | 3046 | 256 | 4 |

Table 3. Properties of data sets.

Table 8 also shows that sdCLS (sdCL with the splitting rule, see Sections 2.2.2 and 3.2.1) is slightly better than sdCL in some cases. As Figure 9 (left, right) shows, the values of JS divergence for sdCLS are smaller ("better" in ITC) than those for sdCL, and sdCLS outperforms sdCL in purity on average. Nevertheless, the advantage of sdCLS over sdCL is not so obvious in this experiment.


| Data | KM | NC | MMC | CPM3C | sdCL |
| --- | --- | --- | --- | --- | --- |
| UCI-digits 0689 | 0.696 | 0.939 | 0.941 | **0.974** | 0.945 |
| UCI-digits 1279 | 0.4042 | 0.9011 | 0.9191 | **0.945** | 0.868 |
| USPS | 0.932 | 0.938 | – | 0.950 | **0.958** |
| Cora-DS | 0.589 | 0.744 | – | 0.746 | **0.823** |
| Cora-HA | 0.385 | 0.659 | – | 0.695 | **0.767** |
| Cora-ML | 0.514 | 0.720 | – | 0.761 | **0.802** |
| Cora-OS | 0.518 | 0.522 | – | 0.730 | **0.735** |
| Cora-PL | 0.643 | 0.675 | – | 0.712 | **0.819** |
| WebKB-Cornell | 0.603 | 0.602 | – | **0.724** | 0.483 |
| WebKB-Texas | 0.604 | 0.602 | – | **0.712** | 0.495 |
| WebKB-Washington | 0.616 | 0.581 | – | **0.752** | 0.426 |
| WebKB-Wisconsin | 0.581 | 0.509 | – | **0.761** | 0.464 |
| 20Newsgroups-4 | 0.581 | 0.496 | – | 0.780 | **0.940** |
| Reuters-RCV1-4 | 0.471 | – | – | 0.703 | **0.800** |

Bold fonts indicate the maximum purities for a given data set.

Table 6. Purity comparisons for multiclass problems.

| Data | KM | NC | MMC | CPM3C | sdCL |
| --- | --- | --- | --- | --- | --- |
| UCI-digits 0689 | 0.4223 | 0.9313 | 0.9483 | **0.9674** | 0.9394 |
| UCI-digits 1279 | 0.4042 | 0.9011 | 0.9191 | **0.9452** | 0.8300 |
| USPS | 0.9215 | 0.9011 | 0.9191 | 0.9452 | **0.9515** |
| Cora-DS | 0.2824 | 0.3688 | – | 0.4415 | **0.5057** |
| Cora-HA | 0.3402 | 0.4200 | – | 0.5980 | **0.6145** |
| Cora-ML | 0.2708 | 0.3103 | – | 0.4549 | **0.5974** |
| Cora-OS | 0.2387 | 0.2303 | – | 0.5916 | **0.6686** |
| Cora-PL | 0.3380 | 0.3397 | – | 0.4721 | **0.4729** |
| WebKB-Cornell | 0.5571 | 0.6143 | – | **0.7205** | 0.7192 |
| WebKB-Texas | 0.4505 | 0.3538 | – | **0.6910** | 0.6895 |
| WebKB-Washington | 0.5352 | 0.3285 | – | **0.7817** | 0.7767 |
| WebKB-Wisconsin | 0.4953 | 0.3331 | – | **0.7425** | 0.7397 |
| 20Newsgroups-4 | 0.3527 | 0.4189 | – | 0.7134 | **0.9360** |
| Reuters-RCV1-4 | 0.2705 | – | – | 0.6235 | **0.8064** |

Bold fonts indicate the maximum Rand indices for a given data set.

Table 7. Rand Index (RI) comparisons for multiclass problems.

Table 8. Comparison for text data sets.

Figure 9. Purity versus JS divergence for 20Newsgroups (left) and Reuters-RCV1 (right) data sets.

Note that the clustering result obtained by dCL (competitive learning using KL divergence) is shown below. Since the values of NMI clearly illustrate that dCL converged to solutions unrelated to the classes, the use of skew divergence is an effective technique to overcome this problem.




In the following, we examine the inside of the clustering results obtained by ITC to clarify whether ITC helps us to find meaningful clusters and candidates for classes in classification. Tables 9 and 10 show the frequent words in the classes and in the clusters obtained by sdCL on the 20Newsgroups data set, respectively. The order of the clusters is arranged so that clusters correspond to classes. Table 11 is the cross table between clusters and classes. As shown in Table 10, the frequent words in some clusters remind us of their characteristics and distinguish them from others. For example, the words in cluster 2: "image graphics jpeg," cluster 6: "sale offer shipping," and cluster 11: "key encryption chip" remind us of the classes (comp.graphics), (misc.forsale), and (sci.crypt), respectively. We can also imagine the characteristics of clusters from the words in the 7th, 8th, 9th, 10th, 13th, 14th, and 16th clusters. These clusters have documents of one dominant class and can be regarded as candidates for classes. However, there are some exceptions. The 1st and 15th clusters have the same word "god," while the classes (alt.atheism), (soc.religion.christian), and (talk.religion.misc) also share the word "god." Cluster 1 and the class (alt.atheism) have the common words "religion evidence," and cluster 1 has many documents of the dominant class (alt.atheism). Cluster 15 and the class (soc.religion.christian) have the common words "jesus bible christ church," and cluster 15 has many documents of the dominant class (soc.religion.christian). On the other hand, there is no cluster that has documents of (talk.religion.misc) as dominant, and the documents of (talk.religion.misc) are mostly shared by clusters 1 and 15. Although there is a mismatch between clusters and classes, the clustering result is still acceptable, because the words in the class (talk.religion.misc) resemble those in the class (soc.religion.christian). We can also find that cluster 4 has many documents of the two classes (comp.sys.ibm.pc.hardware) and (comp.sys.mac.hardware). From Table 9, those classes have similar words except for "mac" and "apple." Thus, ITC missed the difference between the classes but found the cluster capturing their common feature. In this sense, the clustering result is meaningful and useful. Another example is that the documents in the class (talk.politics.mideast) are divided into clusters 17 and 18. This means that ITC found two topics in one class, and the frequent words in the clusters seem to be reasonable (see the 17th and 18th clusters in Table 10). The characteristic of cluster 20, which has the words "mail list address email send," is different from all classes as well as from the other clusters, but cluster 20 has some documents in all classes (see Table 11). This cluster may reveal that all newsgroups include documents with such words. In summary, ITC helps us to find meaningful clusters, even when the clusters obtained by ITC sometimes do not seem to be the same as the expected classes. The detailed analysis of the clustering results above can be regarded as evidence supporting the effectiveness and usefulness of ITC.

| Class | Frequent words |
| --- | --- |
| alt.atheism | god writes people article atheism religion time evidence |
| comp.graphics | image graphics jpeg file bit images software data files ftp |
| comp.os.ms-windows.misc | windows file dos writes article files ms os problem win |
| comp.sys.ibm.pc.hardware | drive scsi card mb ide system controller bus pc writes |
| comp.sys.mac.hardware | mac apple writes drive system problem article mb monitor mhz |
| comp.windows.x | window file server windows program dos motif sun display widget |
| misc.forsale | sale shipping offer mail price drive condition dos st email |
| rec.autos | car writes article cars good engine apr ve people time |
| rec.motorcycles | writes bike article dod ca apr ve ride good time |
| rec.sport.baseball | writes year article game team baseball good games time hit |
| rec.sport.hockey | game team hockey writes play ca games article season year |
| sci.crypt | key encryption government chip writes clipper people article keys system |
| sci.electronics | writes article power good ve work ground time circuit ca |
| sci.med | writes article people medical health disease time cancer patients |
| sci.space | space writes nasa article earth launch orbit shuttle time system |
| soc.religion.christian | god people jesus church christ writes christian christians bible time |
| talk.politics.guns | gun people writes article guns fbi government fire time weapons |
| talk.politics.mideast | people israel armenian writes turkish jews article armenians israeli jewish |
| talk.politics.misc | people writes article president government mr stephanopoulos make time |
| talk.religion.misc | god writes people jesus article bible christian good christ life |

Table 9. Frequent words in classes of 20Newsgroups data set.

Table 10. Frequent words in clusters of 20Newsgroups data set.


| Cluster \ Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 625 | 4 | 9 | 2 | 2 | 4 | 2 | 8 | 0 | 3 | 2 | 0 | 5 | 49 | 12 | 54 | 1 | 5 | 13 | 163 |
| 2 | 1 | 632 | 44 | 19 | 17 | 101 | 10 | 7 | 4 | 8 | 0 | 14 | 32 | 12 | 18 | 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 107 | 707 | 189 | 58 | 60 | 27 | 5 | 0 | 0 | 1 | 12 | 40 | 2 | 3 | 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 74 | 64 | 630 | 711 | 18 | 104 | 3 | 3 | 1 | 0 | 2 | 93 | 6 | 3 | 2 | 0 | 1 | 0 | 1 |
| 5 | 1 | 43 | 53 | 11 | 15 | 712 | 0 | 1 | 0 | 1 | 1 | 9 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 3 |
| 6 | 0 | 9 | 5 | 18 | 25 | 2 | 648 | 17 | 15 | 3 | 0 | 0 | 16 | 1 | 3 | 1 | 1 | 0 | 1 | 0 |
| 7 | 0 | 1 | 2 | 7 | 8 | 0 | 35 | 784 | 65 | 1 | 1 | 0 | 25 | 2 | 7 | 2 | 2 | 2 | 1 | 1 |
| 8 | 1 | 3 | 0 | 0 | 4 | 6 | 7 | 36 | 867 | 2 | 0 | 2 | 10 | 6 | 6 | 0 | 1 | 1 | 0 | 2 |
| 9 | 4 | 0 | 4 | 4 | 0 | 1 | 4 | 3 | 4 | 886 | 18 | 0 | 2 | 3 | 2 | 3 | 0 | 3 | 6 | 3 |
| 10 | 0 | 0 | 1 | 1 | 0 | 1 | 9 | 3 | 2 | 37 | 943 | 0 | 1 | 1 | 1 | 0 | 3 | 2 | 3 | 0 |
| 11 | 2 | 15 | 2 | 4 | 6 | 5 | 5 | 1 | 0 | 1 | 2 | 881 | 36 | 5 | 6 | 0 | 20 | 9 | 15 | 3 |
| 12 | 1 | 11 | 11 | 52 | 33 | 2 | 32 | 14 | 4 | 3 | 0 | 8 | 603 | 28 | 10 | 2 | 1 | 0 | 0 | 1 |
| 13 | 1 | 0 | 1 | 0 | 5 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 5 | 771 | 5 | 3 | 0 | 0 | 2 | 2 |
| 14 | 3 | 9 | 5 | 2 | 3 | 5 | 4 | 4 | 4 | 1 | 1 | 1 | 29 | 17 | 843 | 2 | 4 | 0 | 5 | 5 |
| 15 | 94 | 1 | 2 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 0 | 0 | 0 | 10 | 2 | 863 | 5 | 4 | 6 | 299 |
| 16 | 15 | 2 | 0 | 5 | 8 | 2 | 6 | 36 | 11 | 0 | 1 | 18 | 6 | 8 | 4 | 7 | 805 | 2 | 189 | 93 |
| 17 | 9 | 2 | 1 | 0 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 2 | 1 | 8 | 6 | 2 | 547 | 14 | 10 |
| 18 | 32 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 5 | 3 | 4 | 12 | 12 | 333 | 38 | 12 |
| 19 | 3 | 7 | 8 | 2 | 16 | 3 | 12 | 20 | 4 | 4 | 7 | 13 | 6 | 13 | 11 | 13 | 33 | 12 | 470 | 16 |
| 20 | 6 | 50 | 44 | 33 | 46 | 58 | 53 | 35 | 8 | 34 | 16 | 27 | 66 | 48 | 36 | 22 | 18 | 19 | 8 | 11 |

Table 11. The number of documents about each class in each cluster.

## 6. Conclusion

In this chapter, we introduced information-theoretic clustering (ITC) from both the theoretical and the experimental side. Theoretically, we presented the criterion, a generative model, and novel algorithms for ITC. Experimentally, we showed the effectiveness and usefulness of ITC for text analysis as an important example.

## A Difficulty about KL divergence

Let P and Q be distributions whose mth probabilities $p_m$ and $q_m$ are given by the mth elements of vectors p and q, respectively. The Kullback-Leibler (KL) divergence from P to Q is defined to be

$$D\_{\mathrm{KL}}(P \| Q) = \sum\_m p\_m \log \frac{p\_m}{q\_m}.\tag{39}$$

In this definition, it is assumed that the support set of P is a subset of the support set of Q (if $q_m$ is zero, $p_m$ must be zero). For a given cluster $C^k$, there is no problem in calculating the JS divergence of cluster $C^k$ by Eq. (12), because the support set of any distribution $P_i$ ($\in C^k$) is a subset of that of the mean distribution $\overline{P}^k$. However, it is not guaranteed that the KL divergence from $P_i$ ($\in C^k$) to $Q^t$ ($t \neq k$), a weight distribution of another cluster $C^t$, is finite. This causes a serious problem when searching for a weight distribution Q similar to an input distribution P: the lack of even one word (feature) in a distribution Q is enough to make it dissimilar. Therefore, it is difficult to use a k-means-type algorithm,<sup>5</sup> which updates weights or clusters by batch processing, in ITC.

<sup>5</sup> The k-means algorithm is known as an algorithm for clustering based on the sum-of-squared-error criterion (see Section 2.2.1); therefore, we add the word "type" for information-theoretic clustering.
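The following toy example (Python with NumPy; our own illustration) shows the problem numerically. It uses the commonly cited form of skew divergence, $s_{\alpha}(P \,\|\, Q) = D_{\mathrm{KL}}(P \,\|\, \alpha Q + (1-\alpha)P)$, which is assumed here and may differ in detail from the definition used in Section 3: a single missing word in Q makes the KL divergence infinite, whereas the skew divergence remains finite.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(P || Q), Eq. (39); infinite if some q_m = 0 while p_m > 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / q), 0.0)
    return terms.sum()

def skew(p, q, alpha=0.999):
    """Skew divergence: KL from P to the mixture alpha*Q + (1 - alpha)*P."""
    return kl(p, alpha * q + (1.0 - alpha) * p)

p = np.array([0.5, 0.3, 0.2])   # an input distribution over three words
q = np.array([0.6, 0.4, 0.0])   # a weight distribution that lacks the third word

print(kl(p, q))     # inf: the support of P is not contained in that of Q
print(skew(p, q))   # finite: the mixture retains every word of P
```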

## Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 26330259.

## Author details

Toshio Uchiyama

Address all correspondence to: uchiyama.toshio@do-johodai.ac.jp

Hokkaido Information University, Ebetsu-shi, Hokkaido, Japan

## References


[7] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, "Maximum margin clustering," Advances in Neural Information Processing Systems, vol. 17, pp. 1537–1544, 2004.

[8] I.S. Dhillon, S. Mallela, and R. Kumar, "A divisive information theoretic feature clustering algorithm for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1265–1287, 2003.

[9] R.O. Duda and P.E. Hart, Pattern classification and scene analysis. John Wiley & Sons, New York, 1973.

[10] C.M. Bishop, Pattern recognition and machine learning (Information Science and Statistics), 1st edn. 2006, corr. 2nd printing edn. Springer, New York, 2007.

[11] J. MacQueen, et al., "Some methods for classification and analysis of multivariate observations," In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, Oakland, CA, USA, pp. 281–297, 1967.

[12] T. Uchiyama and M.A. Arbib, "Color image segmentation using competitive learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 12, pp. 1197–1206, 1994.

[13] Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, no. 1, pp. 84–95, 1980.

[14] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, pp. 1027–1035, 2007.

[15] S. Amari, "A theory of adaptive pattern classifiers," IEEE Transactions on Electronic Computers, vol. 16, no. 3, pp. 299–307, 1967.

[16] S.-I. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.

[17] F. Pereira, N. Tishby, and L. Lee, "Distributional clustering of English words," In: Proceedings of the 31st annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp. 183–190, 1993.

[18] L.D. Baker and A.K. McCallum, "Distributional clustering of words for text classification," In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, ACM, pp. 96–103, 1998.

[19] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, ACM, pp. 208–215, 2000.

[20] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[21] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, Citeseer, pp. 41–48, 1998.


**Probability Distributions and Their Applications to Data Analysis**

## **Gamma-Kumaraswamy Distribution in Reliability Analysis: Properties and Applications**

Indranil Ghosh and Gholamhossein G. Hamedani

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66821

#### Abstract


In this chapter, a new generalization of the Kumaraswamy distribution, namely the gamma-Kumaraswamy distribution, is defined and studied. Several distributional properties of the distribution are discussed, which include limiting behavior, mode, quantiles, moments, skewness, kurtosis, Shannon's entropy, and order statistics. Under the classical approach, maximum likelihood estimation is proposed for inference on this distribution. We provide the results of analyses of two real data sets fitted with the gamma-Kumaraswamy distribution to exhibit the utility of this model.

Keywords: gamma-Kumaraswamy distribution, Renyi's entropy, reliability parameter, stochastic ordering, characterizations

## 1. Introduction

The generalization of a distribution by mixing it with another distribution has, over the years, provided a mathematically based way to model a wide variety of random phenomena statistically. These generalized distributions are effective and flexible models for analyzing and interpreting random durations in a possibly heterogeneous population. In many situations, observed data may be assumed to have come from such a mixture population of two or more distributions.

The two-parameter gamma and the two-parameter Kumaraswamy distributions are among the most popular distributions for analyzing lifetime data. The gamma distribution is a well-known distribution, and it has several desirable properties [1].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A serious limitation of the gamma distribution, however, is that its distribution function (or survival function) is not available in closed form if the shape parameter is not an integer, so numerical methods are required to evaluate these quantities. As a consequence, this distribution is less attractive than that of Ref. [2], which has nice, tractable distribution, survival, and hazard functions. In this chapter, we consider a four-parameter gamma-Kumaraswamy distribution. It is observed that it has many properties quite similar to those of a gamma distribution, but it has an explicit expression for the distribution and survival functions. The major motivation of this chapter is to introduce a new family of distributions, to make a comparative study of this family with respect to the Kumaraswamy and gamma families, and to provide the practitioner with an additional option, with the hope that it may have a 'better fit' than a gamma or Kumaraswamy family in certain situations. It is noteworthy that the gamma-Kumaraswamy distribution is a generalization of the Kumaraswamy distribution with the property that it can exhibit various shapes (Figure 1). This provides more flexibility to the gamma-Kumaraswamy distribution in comparison with the Kumaraswamy distribution in modeling different data sets. The property of left-skewness is a rare characteristic, as it is not enjoyed by several generalizations of the Kumaraswamy distribution. Our proposed model is different from that of Ref. [3], where the authors proposed a generalized gamma-generated distribution with an extra positive parameter for any continuous baseline G distribution.

Figure 1. GK density plot for some specific parameter values.

The rest of the paper is organized as follows. In Section 2, we propose the gamma-Kumaraswamy distribution [GK(α, β, a, b)]. In Section 3, we study various properties of the GK(α, β, a, b) including the limiting behavior, transformation, and the mode. In Section 4, the moment generating function, the moments and the mean deviations from the mean and the median, and Renyi's entropy are studied. In Section 5, we consider the maximum likelihood estimation of the GK(α, β, a, b). In Section 6, we provide an expression for the reliability parameter for two independent GK(α, β, a, b) with different choices for the parameters α and β but for a fixed choice of the two shape parameters of Kumaraswamy distribution. In Section 7, discussion is made for the moment generating function of the r-th order statistic and also the limiting distribution of the sample minimum and the sample maximum for a random sample of size n drawn from GK(α, β, a, b). An application of GK(α, β, a, b) is discussed in Section 8. Certain characterizations of GK(α, β, a, b) are presented in Section 9. In Section 10, some concluding remarks are made.

## 2. The gamma-Kumaraswamy distribution

We consider the following gamma-X class of distributions, for which the parent model is

$$f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,\frac{g(x)}{\overline{G}^{2}(x)}\exp\left(-\frac{G(x)}{\beta\,\overline{G}(x)}\right)\left(\frac{G(x)}{\overline{G}(x)}\right)^{\alpha-1}, \quad x > 0, \tag{1}$$

where α and β are positive parameters. Also, $g(x)$ [$G(x)$] is the density function [cumulative distribution function] of the random variable X. Furthermore, $\overline{G}(x) = 1 - G(x)$ is the survival function of the associated random variable X.

If $X$ has density Eq. (1), then the random variable $W = G(X)/\overline{G}(X)$ has a gamma distribution with parameters α, β. The converse is true as well. Here, we consider G(·) to be the cdf of a Kumaraswamy distribution with parameters a, b. Then, the cdf of the gamma-Kumaraswamy (hereafter GK) distribution reduces to

$$F(x) = \int_0^{\frac{1-(1-x^a)^b}{(1-x^a)^b}} \frac{e^{-w/\beta}\,w^{\alpha-1}}{\Gamma(\alpha)\beta^{\alpha}}\,dw = \gamma_1\!\left(\alpha, \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right), \quad 0 < x < 1, \tag{2}$$

where $\gamma_1(\alpha, z) = \frac{\Gamma(\alpha, z)}{\Gamma(\alpha)}$ with $\Gamma(\alpha, x) = \int_0^{x} u^{\alpha-1} e^{-u}\,du$ is the regularized incomplete gamma function. So the density and hazard functions corresponding to Eq. (2) are given, respectively, by

$$f(x) = \frac{ab\exp\left(-\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)}{\Gamma(\alpha)\beta^{\alpha}}\,\frac{1}{(1-x^a)^{2}}\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)^{\alpha-1} x^{a-1}, \quad 0 < x < 1,\tag{3}$$

and


$$h_F(x) = \frac{\left((1-x^a)^{-b}-1\right)^{\alpha-1} ab\,x^{a-1}(1-x^a)^{b-1}\exp\left(-\beta^{-1}\left((1-x^a)^{-b}-1\right)\right)}{\beta^{\alpha}\,(1-x^a)^{b+1}\left(1-\gamma_1\!\left(\alpha, \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)\right)}. \tag{4}$$

The percentile function for the GK distribution: the $p$th percentile $x_p$ is defined by $F(x_p) = p$. From Eq. (2), we have $\gamma_1\!\left(\alpha, \frac{1-(1-x_p^a)^b}{\beta(1-x_p^a)^b}\right) = p$. Define $Z_p = \frac{1-(1-x_p^a)^b}{\beta(1-x_p^a)^b}$; then $Z_p = \gamma_1^{-1}(\alpha, p)$, where $\gamma_1^{-1}$ is the inverse of the regularized incomplete gamma function. Hence, $x_p = \left(1-\left(1+\beta Z_p\right)^{-1/b}\right)^{1/a}$.
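Numerically, the cdf, density, hazard, and percentile functions involve only the regularized incomplete gamma function and its inverse. The following is a minimal sketch in Python with SciPy (the function names are ours, not part of the chapter); the density is evaluated by differentiating the cdf in Eq. (2), that is, by transforming the gamma density, and the percentile function implements the inversion described above.

```python
import numpy as np
from scipy.special import gammainc, gammaincinv, gamma as gamma_fn

def gk_cdf(x, alpha, beta, a, b):
    """GK cdf, Eq. (2): regularized incomplete gamma at ((1 - x^a)^(-b) - 1)/beta."""
    z = ((1.0 - x**a) ** (-b) - 1.0) / beta
    return gammainc(alpha, z)

def gk_pdf(x, alpha, beta, a, b):
    """GK density, obtained by differentiating the cdf in Eq. (2)."""
    u = ((1.0 - x**a) ** (-b) - 1.0) / beta
    dudx = a * b * x ** (a - 1) * (1.0 - x**a) ** (-(b + 1)) / beta
    return np.exp(-u) * u ** (alpha - 1) / gamma_fn(alpha) * dudx

def gk_hazard(x, alpha, beta, a, b):
    """Hazard function h_F(x) = f(x) / (1 - F(x)), cf. Eqs. (4) and (5)."""
    return gk_pdf(x, alpha, beta, a, b) / (1.0 - gk_cdf(x, alpha, beta, a, b))

def gk_ppf(p, alpha, beta, a, b):
    """p-th percentile x_p, inverting Eq. (2) with the inverse regularized incomplete gamma."""
    z = gammaincinv(alpha, p)                                  # Z_p
    return (1.0 - (1.0 + beta * z) ** (-1.0 / b)) ** (1.0 / a)
```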

In the density equation (3), a, b, and α are shape parameters and β is the scale parameter. It can be immediately verified that Eq. (3) is a density function. Plots of the GK density and hazard rate function for selected parameter values are given in Figures 1 and 2, respectively.

Figure 2. GK hazard rate function plot for some specific parameter values.

If X~GK(a, b, α, β), then the survival function of X, S(x) will be

$$S(x) = 1 - \gamma_1\!\left(\alpha, \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right). \tag{5}$$

We simulate the GK distribution by solving the nonlinear equation

$$(1-u) - \gamma_1\!\left(\alpha, \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right) = 0,\tag{6}$$

where u has the uniform (0,1) distribution. Some facts regarding the GK distribution are as follows:

• If $X \sim$ GK(a, b, α, β), then $X^m \sim$ GK(a/m, b, α, β) for any $m > 0$; that is, the family is closed under power transformation.

The first result provides an important property of the GK distribution: it is closed under power transformation. The second, Eq. (6), is equally important because it provides a simple way to generate random variables following the GK distribution.
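A minimal sketch of this simulation route (Python with SciPy; the helper name and the illustrative parameter values are ours): draw a gamma variate and push it through the transformation of Section 2, which is equivalent to solving Eq. (6).

```python
import numpy as np
from scipy.stats import gamma, kstest
from scipy.special import gammainc

def gk_rvs(alpha, beta, a, b, size=1, seed=None):
    """Generate GK(a, b, alpha, beta) variates via the gamma relationship of Lemma 1."""
    w = gamma.rvs(alpha, scale=beta, size=size, random_state=seed)   # W ~ Gamma(alpha, beta)
    return (1.0 - (1.0 + w) ** (-1.0 / b)) ** (1.0 / a)              # X = (1 - (1 + W)^(-1/b))^(1/a)

# Sanity check: compare a sample with the cdf of Eq. (2) via a Kolmogorov-Smirnov test.
alpha_, beta_, a_, b_ = 2.0, 0.5, 3.0, 1.5
x = gk_rvs(alpha_, beta_, a_, b_, size=5000, seed=42)
cdf = lambda t: gammainc(alpha_, ((1.0 - t**a_) ** (-b_) - 1.0) / beta_)
print(kstest(x, cdf))
```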

## 3. Properties of GK distribution

The following lemma establishes the relation between the GK(α, β, a, b) distribution and the gamma distribution.

Lemma 1 (Transformation). If a random variable X follows the GK(α, β, a, b) distribution, then $Y = \frac{1-(1-X^a)^b}{(1-X^a)^b}$ follows a gamma distribution with parameters α and β; conversely, if W follows a gamma distribution with parameters α and β, then $X = \left(1-(1+W)^{-1/b}\right)^{1/a}$ follows the GK(α, β, a, b) distribution.

Proof. The proof follows immediately by using the transformation technique. ∎

The limiting behaviors of the GK pdf and its hazard function are given in the following theorem.

Theorem 1. The limits of the GK density function, f(x), and the hazard function, $h_F(x)$, are given by

$$\lim_{x\to 0^{+}} f(x) = \lim_{x\to 0^{+}} h_F(x) = \begin{cases} 0, & a\alpha > 1, \\ \infty, & a\alpha < 1, \end{cases} \tag{7}$$

$$\lim_{x\to 1^{-}} f(x) = 0, \qquad \lim_{x\to 1^{-}} h_F(x) = \infty. \tag{8}$$

Proof. Straightforward and hence omitted. ∎

Theorem 2. The mode of the GK distribution is the solution of the equation $k(x) = 0$, where

$$k(\mathbf{x}) = (a - 1) - \frac{2\mathbf{x}^a}{\left(1 - \mathbf{x}^a\right)} + \frac{ab\mathbf{x}^a}{\left(1 - \mathbf{x}^a\right)^b} \left(\boldsymbol{\beta}^{-1} + \left(\frac{1 - \left(1 - \mathbf{x}^a\right)^b}{\beta \left(1 - \mathbf{x}^a\right)^b}\right)^{-1}\right). \tag{9}$$

Proof. The derivative of f(x) in Eq. (3) can be written as

$$\frac{\partial}{\partial x} f(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,\frac{ab\,x^{a-2}}{(1-x^a)^{2}}\exp\left(-\beta^{-1}\left((1-x^a)^{-b}-1\right)\right)\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)^{\alpha-1} k(x). \tag{10}$$

The critical values of Eq. (10) are the solutions of $k(x) = 0$. ∎

Next, we discuss the IFR and/or DFR property of the hazard function for the GK distribution. For this, we will use the result of Lemma 1. According to Lemma 1, if $X \sim$ GK(a, b, α, β), then $Y = \frac{1-(1-X^a)^b}{(1-X^a)^b} \sim \mathrm{Gamma}(\alpha, \beta)$. In such a case, the hazard rate function of the random variable Y can be written as

$$\begin{split} \frac{1}{r(t)} &= \frac{1-F(t)}{f(t)} \\ &= \frac{\int_t^{\infty} \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\, w^{\alpha-1}\exp(-w/\beta)\, dw}{\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\, t^{\alpha-1}\exp(-t/\beta)} \\ &= \int_t^{\infty} \left(\frac{w}{t}\right)^{\alpha-1}\exp\left(-(w-t)/\beta\right) dw \\ &= \int_0^{\infty} \left(1+\frac{u}{t}\right)^{\alpha-1}\exp\left(-u/\beta\right) du. \end{split} \tag{11}$$

Therefore, $r(t) = \left(\int_0^{\infty}\left(1+\frac{u}{t}\right)^{\alpha-1}\exp(-u/\beta)\,du\right)^{-1}$. If $\alpha > 1$, $\left(1+\frac{u}{t}\right)^{\alpha-1}$ is decreasing in $t$ and hence $r(t)$ is increasing, so the distribution has an IFR (increasing failure rate). If $0 < \alpha < 1$, then $\left(1+\frac{u}{t}\right)^{\alpha-1}$ is increasing in $t$, so $r(t)$ decreases and the distribution has a DFR (decreasing failure rate). Now, since X is a one-to-one function of Y, the hazard rate function of X will also follow the exact same pattern.

Let X and Y be two random variables. X is said to be stochastically greater than or equal to Y, denoted by $X \geq_{st} Y$, if $P(X > x) \geq P(Y > x)$ for all $x$ in the support set of $X$.

Theorem 3. Suppose $X \sim \mathrm{GK}(a_1, b_1, \alpha, \beta_1)$ and $Y \sim \mathrm{GK}(a_2, b_2, \alpha, \beta_2)$. If $\beta_1 > \beta_2$, $a_1 > a_2$, and $b_1 < b_2$, then $X \geq_{st} Y$ for integer values of $a_1$ and $a_2$.

Proof. At first, we note that the incomplete gamma function $\Gamma(\alpha, x)$ is an increasing function of $x$ for fixed α. For any real number $x \in (0, 1)$, $\beta_1 > \beta_2$, $a_1 > a_2$, and $b_1 < b_2$, we have

$$
\beta\_1^{-1} \left( (1 - \mathbf{x}^{a\_1})^{b\_1} - 1 \right) \le \beta\_2^{-1} \left( (1 - \mathbf{x}^{a\_2})^{b\_2} - 1 \right). \tag{12}
$$

This implies that $\Gamma\left(\alpha, \beta_1^{-1}\left((1-x^{a_1})^{b_1}-1\right)\right) \leq \Gamma\left(\alpha, \beta_2^{-1}\left((1-x^{a_2})^{b_2}-1\right)\right)$. Equivalently, it implies that $P(X > x) \geq P(Y > x)$, and this completes the proof. ∎

Note: For fractional choices of a<sup>1</sup> and a2, the reverse of the above inequality will hold.

## 4. Moments and mean deviations

For any r ≥ 1,


$$\begin{split} E(X^r) &= \int_0^1 x^r f(x)\, dx \\ &= \frac{1}{\Gamma(\alpha)} \int_0^{\infty} \exp(-u)\, u^{\alpha-1} \left(1-(1+u\beta)^{-1/b}\right)^{r/a} du \qquad \left(\text{on substituting } u = \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right) \\ &= \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{\infty} (-1)^j \binom{r/a}{j} \int_0^{\infty} \exp(-u)\, u^{\alpha-1} (1+u\beta)^{-j/b}\, du \\ &= \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} (-1)^{j+k} \binom{r/a}{j}\binom{j/b+k-1}{k} \beta^k \int_0^{\infty} \exp(-u)\, u^{\alpha+k-1}\, du \\ &= \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} (-1)^{j+k} \binom{r/a}{j}\binom{j/b+k-1}{k} \beta^k\, \Gamma(\alpha+k). \end{split} \tag{13}$$

Upper bounds for the $r$-th order moment: since $\binom{n}{k} \leq \frac{n^k}{k!}$ for $1 \leq k \leq n$, from Eq. (13) one can write $E(X^r) \leq \frac{1}{\Gamma(\alpha)}\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}(-1)^{j+k}\,\beta^{k}\,\frac{(r/a)^{j}}{j!}\,\frac{(j/b+k-1)^{k}}{k!}\,\Gamma(\alpha+k)$, provided $r/a$ and $j/b+k-1$ are both integers. Employing successively the generalized series expansion of $\left(1-(1+\beta u)^{-1/b}\right)^{1/a}$, the characteristic function for $X \sim \mathrm{GK}(a, b, \alpha, \beta)$ will be given by [from Eq. (3)]

$$\begin{split} \phi_X(t) &= \int_0^1 e^{itx} f(x)\, dx \\ &= \frac{1}{\Gamma(\alpha)} \int_0^{\infty} u^{\alpha-1} e^{-u} \exp\left(it\left(1-(1+\beta u)^{-1/b}\right)^{1/a}\right) du \qquad \left(\text{on substituting } u = \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right) \\ &= \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{\infty} \int_0^{\infty} \frac{\left(it\left(1-(1+\beta u)^{-1/b}\right)^{1/a}\right)^{j}}{j!}\, u^{\alpha-1} e^{-u}\, du \\ &= \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{\infty} \sum_{k_1=0}^{\infty} \sum_{k_2=0}^{\infty} (-1)^{k_1+k_2}\, \beta^{k_2}\, \frac{(it)^j}{j!}\, \binom{j/a}{k_1}\binom{k_1/b}{k_2}\, \Gamma(\alpha+k_2). \end{split} \tag{14}$$

If j/a and k1/b are integers then in Eq. (14), the second and third summations will stop at j/a and k1/b, respectively.
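Because the series in Eqs. (13) and (14) rely on generalized binomial expansions, an independent numerical check is convenient. The sketch below (Python with SciPy; the helper name and parameter values are ours) computes $E(X^r)$ directly by numerical integration of $x^r f(x)$ over (0, 1), with the density obtained from the gamma representation of Section 2.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

def gk_moment(r, alpha, beta, a, b):
    """E(X^r) for the GK(a, b, alpha, beta) distribution by numerical quadrature."""
    def integrand(x):
        u = ((1.0 - x**a) ** (-b) - 1.0) / beta
        dudx = a * b * x ** (a - 1) * (1.0 - x**a) ** (-(b + 1)) / beta
        pdf = np.exp(-u) * u ** (alpha - 1) / gamma_fn(alpha) * dudx
        return x**r * pdf
    value, _ = quad(integrand, 0.0, 1.0)
    return value

# The zeroth moment should equal 1 (the density integrates to one); r = 1 gives the mean.
print(gk_moment(0, alpha=2.0, beta=0.5, a=3.0, b=1.5))
print(gk_moment(1, alpha=2.0, beta=0.5, a=3.0, b=1.5))
```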

If we denote the median by T, then the mean deviation from the mean, $D(\mu)$, and the mean deviation from the median, $D(T)$, can be written as

$$D(\mu) = E|X - \mu| = 2\mu F(\mu) - 2\int_{0}^{\mu} x f(x)\,dx, \tag{15}$$

$$D(T) = E|X - T| = \mu - 2\int_{0}^{T} x f(x)\,dx. \tag{16}$$

Now, consider

$$\begin{aligned} I\_t &= \int\_0^t xf(x)dx\\ &= \int\_0^t xab \frac{\exp\left(-\frac{1-(1-\mathbf{x}^a)^b}{\beta\left(1-\mathbf{x}^a\right)^b}\right)}{\Gamma(\alpha)\beta^{\alpha}}\\ &\frac{1}{\left(1-\mathbf{x}^a\right)^2} \left(\frac{1-\left(1-\mathbf{x}^a\right)^b}{\beta\left(1-\mathbf{x}^a\right)^b}\right)^{\alpha-1}\mathbf{x}^{a-1}d\mathbf{x}. \end{aligned} \tag{17}$$

Using the substitution $u = \frac{1-(1-x^a)^b}{\beta(1-x^a)^b}$ in Eq. (17), we obtain

$$I_t = \frac{1}{\Gamma(\alpha)}\int_0^{\frac{1-(1-t^a)^b}{\beta(1-t^a)^b}}\left(1-(1+u\beta)^{-1/b}\right)^{1/a} u^{\alpha-1} e^{-u}\,du = \frac{1}{\Gamma(\alpha)}\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}(-1)^{j+k}\,\beta^{k}\binom{1/a}{j}\binom{j/b+k-1}{k}\,\Gamma\!\left(\alpha+k, \frac{1-(1-t^a)^b}{\beta(1-t^a)^b}\right), \tag{18}$$

where we successively used the binomial series expansion.

By using Eqs. (2) and (18), the mean deviation from the mean and the mean deviation from the median are, respectively, given by

$$D(\mu) = 2\mu\,\frac{\Gamma\!\left(\alpha, \frac{1-(1-\mu^a)^b}{\beta(1-\mu^a)^b}\right)}{\Gamma(\alpha)} - 2I_{\mu}, \qquad D(T) = \mu - 2I_{T}. \tag{19}$$

#### 4.1. Entropy

One useful measure of diversity for a probability model is given by Renyi's entropy. It is defined as $I_R(\rho) = (1-\rho)^{-1}\log\left(\int f^{\rho}(x)\,dx\right)$, where $\rho > 0$ and $\rho \neq 1$. If a random variable X has a GK distribution, then we have

$$f^{\rho}(x) = \left(\frac{ab}{\Gamma(\alpha)\beta^{\alpha}}\right)^{\rho}\exp\left(-\frac{\rho\left(1-(1-x^a)^b\right)}{\beta(1-x^a)^b}\right)\times\frac{1}{(1-x^a)^{2\rho}}\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)^{\rho(\alpha-1)} x^{\rho(a-1)}.\tag{20}$$

Next, consider the integral


$$\int_0^1 f^{\rho}(x)\,dx = \left(\frac{ab}{\Gamma(\alpha)\beta^{\alpha}}\right)^{\rho}\rho^{-1}\int_0^{\infty} u^{\alpha-1}\exp(-u)\left(1-\left(1+\beta u^{1/\rho}\right)^{-1/b}\right)^{1/a} du \qquad \left(\text{on substituting } u = \left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)^{\rho}\right). \tag{21}$$

Now, using successive application of the generalized binomial expansion, we can write

$$\left(\mathbf{1} - (\mathbf{1} + \beta \mathbf{u}^{1/\rho})^{-1/b}\right)^{1/a} = (-1)^{\rho - 1} \sum\_{j=0}^{\infty} \sum\_{k=0}^{\infty} (-1)^j \binom{\rho - 1}{j} \binom{1/b + k - 1}{k} \beta^k \mathbf{u}^{k/\rho}.\tag{22}$$

Hence, the integral in Eq. (21) reduces to


$$\int_{0}^{1} f^{\rho}(x)\,dx = \left(\frac{ab}{\Gamma(\alpha)\beta^{\alpha}}\right)^{\rho}(-1)^{\rho-1}\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}(-1)^{j}\rho^{\rho\alpha-k}\binom{\rho-1}{j}\binom{1/b+k-1}{k}\beta^{k}\,\Gamma(\rho\alpha+k) = \delta(\rho, \alpha, \beta, a, b),\ \text{say}. \tag{23}$$

Therefore, the expression for Renyi's entropy is

$$I\_R(\rho) = (1 - \rho)^{-1} \log \left( \delta(\rho, \alpha, \beta, a, b) \right) \tag{24}$$
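A direct numerical evaluation of $\int f^{\rho}(x)\,dx$ provides a useful cross-check on Eq. (24). The sketch below is illustrative only; it uses the GK density implied by the Gamma(α, 1) substitution of Eq. (14) and assumed parameter values.

```r
## Renyi entropy of order rho computed by numerical integration (illustrative values).
alpha <- 2; beta <- 0.5; a <- 2; b <- 2; rho <- 1.5
dgk <- function(x) {
  L <- (1 - (1 - x^a)^b) / (beta * (1 - x^a)^b)
  dgamma(L, shape = alpha) * (a * b / beta) * x^(a - 1) * (1 - x^a)^(-(b + 1))
}
I_rho <- integrate(function(x) dgk(x)^rho, 0, 1)$value
(1 / (1 - rho)) * log(I_rho)                       # I_R(rho)
```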

#### 5. Maximum likelihood estimation

In this section, we address the parameter estimation of the GK(α, β, a, b) distribution under the classical setup. Let X1, X2, …, Xn be a random sample of size n drawn from the density Eq. (3). The log-likelihood function is given by

$$\begin{split} \ell &= -n\alpha\log\beta - n\log\Gamma(\alpha) + n\log a + n\log b + (a-1)\sum_{i=1}^{n}\log X_i \\ &\quad - \sum_{i=1}^{n}\frac{1-(1-X_i^a)^b}{\beta(1-X_i^a)^b} - 2\sum_{i=1}^{n}\log(1-X_i^a) + (\alpha-1)\sum_{i=1}^{n}\log\left(\frac{1-(1-X_i^a)^b}{\beta(1-X_i^a)^b}\right). \end{split} \tag{25}$$

The derivatives of Eq. (25) with respect to α, β, a, and b are given by

$$\frac{\partial}{\partial\alpha}\ell = -n\log\beta - n\Psi(\alpha) + \sum_{i=1}^{n}\log\left(\frac{1-(1-X_i^a)^b}{\beta(1-X_i^a)^b}\right), \tag{26}$$

where $\Psi(\alpha) = \frac{\partial}{\partial\alpha}\log\Gamma(\alpha)$,

$$\frac{\partial}{\partial \beta} \ell = -\frac{\alpha}{\beta} + \beta^{-2} \sum\_{i=1}^{n} \left( \frac{1 - (1 - \mathbf{X}\_i^a)^b}{\beta (1 - \mathbf{X}\_i^a)^b} - (\alpha - 1) \log \left( \frac{1 - (1 - \mathbf{X}\_i^a)^b}{\beta (1 - \mathbf{X}\_i^a)^b} \right) \right). \tag{27}$$

$$\begin{split} \frac{\partial}{\partial a}\ell &= \frac{n}{a} + \sum_{i=1}^{n}\log X_i + 2\sum_{i=1}^{n}\frac{X_i^a\log X_i}{1-X_i^a} \\ &\quad + \frac{b(\alpha-1)}{\beta}\sum_{i=1}^{n}\left(\frac{1-(1-X_i^a)^b}{\beta(1-X_i^a)^b}\right)^{-1}\frac{X_i^a\log X_i}{(1-X_i^a)^{b+1}} - \frac{b}{\beta}\sum_{i=1}^{n}\frac{X_i^a\log X_i}{(1-X_i^a)^{b+1}}, \end{split} \tag{28}$$

$$\frac{\partial}{\partial b}\ell = \frac{n}{b} + \frac{1}{\beta} \left( -1 + \sum\_{i=1}^{n} \log(1 - \mathbf{X}\_i^a) \left( 1 - \left( \frac{\alpha - 1}{\beta} \right) \frac{1 - \left( 1 - \mathbf{X}\_i^a \right)^b}{\left( 1 - \mathbf{X}\_i^a \right)^b} \right) \right). \tag{29}$$

The MLEs $\hat{\alpha}$, $\hat{\beta}$, $\hat{a}$, and $\hat{b}$ are obtained by setting Eqs. (26)–(29) to zero and solving them simultaneously.

To estimate the model parameters, numerical iterative techniques must be used to solve these equations. We may investigate the global maximum of the log-likelihood by setting different starting values for the parameters. The information matrix is required for interval estimation. The elements of the 4 × 4 total observed information matrix (used because expected values are difficult to calculate), $J(\theta) = \{J_{r,s}(\theta)\}$ for $r, s = \alpha, \beta, a, b$, where $\theta = (\alpha, \beta, a, b)$, can be obtained from the authors on request. Under the regularity conditions, the asymptotic distribution of $(\hat{\theta} - \theta)$ is $N_4(0, K(\theta)^{-1})$, where $K(\theta) = E[J(\theta)]$ is the expected information matrix and $J(\hat{\theta})^{-1}$ is the inverse of the observed information matrix. The multivariate normal $N_4(0, K(\theta)^{-1})$ distribution can be used to construct approximate confidence intervals for the individual parameters.
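In practice, the observed information matrix can be obtained numerically rather than analytically. The following sketch is not the authors' code: it fits the GK model by maximizing the log-likelihood with optim(), takes the numerically evaluated Hessian as the observed information, and forms approximate Wald confidence intervals. The sampler, log-density, and parameter values follow the Gamma(α, 1) substitution of Eq. (14) and are assumptions made for illustration.

```r
## MLE fit with numerically computed observed information and 95% Wald intervals.
set.seed(123)
rgk  <- function(n, alpha, beta, a, b) {
  u <- rgamma(n, shape = alpha, scale = 1)
  (1 - (1 + beta * u)^(-1 / b))^(1 / a)
}
ldgk <- function(x, alpha, beta, a, b) {             # log-density implied by Eq. (14)'s substitution
  L <- (1 - (1 - x^a)^b) / (beta * (1 - x^a)^b)
  dgamma(L, shape = alpha, log = TRUE) +
    log(a * b / beta) + (a - 1) * log(x) - (b + 1) * log(1 - x^a)
}

x   <- rgk(1000, alpha = 2, beta = 0.5, a = 2, b = 2)
nll <- function(th) {
  if (any(th <= 0)) return(1e10)                     # keep the search in the valid region
  v <- -sum(ldgk(x, th[1], th[2], th[3], th[4]))
  if (is.finite(v)) v else 1e10
}
fit <- optim(c(2, 0.5, 2, 2), nll, hessian = TRUE)   # Hessian of -loglik = observed information
se  <- sqrt(diag(solve(fit$hessian)))                # may be ill-conditioned for small samples
cbind(estimate = fit$par, lower = fit$par - 1.96 * se, upper = fit$par + 1.96 * se)
```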

#### 5.1. Simulation study

In order to assess the performance of the MLEs, a small simulation study is performed using the statistical software R, through the package stats4 and the command mle. The number of Monte Carlo replications was 20,000. For maximizing the log-likelihood function, we use the MaxBFGS subroutine with analytical derivatives. The evaluation of the estimates is based on the following quantities for each sample size: the empirical mean squared errors (MSEs) are calculated in R from the Monte Carlo replications. The MLEs are determined for each simulated data set, say $(\hat{\alpha}_i, \hat{\beta}_i, \hat{a}_i, \hat{b}_i)$ for $i = 1, 2, \dots, 20{,}000$, and the biases and MSEs are computed by

$$\mathrm{bias}_h(n) = \frac{1}{20000}\sum_{i=1}^{20000}\left(\hat{h}_i - h\right), \tag{30}$$

and

$$MSE\_h(n) = \frac{1}{20000} \sum\_{i=1}^{20000} \left(\hat{h}\_i - h\right)^2,\tag{31}$$

for $h = \alpha, \beta, a, b$. We consider sample sizes n = 100, 200, and 500 and different values for the parameters. The empirical results are given in Table 1. The figures in Table 1 indicate that the estimates are quite stable and, more importantly, are close to the true values for these sample sizes. Furthermore, as the sample size increases, the MSEs decrease, as expected.
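A condensed version of this Monte Carlo experiment can be written as follows. This sketch is not the authors' simulation code: it uses optim() instead of stats4::mle with MaxBFGS, a much smaller replicate count, and a single assumed parameter configuration, with the GK sampler and log-density taken from the Gamma(α, 1) substitution of Eq. (14).

```r
## Reduced bias/MSE simulation in the spirit of Eqs. (30)-(31); all settings are illustrative.
set.seed(1)
true <- c(alpha = 2, beta = 0.5, a = 2, b = 2)

rgk  <- function(n, alpha, beta, a, b) {
  u <- rgamma(n, shape = alpha, scale = 1)
  (1 - (1 + beta * u)^(-1 / b))^(1 / a)
}
ldgk <- function(x, alpha, beta, a, b) {
  L <- (1 - (1 - x^a)^b) / (beta * (1 - x^a)^b)
  dgamma(L, shape = alpha, log = TRUE) +
    log(a * b / beta) + (a - 1) * log(x) - (b + 1) * log(1 - x^a)
}
fit_gk <- function(x) {
  nll <- function(p) {
    v <- -sum(ldgk(x, exp(p[1]), exp(p[2]), exp(p[3]), exp(p[4])))
    if (is.finite(v)) v else 1e10
  }
  exp(optim(log(true), nll)$par)                     # start at the true values for simplicity
}

R <- 200; n <- 200                                   # reduced from 20,000 replications
est  <- t(replicate(R, fit_gk(rgk(n, true[1], true[2], true[3], true[4]))))
bias <- colMeans(est) - true                                        # Eq. (30)
mse  <- colMeans((est - matrix(true, R, 4, byrow = TRUE))^2)        # Eq. (31)
round(rbind(bias, mse), 4)
```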



Table 1. Bias and MSE of the estimates under the maximum likelihood method.

#### 6. Reliability parameter


The reliability parameter R is defined as R = P(X > Y), where X and Y are independent random variables. For a detailed study of possible applications of the reliability parameter, the interested reader is referred to Refs. [4, 5]. If X and Y are two continuous and independent random variables with cdfs $F_1(x)$ and $F_2(y)$ and pdfs $f_1(x)$ and $f_2(y)$, respectively, then the reliability parameter R can be written as

$$R = P(X > Y) = \int\_{-\infty}^{\infty} F\_2(t) f\_1(t) dt. \tag{32}$$

Theorem 4. Let X ~ GK(a, b, α1, β1) and Y ~ GK(a, b, α2, β2), then

$$R = \sum\_{p=0}^{\infty} \frac{(-1)^p}{p!(\alpha\_2 + p)\Gamma(\alpha\_1)} \left(\frac{\beta\_1}{\beta\_2}\right)^{p+\alpha\_2} \Gamma(\alpha\_1 + \alpha\_2 + p). \tag{33}$$

Proof: From Eqs. (2) and (3), we have

$$R = \int_0^1 \gamma_1\!\left(\alpha_2,\ \frac{1-(1-t^a)^b}{\beta_2(1-t^a)^b}\right)\frac{ab\,\exp\left(-\frac{1-(1-t^a)^b}{\beta_1(1-t^a)^b}\right)}{\Gamma(\alpha_1)\beta_1^{\alpha_1}}\times\frac{1}{(1-t^a)^2}\left(\frac{1-(1-t^a)^b}{\beta_1(1-t^a)^b}\right)^{\alpha_1-1} t^{a-1}\,dt. \tag{34}$$

Using the series expansion for the incomplete gamma function, $\gamma_1(k, x) = x^k\sum_{p=0}^{\infty}\frac{(-x)^p}{p!(k+p)}$, and using the substitution $u = \frac{1-(1-t^a)^b}{\beta_1(1-t^a)^b}$, Eq. (34) reduces to

$$\begin{split} R &= \sum_{p=0}^{\infty}\frac{(-1)^p}{p!(\alpha_2+p)\Gamma(\alpha_1)}\left(\frac{\beta_1}{\beta_2}\right)^{p+\alpha_2}\int_0^{\infty} u^{\alpha_1+\alpha_2+p-1}\exp(-u)\,du \\ &= \sum_{p=0}^{\infty}\frac{(-1)^p}{p!(\alpha_2+p)\Gamma(\alpha_1)}\left(\frac{\beta_1}{\beta_2}\right)^{p+\alpha_2}\Gamma(\alpha_1+\alpha_2+p). \end{split} \tag{35}$$

Hence the proof. ∎
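The series in Eq. (33) can be checked quickly by simulation. The sketch below is illustrative only, with assumed parameter values; it estimates R = P(X > Y) by drawing both GK variables through the Gamma(α, 1) substitution of Eq. (14).

```r
## Monte Carlo estimate of the reliability parameter R = P(X > Y) of Theorem 4.
set.seed(42)
rgk <- function(n, alpha, beta, a, b) {
  u <- rgamma(n, shape = alpha, scale = 1)
  (1 - (1 + beta * u)^(-1 / b))^(1 / a)
}
a <- 2; b <- 2                                      # common shape parameters, as in Theorem 4
x <- rgk(1e5, alpha = 2.0, beta = 0.5, a, b)        # X ~ GK(a, b, alpha1, beta1)
y <- rgk(1e5, alpha = 1.5, beta = 0.8, a, b)        # Y ~ GK(a, b, alpha2, beta2)
mean(x > y)                                         # estimate of R
```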

#### 7. Order statistics

Here, we derive the general r-th order statistic and the large sample distribution of the sample minimum and the sample maximum based on a random sample of size n from the GK(α, β, a, b) distribution. The corresponding density function of the r-th order statistic, $X_{r:n}$, from Eq. (3) will be

$$f_{X_{r:n}}(x) = \frac{1}{B(r, n-r+1)}\,(F(x))^{r-1}\,(1-F(x))^{n-r}\, f(x)$$

$$= \frac{f(x)}{B(r, n-r+1)}\sum_{j=0}^{r-1}(-1)^j\binom{r-1}{j}\left(\frac{\Gamma\!\left(\alpha,\ \beta^{-1}\frac{1-(1-x^a)^b}{(1-x^a)^b}\right)}{\Gamma(\alpha)}\right)^{n-r+j}\times I(0 < x < 1). \tag{36}$$

Using the series expression for the incomplete gamma function, $\gamma_1(\alpha, x) = \sum_{k=0}^{\infty}\frac{e^{-x}\,x^{\alpha+k}}{\alpha(\alpha+1)\cdots(\alpha+k)}$, the pdf of $X_{r:n}$ can be written as

$$\begin{split} f_{r:n}(x) &= \frac{f(x)}{B(r,n-r+1)}\sum_{j=0}^{r-1}(-1)^j\binom{r-1}{j}\left(\sum_{k=0}^{\infty}\frac{\exp\left(-\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)^{\alpha+k}}{\Gamma(\alpha)\,\alpha(\alpha+1)\cdots(\alpha+k)}\right)^{n-r+j} \\ &= \frac{f(x)}{B(r,n-r+1)}\sum_{j=0}^{r-1}\sum_{k_1=0}^{\infty}\cdots\sum_{k_{n-r+j}=0}^{\infty}(-1)^{j+s_k}\binom{r-1}{j}\exp\left(-(n-r+j)\,\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right) \\ &\qquad\times\frac{\left(\frac{1-(1-x^a)^b}{(1-x^a)^b}\right)^{s_k+(n-r+j)\alpha}}{\left(\Gamma(\alpha)\right)^{n-r+j}\,\beta^{s_k+(n-r+j)\alpha}\,p_k} \\ &= \frac{1}{B(r,n-r+1)}\sum_{j=0}^{r-1}\sum_{k_1=0}^{\infty}\cdots\sum_{k_{n-r+j}=0}^{\infty}(-1)^{j+s_k}\binom{r-1}{j}\frac{\Gamma\left(s_k+(n-r+j)\alpha\right)}{\left(\Gamma(\alpha)\right)^{n-r+j}\,p_k}\,f\!\left(x\,\middle|\,s_k+(n-r+j)\alpha,\ \beta,\ a,\ b\right), \end{split} \tag{37}$$

$$\text{where } \mathbf{s}\_k = \sum\_{i=1}^{n-r+j} k\_i \text{ and } p\_k = \prod\_{i=1}^{n-r+j} (k\_i + \alpha).$$

From Eq. (37), it is interesting to note that the pdf of the r-th order statistic $X_{r:n}$ can be expressed as an infinite sum of GK pdfs.

## 8. Application

Here, we consider two well-known illustrative data sets which are used to show the efficacy of the GK distribution. Details on these two data sets can be found in Refs. [6, 7]. The second data set, given in Table 2, is from Ref. [8] and represents the fatigue life of 6061-T6 aluminum coupons cut parallel to the direction of rolling and oscillated at 18 cycles per second. The GK distribution is fitted to the first data set and the results are compared with those of the Kumaraswamy, gamma-uniform [9], and beta-Pareto [10] distributions. These results are reported in Table 3 and show that the gamma-uniform and GK distributions provide an adequate fit to the data. Figure 3 displays the empirical and fitted cumulative distribution functions and supports the results in Table 3. A close look at Figure 3 indicates that the GK distribution provides a better fit to the left tail than the gamma-uniform distribution, owing to the fact that the GK distribution can have a longer left tail (Figure 3).


Table 2. Fatigue life of 6061-T6 aluminum data.

In addition, to check the goodness-of-fit of all statistical models, several other goodness-of-fit statistics are used; these are computed using the computational package Mathematica. The MLEs are computed using the NMaximize routine, as are the measures of goodness of fit, including the log-likelihood function evaluated at the MLEs (ℓ), the Akaike information criterion (AIC), the corrected Akaike information criterion (AICC), the consistent Akaike information criterion (CAIC), the Anderson-Darling statistic (A\*), the Cramer-von Mises statistic (W\*), and the Kolmogorov-Smirnov (K-S) statistic with its p value, to compare the fitted models. These statistics are used to evaluate how closely a specific distribution with cdf (2) fits the corresponding empirical distribution for a given data set. The distribution with a better fit than the others will be the one having the smallest statistics and the largest p value; alternatively, the distribution for which one obtains the smallest of each of these criteria (i.e., AIC, AICC, K-S, etc.) will be the most suitable one. The mathematical expressions of these statistics are given by


$$\bullet\quad AIC = -2\ell(\hat{\theta}) + 2q,$$

$$\bullet\quad AICC = AIC + \frac{2q(q+1)}{n-q-1},$$

$$\bullet\quad CAIC = -2\ell(\hat{\theta}) + \frac{2qn}{n-q-1},$$

$$\bullet\quad A_0^* = \left(\frac{2.25}{n^2}+\frac{0.75}{n}+1\right)\left(-n-\frac{1}{n}\sum_{i=1}^{n}(2i-1)\log\left(z_i\left(1-z_{n-i+1}\right)\right)\right),$$

$$\bullet\quad W_0^* = \left(\frac{0.5}{n}+1\right)\left[\sum_{i=1}^{n}\left(z_i-\frac{2i-1}{2n}\right)^2+\frac{1}{12n}\right],$$

$$\bullet\quad K\text{-}S = \max_{1\le i\le n}\left(\frac{i}{n}-z_i,\ z_i-\frac{i-1}{n}\right),$$


Table 3. Goodness of fit of deep-groove ball bearings data.

where $\ell(\hat{\theta})$ denotes the log-likelihood function evaluated at the maximum likelihood estimates, q is the number of parameters, n is the sample size, and $z_i = \mathrm{cdf}(y_i)$, the $y_i$'s being the ordered observations.
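For reference, the statistics listed above are straightforward to compute once the ordered values $z_i = \mathrm{cdf}(y_i)$ are available. The helper below is a sketch, not the Mathematica code used by the authors; the `cdf` and `loglik` arguments are placeholders for whichever fitted model is being assessed.

```r
## Goodness-of-fit statistics from a fitted cdf, a log-likelihood value, and q parameters.
gof_stats <- function(y, loglik, q, cdf, ...) {
  n <- length(y)
  z <- cdf(sort(y), ...)                             # z_i for the ordered observations
  i <- seq_len(n)
  A2 <- (2.25 / n^2 + 0.75 / n + 1) *
        (-n - mean((2 * i - 1) * log(z * (1 - rev(z)))))
  W2 <- (0.5 / n + 1) * (sum((z - (2 * i - 1) / (2 * n))^2) + 1 / (12 * n))
  KS <- max(pmax(i / n - z, z - (i - 1) / n))
  c(AIC  = -2 * loglik + 2 * q,
    AICC = -2 * loglik + 2 * q + 2 * q * (q + 1) / (n - q - 1),
    CAIC = -2 * loglik + 2 * q * n / (n - q - 1),
    A.star = A2, W.star = W2, K.S = KS)
}
```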

Lieblein and Zelen [6] proposed a five-parameter beta generalized Pareto distribution, fitted it to the data in Table 4, and compared the result with the beta-Pareto and other known distributions. The results of fitting the beta generalized Pareto and beta-Pareto distributions from Ref. [8] are reported in Table 4, along with the results of fitting the Pareto (IV) and GK distributions to the data. The K-S value in Table 4 indicates that the GK distribution provides the best fit. The fact that the GK distribution has fewer parameters than the beta generalized Pareto and beta-Pareto distributions gives it an additional advantage. Figure 4 displays the empirical and fitted cumulative distribution functions; this figure supports the results in Table 4.

Figure 3. cdf for fitted distributions of the endurance of deep-groove ball bearings data.


Table 4. Parameter estimates for the fatigue life of 6061-T6 aluminum coupons data.

Figure 4. cdf for fitted distributions of the fatigue life of 6061-T6 Aluminum data.

## 9. Characterization of GK distribution

In this section, we present characterizations of the GK distribution in terms of the ratio of two truncated moments. For previous work in this direction, we refer the interested reader to Glänzel [11–14] and Hamedani [15–17]. For our characterization results, we employ a theorem due to Glänzel [11]; see that reference for further details. The advantage of the characterizations given here is that the cdf F need not have a closed form. We present here a corollary as a direct application of the theorem discussed in detail in Ref. [11].

Corollary 1. Let $X : \Omega \to (0,1)$ be a continuous random variable and let $h(x) = \beta^{\alpha-1}(1-x^a)^{b(\alpha-2)+1}\left[1-(1-x^a)^b\right]^{1-\alpha}$ and $g(x) = h(x)\exp\left(-\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)$ for $x \in (0,1)$. Then X has pdf (3) if and only if the function η defined in Theorem 5 has the form

$$\eta(\mathbf{x}) = \frac{1}{2} \exp\left(-\frac{1 - (1 - \mathbf{x}^a)^b}{\beta (1 - \mathbf{x}^a)^b}\right), \quad 0 < \mathbf{x} < 1. \tag{38}$$

Proof. Let X have pdf (3); then

$$\left(1 - F(x)\right)E[h(X)\mid X \ge x] = \frac{1}{\beta^{\alpha-1}\Gamma(\alpha)}\exp\left(-\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right), \quad 0 < x < 1, \tag{39}$$

and

$$\left(1 - F(x)\right)E[g(X)\mid X \ge x] = \frac{1}{2\beta^{\alpha-1}\Gamma(\alpha)}\exp\left\{-2\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)\right\}, \quad 0 < x < 1, \tag{40}$$

and finally

$$\eta(\mathbf{x})h(\mathbf{x}) - \mathbf{g}(\mathbf{x}) = -\frac{h(\mathbf{x})}{2} \exp\left(-\frac{1 - \left(1 - \mathbf{x}^a\right)^b}{\beta \left(1 - \mathbf{x}^a\right)^b}\right) < 0, \quad \text{for } 0 < \mathbf{x} < 1. \tag{41}$$

Conversely, if η is given as above, then

$$s'(x) = \frac{\eta'(x)\,h(x)}{\eta(x)\,h(x) - g(x)} = \frac{ab}{\beta}\, x^{a-1}(1-x^a)^{-(b+1)}, \quad 0 < x < 1, \tag{42}$$

and hence


$$\mathbf{s}(\mathbf{x}) = \frac{1}{\beta} (\mathbf{1} \mathbf{-} \mathbf{x}^a)^{-b}, \quad 0 < \mathbf{x} < 1. \tag{43}$$

Now, in view of Theorem 5, X has pdf (3).

Corollary 2. Let $X : \Omega \to (0,1)$ be a continuous random variable and let h(x) be as in Corollary 1. Then, X has pdf (3) if and only if there exist functions g and η defined in Theorem 5 satisfying the differential equation

$$\frac{\eta'(\mathbf{x})h(\mathbf{x})}{\eta(\mathbf{x})h(\mathbf{x}) - \mathbf{g}(\mathbf{x})} = \frac{ab}{\beta} \mathbf{x}^{a-1} (\mathbf{1} - \mathbf{x}^a)^{-(b+1)}, \quad 0 < \mathbf{x} < 1. \tag{44}$$

Remark 1. (a) The general solution of the differential equation in Corollary 2 is

$$\eta(x) = \exp\left(\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)\left[-\int\frac{ab}{\beta}\, x^{a-1}(1-x^a)^{-(b+1)}\exp\left(-\frac{1-(1-x^a)^b}{\beta(1-x^a)^b}\right)\left(h(x)\right)^{-1} g(x)\,dx + D\right], \tag{45}$$

for 0 < x < 1, where D is a constant. One set of appropriate functions is given in Corollary 1 with D = 0.

(b) Clearly, there are other triplets of functions (h, g, η) satisfying the conditions of Theorem 5. We presented one such triplet in Corollary 1.

#### 10. Concluding remarks

A special case of the gamma-generated family of distributions, the gamma-Kumaraswamy distribution, is defined and studied. Various properties of the gamma-Kumaraswamy distribution are investigated, including moments, hazard function, and reliability parameter. The new model includes as special sub-models the gamma and Kumaraswamy distribution. Also, we provide various characterizations of the gamma-Kumaraswamy distribution. An application to a real data set shows that the fit of the new model is superior to the fits of its main submodels. As future work related to this univariate GK model, we will consider the following:

• A natural bivariate extension to the model in Eq. (1) would be

$$f(x, y) \propto \frac{g(x, y)}{\Gamma(\alpha)\beta^{\alpha}\,\overline{G}^{2}(x, y)}\exp\left(-\frac{g(x, y)}{\beta\,\overline{G}(x, y)}\right)\left(\frac{g(x, y)}{\beta\,\overline{G}(x, y)}\right)^{\alpha-1},\quad x > 0,\ y > 0. \tag{46}$$

In this case, exact evaluation of the normalizing constant would be difficult to obtain, even for a simple analytic expression of a baseline bivariate distribution function, G(x, y). Numerical methods such as Monte Carlo methods of integration might be useful here. We will study and discuss structural properties of such a bivariate GK model.

• Extension of the proposed univariate GK model to multivariate GK models, with a discussion of the associated inferential issues. It is noteworthy that classical methods of estimation, such as the maximum likelihood method, might not be a good strategy because of the large number of model parameters. An appropriate Bayesian inference might be the only remedy. In that case, we will separately study two different cases of estimation: (a) with non-informative priors and (b) with full conditional conjugate priors (Gibbs sampling). Since the GK distribution is in the one-parameter exponential family, a reasonable choice of priors for α and β might well be gamma priors with an appropriate choice of hyper-parameters. For prior choices of the parameters that come from the baseline G(.) distribution function, a data-driven prior approach will be more suitable.

• A discrete analog of the univariate GK model, with a possible application in modeling rare events.

• Construction of a new class of GK mixture models by adopting the Marshall-Olkin method of obtaining new distributions.


## Author details

Indranil Ghosh<sup>1</sup> \* and Gholamhossein G. Hamedani<sup>2</sup>

\*Address all correspondence to: jamesbond.indranil@gmail.com

1 Department of Mathematics and Statistics, University of North Carolina, Wilmington, Wilmington, NC, USA

2 Department of Mathematics, Statistics and Computer Science, Marquette University, Milwaukee, WI, USA

## References

[16] Hamedani, G.G.: Characterizations of univariate continuous distributions. Studia Scientiarum Mathematicarum Hungarica, 2006; 43: 361–385.

[17] Hamedani, G.G.: Characterizations of continuous univariate distributions based on the truncated moments of functions of order statistics. Studia Scientiarum Mathematicarum Hungarica, 2010; 47: 462–484.


## **Nonlinear Transformations and Radar Detector Design**

Graham V. Weinberg

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/65677

## **Abstract**

A nonlinear transformation is introduced, which can be used to compress a series of random variables. For a certain class of random variables, the compression results in the removal of unknown distributional parameters from the resultant series. Hence, the application of this transformation is investigated from a radar target detection perspective. It will be shown that it is possible to achieve the constant false alarm rate property through a simple manipulation of this transformation. Due to the effect the transformation has on the cell under test, it is necessary to couple the approach with binary integration to achieve reasonable results. This is demonstrated in an X-band maritime surveillance radar detection context.

**Keywords:** transformations, random variable properties, radar detection, mathematical statistics, radar

## **1. Introduction**

The fundamental problem to be examined in this chapter is the detection of targets embedded within the sea surface, from an airborne maritime surveillance radar. Artifacts of interest could be lifeboats or aircraft wreckage resulting from aviation or maritime disasters. From a military perspective, one may be interested in the detection and tracking of submarine periscopes. Another scenario may be the detection of illegal fishing vessels or small boats used for smuggling of people or contraband. An airborne maritime surveillance radar has a difficult task in the detection of such objects from high altitude, while surveying a very large surveillance volume.

Such radars operate at X-band and are high resolution, and as such are affected by backscattering from the sea surface, which is referred to as clutter. This backscattering tends to mask small targets and makes the surveillance task extremely difficult. One of the major issues with the design of radar detection schemes is the minimization of the detection of false targets, while maximizing the detection of real targets. As a statistical hypothesis test, one can apply the Neyman-Pearson Lemma to produce a decision rule that achieves these objectives. However, in many cases, such a decision rule requires clutter model parameter approximations as well as estimates of the target strength based upon sampled returns. An issue, well known within the radar community, is that small variations in the clutter power level can result in huge increases in the number of false alarms. Since clutter power is a function of the underlying clutter model's parameters, approximations of the latter will have an inevitable effect on the former. Hence a large body of research has been devoted to designing radar detection strategies that maintain a fixed level of false alarms. A detector that achieves this objective is said to have the constant false alarm rate (CFAR) property [1].

In order to maintain a fixed rate of false alarms, sliding window decision rules were examined in early studies of radar detection strategies [2–6]. These investigations have been extended to account for different clutter models and to address issues with earlier detector design in a number of subsequent analyses [7–15]. Such decision rules can be formulated as follows. Suppose that the statistic *Z* is the return to be tested for the presence of a target. Let *Z*1 , *Z*<sup>2</sup> , …, *ZN* be *N* statistics from which a measurement of the level of clutter is taken, via some function *f* = *f* (*Z*<sup>1</sup> , *Z*<sup>2</sup> , …, *ZN*). Then a target is declared present in the case where *Z* is larger than a constant times *f*. The constant is selected so that in ideal scenarios, the false alarm rate remains fixed. It is generally assumed that the clutter statistics are independent and identically distributed in ideal settings, and also independent of the statistic *Z*. This can be formulated as a statistical hypothesis test by letting *H*<sup>0</sup> be the hypothesis that the cell under test (CUT) statistic *Z* does not contain a target, and *H*<sup>1</sup> the alternative that it contains a target embedded within clutter. Then the test is written

$$Z \underset{H_0}{\overset{H_1}{\gtrless}} \tau f(Z_1, Z_2, \dots, Z_N), \tag{1}$$

where *τ* > 0 is the threshold constant and the notation used in Eq. (1) means that *H*<sup>0</sup> is rejected when *Z* > *τf* (*Z*<sup>1</sup> , *Z*<sup>2</sup> , …, *ZN*). The probability of false alarm is given by

$$\mathrm{Pfa} = \mathbb{P}\left(Z > \tau f(Z_1, Z_2, \dots, Z_N)\,\middle|\, H_0\right). \tag{2}$$

If *τ* can be determined, for a specified Pfa in Eq. (2), such that it is independent of clutter parameters, then the decision rule in Eq. (1) will be able to maintain the CFAR property in ideal scenarios. In practical radar systems, a detection scheme such as in Eq. (1) can be run across the data returns sequentially to allow binary decisions on the presence of targets to be made, which are then passed to a tracking algorithm. A comprehensive examination of such detection processes is included in [1].

This chapter examines an alternative approach to achieve the CFAR property, based upon a nonlinear transformation that is used to compress the original clutter sequence. The consequence of this is that the resulting transformed series of random variables will have a fixed clutter power level and so permits a CFAR detector to be proposed. It is then shown how this transformation can be used to produce a practical radar detection scheme.

The chapter is organized as follows. Section 2 introduces the nonlinear mapping and formulates a decision rule. Section 3 specializes this to the case of Pareto distributed sequences, since the Pareto model is suitable for X-band maritime surveillance radar clutter returns. Section 4 demonstrates detector performance in homogeneous clutter, while Section 5 applies the decision rules directly to synthetic target detection in real X-band radar clutter.

## **2. Transformations and decision rule**

#### **2.1. Mapping**


In X-band maritime surveillance radar, the Pareto distribution has become of much interest as a clutter intensity model due to its validation relative to real radar clutter returns [16–18]. This model arises as the intensity distribution of a compound Gaussian model with inverse Gamma texture. Consequently, the Pareto distribution fits into the currently accepted radar clutter model phenomenology [19]. Hence, there have been a number of recent advances in the design of CFAR processes under a Pareto clutter model assumption [20–25].

A random variable *X* has a Pareto distribution [26] with shape parameter *α* > 0 and scale parameter *β* > 0 if its cumulative distribution function (cdf) is

$$F_X(t) = \mathbb{P}(X \le t) = 1 - \left(\frac{\beta}{t}\right)^{\alpha}, \tag{3}$$

for *t* ≥ *β*. The density of *X* follows by differentiation of Eq. (3). In order to ensure the existence of the first two moments, it is usually assumed that *α* > 2, which is an assumption that has been validated in fits of this model to real data [18]. This Pareto model possesses what is referred to as a duality property in Ref. [20]. To introduce this, recall that if *Y* is an Exponential random variable with unity mean, its cdf is given by

$$F_Y(t) = 1 - e^{-t}, \tag{4}$$

for *t* ≥ 0. Then it can be shown that the Pareto model in Eq. (3) can be related to Eq. (4) via the random variable relationship

$$X = \beta\, e^{Y/\alpha}. \tag{5}$$

Other random variables of interest in radar signal processing, such as the Weibull, can also be expressed in a form similar to Eq. (5). Hence, for the purposes of generality, suppose {*Xj* , *j* ∈ IN := {0, 1, 2, …}} is a sequence of homogeneous random variables with common support and that *θ*<sup>1</sup> and *θ*<sup>2</sup> are two fixed real constants. Define a sequence of random variables {*Zj* , *j* ∈ IN} by

$$Z_j = \theta_1\, X_j^{\theta_2}. \tag{6}$$

The sequence produced via Eq. (6) is a generalization of the Pareto model (3). Next define a nonlinear mapping *ζ* : IR+ × IR+ × IR+ × IR+ → IR+ ∪ {0} by

$$\zeta(x_1, x_2, x_3, x_4) = \left|\frac{\log(x_1) - \log(x_2)}{\log(x_3) - \log(x_4)}\right|, \tag{7}$$

where each *x<sub>j</sub>* > 0, *x*<sub>3</sub> ≠ *x*<sub>4</sub>, and IR+ is the set of positive real numbers. Then the following result is relatively easy to prove:

**Lemma 2.1** *Suppose* {*Z<sub>j</sub>*, *j* ∈ IN} *is a sequence of random variables defined via Eq. (6). Then the sequence* {*W<sub>j</sub>*, *j* ∈ IN} *with W<sub>j</sub>* := *ζ*(*Z<sub>j</sub>*, *Z<sub>j+1</sub>*, *Z<sub>j+2</sub>*, *Z<sub>j+3</sub>*) *does not depend on θ*<sub>1</sub> *and θ*<sub>2</sub>*.*

The proof of Lemma 2.1 is now outlined. Supposing that *Z<sub>j</sub>*, *Z<sub>j+1</sub>*, *Z<sub>j+2</sub>* and *Z<sub>j+3</sub>* are represented in the form defined via Eq. (6), it follows that

$$\begin{split} W_j &= \left|\frac{\log\left(\theta_1 X_j^{\theta_2}\right) - \log\left(\theta_1 X_{j+1}^{\theta_2}\right)}{\log\left(\theta_1 X_{j+2}^{\theta_2}\right) - \log\left(\theta_1 X_{j+3}^{\theta_2}\right)}\right| \\ &= \left|\frac{\theta_2\log\left(X_j\right) - \theta_2\log\left(X_{j+1}\right)}{\theta_2\log\left(X_{j+2}\right) - \theta_2\log\left(X_{j+3}\right)}\right| \\ &= \left|\frac{\log\left(X_j\right) - \log\left(X_{j+1}\right)}{\log\left(X_{j+2}\right) - \log\left(X_{j+3}\right)}\right|, \end{split} \tag{8}$$

where properties of the logarithmic function have been utilized. Since Eq. (8) does not depend on *θ*<sub>1</sub> and *θ*<sub>2</sub>, the proof is completed.

Lemma 2.1 suggests that if the original sequence of random variables is processed in 4-tuples, the compressed sequence's statistical structure is only dependent on the random variables *X<sub>j</sub>*. Observe that the Lemma does not require an independence assumption. Thus if the sequence {*X<sub>j</sub>*} has no unknown statistical parameters, the process generated by Eq. (7) also has no unknown parameters. This suggests that processing of a data sequence in terms of 4-tuples may be an effective way in which to achieve the CFAR property. The next subsection clarifies this.
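Lemma 2.1 is easy to confirm numerically. The sketch below is illustrative and not the chapter's code: it draws a homogeneous exponential sequence, forms Z_j = θ1 X_j^θ2 for two very different choices of (θ1, θ2), applies ζ to 4-tuples, and compares the resulting empirical distributions; the exponential choice of X_j and the (θ1, θ2) values are assumptions.

```r
## Numerical check of Lemma 2.1: the distribution of W does not depend on theta1, theta2.
set.seed(7)
zeta <- function(x1, x2, x3, x4) abs((log(x1) - log(x2)) / (log(x3) - log(x4)))  # Eq. (7)

w_sample <- function(theta1, theta2, m = 4e4) {
  x <- matrix(rexp(4 * m), ncol = 4)      # homogeneous X_j (unit-mean exponential, assumed)
  z <- theta1 * x^theta2                  # Eq. (6)
  zeta(z[, 1], z[, 2], z[, 3], z[, 4])
}
w1 <- w_sample(theta1 = 0.05, theta2 = 3)
w2 <- w_sample(theta1 = 10,   theta2 = 0.5)
ks.test(w1, w2)                           # the two samples agree up to Monte Carlo error
```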

#### **2.2. Decision rule**

In order to propose a decision rule exploiting the transformation introduced in Lemma 2.1, it is necessary to focus first on a series of four returns. Hence, suppose we have a CUT statistic *Z*, and three clutter measurements are available, denoted *Z*<sup>1</sup> , *Z*<sup>2</sup> and *Z*<sup>3</sup> . Let *H*<sup>0</sup> be the hypothesis that the CUT contains no target, and *H*<sup>1</sup> the hypothesis that it does contain a target embedded within clutter. Then, based upon Eq. (7), a linear threshold test takes the form

$$\zeta(Z, Z_1, Z_2, Z_3) \underset{H_0}{\overset{H_1}{\gtrless}} \tau, \tag{9}$$

where *τ* > 0 is the threshold. Based upon Lemma 2.1 if the clutter is modelled by Eq. (6), then it is clear that under *H*<sup>0</sup> , the Pfa of the test in Eq. (9) will not depend on *θ*<sup>1</sup> or *θ*<sup>2</sup> , implying it is CFAR with respect to these parameters. Furthermore, an auxiliary motivation for defining a linear threshold detector such as Eq. (9) is that in the cases where it is assumed that one has *a priori* knowledge of clutter parameters, linear threshold detectors are ideal, or asymptotically optimal, and hence provide the maximum probability of detection within the class of sliding window decision rules [27].

The test in Eq. (9) can also be re-expressed in terms of the preprocessed clutter statistics. In particular, it can be shown to be equivalent to rejecting *H*<sup>0</sup> if

$$Z > Z_1\, e^{\tau\left|\log(Z_2) - \log(Z_3)\right|} \quad \text{or} \quad Z < Z_1\, e^{-\tau\left|\log(Z_2) - \log(Z_3)\right|}, \tag{10}$$

with the appropriate choice for *τ*, which can be determined from the corresponding Pfa expression for Eqs. (9) or (10).

Observe that this test is not of the usual form found in the radar signal processing literature, since it compares a CUT with a measurement of clutter based upon three statistics, and not upon a sample of predetermined size. This will be discussed subsequently in terms of practical implementation of the test in Eq. (10). The next section discusses the application of Eq. (10) to the Pareto clutter case, enabling the determination of *τ*.

## **3. Specialization to the Pareto Clutter model**

## **3.1. Distributions under** *H***<sup>0</sup>**

Since the motivation of the work developed here is the design of radar detection schemes for maritime surveillance radar, the results of the previous section are specialized to the Pareto case. In order to apply Lemma 2.1, it is necessary to determine the distribution of the resultant sequence produced by *ζ* under *H*<sup>0</sup> . The following is the key result:

**Corollary 3.1** *In the case where the sequence of random variables in Lemma 2.1 is Pareto distributed and independent, the cdf of the sequence processed by ζ is given by*

$$F_P(t) = \frac{t}{t+1}, \tag{11}$$

for *t* ≥ 0.
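Corollary 3.1 can be verified directly by simulation. The sketch below is illustrative only: it draws independent Pareto variates with the spiky-clutter parameter values quoted later in the chapter (assumed here simply for concreteness), applies ζ to 4-tuples, and compares the empirical distribution with t/(t + 1).

```r
## Numerical check of Corollary 3.1 for independent Pareto clutter.
set.seed(11)
alpha <- 4.7241; beta <- 0.0446                        # illustrative Pareto clutter parameters
rpareto <- function(n) beta * runif(n)^(-1 / alpha)    # inverse-cdf sampler for Eq. (3)
zeta <- function(x1, x2, x3, x4) abs((log(x1) - log(x2)) / (log(x3) - log(x4)))
z <- matrix(rpareto(4 * 5e4), ncol = 4)
w <- zeta(z[, 1], z[, 2], z[, 3], z[, 4])
t0 <- c(0.5, 1, 2, 5, 10)
rbind(empirical = sapply(t0, function(t) mean(w <= t)),
      theory    = t0 / (t0 + 1))                       # Eq. (11)
```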

This can be recognized as a Pareto distribution, with support the nonnegative real line and shape and scale parameter unity. More specifically, *P* = *X* − 1, where *X* has density (3) with *α* = *β* = 1. This illustrates the cost of the nonlinear transformation approach: although the resultant series of clutter has no unknown clutter parameters, it is from a distribution with no finite moments. The independence assumption is adopted for analytical tractability and is consistent with the assumption that independent and identically distributed clutter returns are available, as in the formulation of the test in Eq. (1).

To prove Corollary 3.1, suppose that *η*<sup>1</sup> and *η*<sup>2</sup> are two independent random variables with cdf (4). Then by analyzing the difference *η*<sup>1</sup> − *η*<sup>2</sup> , it can be shown that it has cdf


$$F_{\eta_1-\eta_2}(t) = \begin{cases} 1-\frac{1}{2}e^{-t}, & \text{for } t \ge 0\\ \frac{1}{2}e^{t}, & \text{for } t < 0, \end{cases} \tag{12}$$

which is that of a Laplace distribution. Then it follows that

$$F_{|\eta_1-\eta_2|}(t) = \mathbb{P}(-t \le \eta_1-\eta_2 \le t) = 1 - e^{-t}, \tag{13}$$

where Eq. (12) has been applied, and *t* > 0. Thus the modulus of the difference is also exponentially distributed with unit mean.

Supposing that *κ*<sup>1</sup> and *κ*<sup>2</sup> are two independent random variables with cdf Eq. (13), then by statistical conditioning

$$F_{\kappa_1/\kappa_2}(t) = \mathbb{P}(\kappa_1 \le t\,\kappa_2) = \int_0^{\infty}\mathbb{P}(\kappa_1 \le t\omega)\, e^{-\omega}\, d\omega, \tag{14}$$

and an application of Eqs. (4)–(14) shows that the ratio has cdf Eq. (11) with an evaluation of the integral. This establishes the result in Corollary 3.1, as required.

#### **3.2. Thresholds and the CUT**

Based upon Corollary 3.1 the univariate threshold for the Pareto case is given by

$$\tau = \text{Pfa}^{-1} - 1. \tag{15}$$

The threshold (Eq. (15)) illustrates the issues with the nonlinear mapping, as this threshold will be quite large for appropriate Pfa. Note that for a Pfa of 10<sup>−6</sup>, *τ* = 10<sup>6</sup> − 1. This threshold will increase as the Pfa decreases. In the Pareto setting, it is shown in Ref. [20] that an ideal detector has its threshold set via *β*(Pfa)<sup>−1/*α*</sup>. In the case where *α* = 4.7241 and *β* = 0.0446 (which correspond to spiky clutter returns) and with the Pfa set to 10<sup>−6</sup>, this threshold is 0.8312 by contrast. Thus the nonlinear mapping, in the process of compressing the original data series, can be used to achieve the CFAR property with Eq. (10), but detection performance may be unacceptable.
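Both thresholds quoted here are straightforward to reproduce; a minimal sketch, using the parameter values stated in the text, is as follows.

```python
# Threshold of the transformed-domain detector, Eq. (15), compared with the
# ideal Pareto detector threshold beta * Pfa**(-1/alpha) quoted from Ref. [20].
pfa = 1e-6
alpha, beta = 4.7241, 0.0446            # spiky-clutter parameters used in the text

tau_transformed = pfa**-1 - 1           # Eq. (15): 10**6 - 1
tau_ideal = beta * pfa**(-1.0 / alpha)  # approximately 0.83

print(tau_transformed, tau_ideal)
```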

To explore this further, it is informative to examine the detection scheme in Eq. (10) when there is a target model present. Suppose *Ξ* is the CUT statistic, in the case where a target is present in the clutter, in the pretransformed data. Let *Ξ̂* be the CUT in the transformed domain, meaning the CUT appearing in the detector Eq. (10) when there is a target present, so that it is the intensity measurement of a return signal and clutter in the complex domain. Then by applying the left-hand expression for Pareto random variables in Eq. (5), we can write

$$\hat{\Xi} = \alpha \left| \frac{\log(\Xi/\beta) - E_{1}}{E_{2} - E_{3}} \right|,\tag{16}$$

where each *E<sup>j</sup>* is an independent exponentially distributed random variable with unit mean. Then with an application of results from the proof of Corollary 3.1, since |*E*<sup>2</sup> − *E*<sup>3</sup> | has the same exponential distribution, one can apply statistical conditioning on *E*<sup>1</sup> and |*E*<sup>2</sup> − *E*<sup>3</sup> | to show that the distribution function of the transformed CUT is

$$\begin{split} F_{\hat{\Xi}}(t) &= \int_0^{\infty}\int_0^{\infty} e^{-\theta} e^{-\phi}\, \mathbb{P}\left( \left| \log(\Xi/\beta) - \theta \right| \le \frac{\phi t}{\alpha} \right) d\theta\, d\phi \\ &= \int_0^{\infty}\int_0^{\infty} e^{-\theta} e^{-\phi}\, \mathbb{P}\left( \beta\, e^{\theta - \frac{\phi t}{\alpha}} \le \Xi \le \beta\, e^{\theta + \frac{\phi t}{\alpha}} \right) d\theta\, d\phi \\ &= \int_0^{1}\int_0^{1} \mathbb{P}\left( \beta\, x^{-1}\, y^{\frac{t}{\alpha}} \le \Xi \le \beta\, x^{-1}\, y^{-\frac{t}{\alpha}} \right) dx\, dy \\ &= \int_0^{1}\int_0^{1} \left[ F_{\Xi}\left( \beta\, x^{-1}\, y^{-\frac{t}{\alpha}} \right) - F_{\Xi}\left( \beta\, x^{-1}\, y^{\frac{t}{\alpha}} \right) \right] dx\, dy, \end{split} \tag{17}$$

**Figure 1.** Comparison of CUT for the pretransformed data (denoted pre) and data processed via the nonlinear mapping (denoted post). The CUT is plotted for a Swerling 1 target model with a given SCR as indicated.

where the change of variables *x* = *e*<sup>−*θ*</sup> and *y* = *e*<sup>−*φ*</sup> has been applied. Thus the transformed CUT can be generated from the pretransformed CUT via Eq. (17). To examine this, **Figure 1** plots Eq. (17) in the case of a Swerling 1 target model embedded within Pareto distributed clutter with *α* = 4.7241 and *β* = 0.0446. A Swerling 1 target model is essentially a bivariate Gaussian model, which is combined with the Pareto model by embedding the latter into a compound Gaussian process with inverse Gamma texture in the complex domain, and then taking the modulus squared to produce the intensity measurement [20]. The distribution function of *Ξ* can also be found in Ref. [20] for the case of interest. **Figure 1** shows the pretransformed CUT as well as Eq. (17), in the cases where the signal to clutter ratio (SCR) is 1, 10, 50, and 100 dB. For the case of a 1 dB target model, the CUT has its range of potential values increased under the transformation. The same is true for the 10 dB case. Interestingly, for the 50 dB and 100 dB cases, the situation is reversed. Hence, as the SCR increases, the nonlinear mapping suppresses the target SCR, reducing the range of admissible values for the transformed CUT. This suggests that although the nonlinear mapping removes unknown clutter parameters, it may also impede detection due to target suppression. If the threshold is set via Eq. (15), then it is clear from **Figure 1** that it will be very difficult to detect targets with a reasonably small Pfa. Hence, the new detection scheme must be combined with an integration process to rectify this.
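For readers wishing to reproduce curves such as those in **Figure 1**, Eq. (17) can be evaluated by two-dimensional quadrature once a distribution function for *Ξ* is supplied. The sketch below is illustrative only: it uses the clutter-only Pareto cdf as a stand-in for *F<sub>Ξ</sub>* (the Swerling 1 target-plus-clutter distribution function of Ref. [20] is not reproduced here). With clutter only, the output should reduce to the *t*/(1 + *t*) law associated with Corollary 3.1, which provides a check on the quadrature.

```python
import numpy as np
from scipy.integrate import dblquad

alpha, beta = 4.7241, 0.0446

def F_clutter(x):
    """Pareto cdf used here as a stand-in for F_Xi (clutter-only case)."""
    return np.where(x >= beta, 1.0 - (beta / x) ** alpha, 0.0)

def F_transformed(t, F_xi):
    """Eq. (17): cdf of the transformed CUT, by double quadrature over (x, y)."""
    integrand = lambda y, x: float(F_xi(beta * x**-1 * y**(-t / alpha))
                                   - F_xi(beta * x**-1 * y**(t / alpha)))
    value, _ = dblquad(integrand, 0.0, 1.0, 0.0, 1.0)
    return value

for t in (0.5, 1.0, 2.0):
    # With clutter only, Eq. (17) should recover t / (1 + t)
    print(t, F_transformed(t, F_clutter), t / (1.0 + t))
```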

## **4. Performance in homogeneous clutter**

#### **4.1. Methodology and data**


In order to examine the performance of the proposed detection scheme (9), clutter is simulated under the assumption of a Pareto clutter model, which has been found to fit Defence Science and Technology Group's (DSTG's) real X-band maritime surveillance radar data sets. Ingara is an experimental X-band imaging radar which has provided real clutter for the analysis of detector performance [28]. A trial in 2004 produced a series of clutter sets that have been analyzed from a statistical perspective in Ref. [29]. During the trial, the radar operated in a circular spotlight mode, surveying the same patch of the Southern Ocean at different azimuth and grazing angles. Additionally, the radar provided full polarimetric data. For the purposes of the numerical work to follow, focus is restricted to one particular data set. This is run 34683, at an azimuth angle of 225*°*, which is approximately in the upwind direction. Additionally, the numerical analysis focuses on the horizontal transmit and receive (HH) case.

For performance analysis in homogeneous clutter, the data is simulated with distributional parameters matched to those obtained from the Ingara data set. The data consists of 821 pulses with 1024 range compressed samples, from which maximum likelihood estimates of the distributional parameters can be obtained from the intensity measurements. Under the Pareto model assumption, the estimates are *α̂* = 4.7241 and *β̂* = 0.0446.

As remarked previously, it is necessary to couple (10) with an integration scheme to enhance its performance. The integration scheme used for this purpose is binary integration, which is well described in Ref. [30]; an application of it in a Pareto distributed clutter environment can be found in Ref. [31]. Such a process applies a series of *M* ≥ 1 tests of Eq. (10) and concludes that a target is likely to be present in the radar clutter if at least *S* out of *M* tests return a detection [30], where *S* ∈ {1, 2, …, *M*}. Selection of an appropriate *S* is outlined in Ref. [31]. Essentially, it is pointed out in Ref. [32] that for a specified univariate cumulative detection probability and false alarm rate and a fixed number of maximum binary integration returns *M*, there exists an optimal *S* which minimizes the required signal to clutter ratio and maximizes the binary integration gain. This can be done visually or numerically by plotting the minimum SCR as a function of *S*, under the assumption of a certain signal model. This approach, and the analysis in Ref. [31], shows that in the current context, the choice of *S* = 3 with *M* = 8 should provide good results. Relative to the problem addressed in this chapter, applying binary integration with a linear threshold detector in the transformed clutter domain is not computationally expensive, and thus is seen as a reasonable solution.

If Pfa<sub>BI</sub> denotes the Pfa for binary integration, then it can be expressed in terms of the univariate detection process's Pfa through the equation

$$\text{Pfa}_{\text{BI}} = \sum_{j=S}^{M} \binom{M}{j} \text{Pfa}^{j} \left(1 - \text{Pfa}\right)^{M-j}. \tag{18}$$

The threshold *τ* is set via Eq. (18) coupled with the univariate Pfa from Eq. (9).
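A minimal sketch of this threshold-setting step follows: Eq. (18) is inverted numerically for the univariate Pfa corresponding to a desired post-integration false alarm rate, and the univariate threshold is then obtained from Eq. (15). The target Pfa<sub>BI</sub> value below is illustrative only.

```python
from math import comb
from scipy.optimize import brentq

S, M = 3, 8
pfa_bi_target = 1e-6    # desired false alarm probability after binary integration

def pfa_bi(pfa, S=S, M=M):
    """Eq. (18): S-out-of-M binary integration false alarm probability."""
    return sum(comb(M, j) * pfa**j * (1 - pfa)**(M - j) for j in range(S, M + 1))

# Invert Eq. (18) for the univariate Pfa, then set the threshold via Eq. (15)
pfa_single = brentq(lambda p: pfa_bi(p) - pfa_bi_target, 1e-12, 0.5)
tau = pfa_single**-1 - 1
print(pfa_single, tau)
```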

To simulate detection performance, the probability of detection (Pd) is estimated using 10<sup>6</sup> Monte Carlo runs, based upon a Swerling 1 target model assumed for the CUT. For each SCR, the binary integration process is run using *S* = 3 out of *M* = 8 binary integration. The motivation for these choices can be found in Ref. [31]. In order to assess the robustness of the detection scheme to interference, up to two interfering targets are inserted into the clutter measurements to give an indication of the performance with interference. Thus independent Swerling 1 targets, with interference to clutter ratio (ICR) of 1 dB, are applied to *Z*<sub>1</sub> (denoted Inter 1 in the plots), then to *Z*<sub>2</sub> (denoted Inter 2), and then to both *Z*<sub>1</sub> and *Z*<sub>2</sub> (denoted Inter 3) in the univariate decision rule in Eq. (9). A real spurious target may only appear in a subset of the clutter measurements and so this analysis of interference can be viewed as an upper bound on poor performance.

#### **4.2. Receiver operating characteristic curves**


Receiver operating characteristic (ROC) curves are used to examine the performance; these plot the probability of detection as a function of the false alarm probability, when the target in the CUT is at a fixed SCR. **Figures 2**–**4** provide examples of the performance of the new detector Eq. (10) with binary integration and compare it to the performance of some of the recently introduced detectors designed for operation in a Pareto clutter model environment. For a CUT *Z* and clutter range profile *Z*<sub>1</sub>, *Z*<sub>2</sub>, …, *Z<sub>N</sub>*, the Geometric Mean (GM) CFAR is

$$Z \overset{H_1}{\underset{H_0}{\gtrless}} \beta^{1-N\zeta} \prod_{j=1}^{N} Z_j^{\zeta}, \tag{19}$$

which is shown in Ref. [20] to have its threshold set via *ζ* = Pfa<sup>−1/*N*</sup> − 1. Similarly, an Order Statistic (OS)-CFAR has been analyzed in Ref. [22], which is given by

$$Z \overset{H_1}{\underset{H_0}{\gtrless}} \beta^{1-\nu_j} Z_{(j)}^{\nu_j}, \tag{20}$$

which has its threshold multiplier *ν<sup>j</sup>* set via inversion of the Pfa equation given by

$$\text{Pfa} = \frac{N!}{(N-j)!} \frac{\Gamma(\nu_j + N - j + 1)}{\Gamma(\nu_j + N + 1)}\,, \tag{21}$$

where the OS index 1 ≤ *j* ≤ *N* and the notation *ν<sup>j</sup>* emphasizes the fact that *ν<sup>j</sup>* depends on the selected OS index *j*. Observe that both these decision rules require *a priori* knowledge of *β*. In order to provide a valid comparison with Eq. (10), these detectors have been applied with *N* = 3 and coupled with binary integration. Due to this, there are three choices available for *j*, corresponding to a minimum (denoted MIN, when *j* = 1), median (MED, *j* = 2), and maximum (MAX, *j* = 3).
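The comparison detectors' threshold multipliers are simple to compute: the GM-CFAR multiplier follows in closed form from the expression quoted above, while the OS-CFAR multipliers *ν<sub>j</sub>* are obtained by numerically inverting Eq. (21) for each OS index *j*. The Pfa used in this sketch is illustrative only; in the reported experiments the detectors are additionally coupled with binary integration.

```python
from math import lgamma, factorial, exp
from scipy.optimize import brentq

N, pfa = 3, 1e-4    # illustrative values

# GM-CFAR threshold multiplier for Eq. (19), from Ref. [20]
zeta = pfa**(-1.0 / N) - 1

def pfa_os(nu, j, N=N):
    """Eq. (21): false alarm probability of the OS-CFAR with OS index j."""
    return (factorial(N) / factorial(N - j)
            * exp(lgamma(nu + N - j + 1) - lgamma(nu + N + 1)))

# Threshold multipliers nu_j for j = 1 (MIN), 2 (MED), 3 (MAX), by inverting Eq. (21)
nu = {j: brentq(lambda v, j=j: pfa_os(v, j) - pfa, 1e-9, 1e9) for j in (1, 2, 3)}
print(zeta, nu)
```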

**Figure 2.** Comparison of detectors with a small target SCR.

**Figure 3.** Detector performance with a larger SCR in the CUT.

**Figure 2** compares the performance of these decision rules, where the detection process (10) coupled with binary integration is denoted as the nonlinear mapping (NLM). In this case, the CUT SCR is 5 dB, representing a small target. As can be observed, the new decision rule has superior performance. The same experiment is repeated in **Figure 3**, where the CUT SCR is 15 dB, and then it is increased to 20 dB in **Figure 4**. These results show that the new detection process has superior performance, while not requiring *a priori* knowledge of the Pareto scale parameter. These results validate the application of Eq. (10) to target detection in spiky X-band clutter with binary integration.

It is interesting to note that as *M* is increased, there is very little gain in performance. To demonstrate this, **Figure 5** repeats the same scenario in **Figure 4** except *M* has been increased to

**Figure 4.** Decision rule performance with a CUT SCR of 20 dB.

**Figure 5.** Decision rule performance with a CUT SCR of 20 dB, where the binary integration is *S* = 3 out of *M* = 30.

30. Comparing **Figures 4** and **5**, it is clear that there is very little gain. However, the computational complexity increases dramatically as *M* is increased. Hence, in a practical implementation of the binary integration process, it is more efficient to select *M* small.

#### **4.3. Effect of interference**


Next the cost of interference on the new decision rule is examined, and for brevity, only this decision rule is considered. **Figure 6** shows the case where the CUT has SCR of 5 dB, and the decision rule (10) coupled with binary integration is denoted BI, while the three interference cases are marked appropriately. Here we observe quite good performance that decreases with the interference. **Figure 7** shows the result of increasing the SCR in the CUT to 20 dB. The result is an expected detection performance improvement as shown.

**Figure 6.** Performance of new detector when subjected to interference.

**Figure 7.** ROC for higher SCR with interference.

## **5. Performance in real data**

As a final test of the proposed detection scheme, it was run directly on the Ingara data set under consideration, with the insertion of a synthetic Swerling 1 target and interference as for the homogeneous case. A sliding window was run across the data sequentially, and detection performance was estimated by running the 3 out of 8 detection scheme, resulting in a run length of 840,672. The Ingara data is slightly correlated from cell to cell, and so the detector Eq. (9), which has its threshold set via an independence assumption, becomes a suboptimal decision rule. Detection performance under both clutter model assumptions is plotted on the same ROC curve to compare performance on the real data more easily. The same scenario is repeated as for the analysis under homogeneous independent clutter.

**Figure 8** shows detection performance with the CUT SCR of 5 dB, while **Figure 9** repeats the same numerical experiment as for **Figure 8**, except the CUT has SCR of 20 dB. Comparing

**Figure 8.** Performance of the detectors on the Ingara data directly.

**Figure 9.** Second example of performance of the detectors on the Ingara data directly.

**Figure 8** with **Figure 6**, we observe that correlation in the real data is having an effect on the performance. The new decision rule is designed to operate in independent homogeneous clutter returns, and so there is a serious degradation in performance. The same situation is observed at a larger CUT SCR (comparing **Figures 9** and **7**).

## **6. Conclusions**


A nonlinear transformation was introduced and shown to remove clutter parameter dependence for a class of statistical models. This was used to formulate a simple linear threshold detector in the transformed clutter domain. Due to issues with the magnitude of detection thresholds, it was necessary to couple the approach with binary integration.

Analysis of detection performance in simulated clutter showed good detection performance. Interference had a strong impact on performance as expected. When the detection process was applied directly to real data, similar results were observed. Nonetheless, the nonlinear transformation, coupled with binary integration, resulted in reasonable detection performance while guaranteeing the CFAR property is preserved.

## **Author details**

Graham V. Weinberg

Address all correspondence to: Graham.Weinberg@defence.gov.au

National Security, Intelligence, Surveillance, Reconnaissance Division, Defence Science, Technology Group, Edinburgh, South Australia, Australia

## **References**


[1] Minkler, G., Minkler, J.: CFAR: The Principles of Automatic Radar Detection in Clutter, Magellan, Baltimore, 1990.

[2] Finn, H. M., Johnson, R. S.: Adaptive Detection Model with Threshold Control as a Function of Spatially Sampled Clutter-Level Estimates, *RCA Review*, 1968, **29**, pp. 414–464.

[3] Nitzberg, R.: Low-Loss Almost Constant False-Alarm Rate Processors, *IEEE Transactions on Aerospace and Electronic Systems*, 1979, **AES-15**, pp. 719–723.

[4] Weiss, M.: Analysis of Some Modified Cell-Averaging CFAR Processors in Multiple-Target Situations, *IEEE Transactions on Aerospace and Electronic Systems*, 1982, **AES-18**, pp. 102–114.

[5] Rohling, H.: Radar CFAR Thresholding in Clutter and Multiple Target Situations, *IEEE Transactions on Aerospace and Electronic Systems*, 1983, **AES-19**, pp. 608–621.

[6] Nitzberg, R.: Clutter Map CFAR Analysis, *IEEE Transactions on Aerospace and Electronic Systems*, 1986, **AES-22**, pp. 419–421.

[7] Gandhi, P. P., Kassam, S. A.: Analysis of CFAR Processors in Nonhomogeneous Background, *IEEE Transactions on Aerospace and Electronic Systems*, 1988, **24**, pp. 427–445.

[8] Chen, W.-S., Reed, I. S.: A New CFAR Detection Test for Radar, *Digital Signal Processing*, 1991, **1**, pp. 198–214.

[9] Al-Hussaini, E. K., El-Mashade, M. B.: Performance of Cell-Averaging and Order-Statistic CFAR Detectors Processing Correlated Sweeps for Multiple Interfering Targets, *Signal Processing*, 1996, **49**, pp. 111–118.

[10] Hamadouche, M., Barakat, M., Khodja, M.: Analysis of the Clutter Map CFAR in Weibull Clutter, *Signal Processing*, 2000, **80**, pp. 117–123.

[11] Laroussi, T., Barkat, M.: Performance Analysis of Order-Statistic CFAR Detectors in Time Diversity Systems for Partially Correlated Chi-Square Targets and Multiple Target Situations: A Comparison, *Signal Processing*, 2006, **86**, pp. 1617–1631.

[12] Erfanian, S., Vakili, V. T.: Introducing Excision Switching-CFAR in K Distributed Sea Clutter, *Signal Processing*, 2009, **89**, pp. 1023–1031.

[13] Zhang, R., Sheng, W., Ma, X.: Improved Switching CFAR Detector for Non-Homogeneous Environments, *Signal Processing*, 2013, **93**, pp. 35–48.

[14] Zhang, R.-I., Sheng, W.-X., Ma, X.-F., Han, Y.-B.: Constant False Alarm Rate Detector based on the Maximal Reference Cell, *Digital Signal Processing*, 2013, **23**, pp. 1974–1988.

[15] Zaimbashi, A.: An Adaptive Cell Averaging-Based CFAR Detector for Interfering Targets and Clutter-Edge Situations, *Digital Signal Processing*, 2014, **31**, pp. 59–68.

[16] Balleri, A., Nehorai, A., Wang, J.: Maximum Likelihood Estimation for Compound-Gaussian Clutter with Inverse-Gamma Texture, *IEEE Transactions on Aerospace and Electronic Systems*, 2007, **43**, pp. 775–779.

[31] Weinberg, G. V., Kyprianou, R.: Optimised Binary Integration with Order Statistic CFAR in Pareto Distributed Clutter, *Digital Signal Processing*, 2015, **42**, pp. 50–60.

[32] Frey, T. L.: An Approximation for the Optimum Binary Integration Threshold for Swerling II Targets, *IEEE Transactions on Aerospace and Electronic Systems*, 1996, **32**, pp. 1181–1184.


#### **Distributions and Composite Models for Size-Type Data**

Yves Dominicy and Corinne Sinner

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66443

#### Abstract


In the first part of this chapter, we present a sample of the best known and most used classical size distributions with their main statistical properties. In the second part, we introduce the concept of composite models and based on the size distributions of the first part, we describe those which already exist in the literature. In the last part of this chapter, we apply the described statistical size distributions and some of the composite models to two real data examples and compare their goodness-of-fit.

Keywords: size distributions, composite models, lognormal, Pareto, Weibull

## 1. Introduction

In statistical modeling, the continuing aim is to find the probability law that best describes the observations arising from a given field and that represents the underlying data-generating process. The resulting probability distributions should possess desirable properties, such as the flexibility to model different shapes, while remaining tractable. This research avenue was initiated in the nineteenth century by famous mathematicians such as Adolphe Quetelet, Sir Francis Galton, and Vilfredo Pareto, and it has never ceased since. Nowadays, it remains among the most actively treated topics in statistics, as shown by the large number of scientific papers recently published on the subject (see for instance the review papers [1] and [2]). The current appeal of this topic is easily explained by the availability of large data sets in various scientific domains, which makes further research on the subject essential.

In this chapter, we concentrate on probability distributions that analyze size-type data. By size distributions, we mean probability laws designed to model data that only take positive values. Positive observations appear naturally in different fields: survival analysis [3, 4], environmental science [5], network traffic modeling [6], economics [7, 8], hydrology [9], and actuarial


science [10]. Given the range of various domains of application, there exists a plethora of different size distributions and it is still a very active research area [11, 12].

The structure of this chapter is as follows. In Section 2, we review the most used and wellknown size distributions, and state their main statistical properties. Section 3 introduces the notion of composite models and gives a small review of the composite models in the literature, based on the size distributions depicted in Section 2. In Section 4, we apply the described size distributions of Section 2 and some of the composite models of Section 3 to two real data sets, namely, an insurance data set and an Internet traffic data set. Finally, Section 5 concludes.

## 2. Review of size distributions

We describe here a sample of the best known and most used size distributions. We will state their probability density function (p.d.f.) and their cumulative distribution function (c.d.f.), show their moments and their quantile function, and give the estimators obtained via maximum likelihood estimation. More specifically, we take a closer look at the lognormal, Pareto, generalized Lomax, and generalized extreme value distributions.

#### 2.1. The lognormal distribution

The English statistician Sir Francis Galton stated that in some situations it was preferable to measure the location of a distribution with the geometric mean instead of the arithmetic mean [13]. Indeed, laws of nature often behave in multiplicative ways; thus, the geometric mean becomes more appropriate as a measure of central tendency than the arithmetic mean. As a reply to Galton's request, the Scottish physician Donald McAlister established in 1879 a theory of the exponentiated (or multiplicative) normal distribution [14], which became known as the lognormal distribution.

Let X be a positive random variable (r.v.) such that log X is normally distributed with parameters μ ∈ R and σ > 0. The r.v. X then has a lognormal distribution, X ~ LN(μ, σ<sup>2</sup>), with probability density function (p.d.f.)

$$f(x; \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi}\sigma}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}, \; x > 0. \tag{1}$$

The location parameter μ ∈ R and the scale parameter σ > 0 are characteristic for the r.v. logX. However, by the exponential transformation, the geometric mean e <sup>μ</sup> becomes a scale parameter, as depicted in Figure 1, and the multiplicative standard deviation e <sup>σ</sup> appears as shape parameter impacting the skewness (see Figure 2).

If random variability arises through multiplicative effects, as stated by Galton, then a lognormal distribution results. This establishes the basis of the multiplicative central limit theorem, which asserts that products (and hence geometric means) of many positive, nonlognormal random variables are approximately lognormally distributed.
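A short simulation illustrates this multiplicative mechanism: products of many independent, positive, non-lognormal factors have logarithms that are close to Gaussian, so the products themselves are approximately lognormal. The uniform factors below are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Products of many positive, non-lognormal factors (uniform on [0.5, 1.5] here)
factors = rng.uniform(0.5, 1.5, size=(50_000, 200))
products = factors.prod(axis=1)
logp = np.log(products)

# If the multiplicative central limit effect applies, log(products) is close to
# Gaussian: standardized skewness and excess kurtosis should both be near zero.
z = (logp - logp.mean()) / logp.std()
print("skewness:", (z**3).mean(), "excess kurtosis:", (z**4).mean() - 3)
```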


Figure 1. Density plots of the lognormal distribution with varying location parameter μ and fixed scale parameter σ.

Figure 2. Density plots of the lognormal distribution with varying scale parameter σ and fixed location parameter μ.

The cumulative distribution function (c.d.f.) of the lognormal law is related to the c.d.f. of the normal distribution:

$$F(x; \mu, \sigma^2) = \Phi\left(\frac{\log x - \mu}{\sigma}\right), \ x > 0,\tag{2}$$

where Φ(.) represents the c.d.f. of a standard normal distribution.

The moments of order r are conveniently expressed as E(X<sup>r</sup>) = exp(rμ + r<sup>2</sup>σ<sup>2</sup>/2). Hence, the mean is given by E(X) = exp(μ + σ<sup>2</sup>/2), and the variance by V(X) = exp(2μ + σ<sup>2</sup>)(exp(σ<sup>2</sup>) − 1). The lognormal is a unimodal distribution and the unique mode is reached at x<sub>mode</sub> = exp(μ − σ<sup>2</sup>). By comparing the mean and the mode, we note that for a fixed μ, an increasing σ shifts the mode toward zero while the mean increases. The quantile function is defined as F<sup>−1</sup>(y) = exp(μ + σΦ<sup>−1</sup>(y)), for 0 < y < 1, where Φ<sup>−1</sup>(.) denotes the quantile function of a standard normal distribution.
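These closed-form expressions are easy to check against a simulated sample; the sketch below compares the theoretical mean, variance, and median (the quantile function evaluated at y = 0.5) with their empirical counterparts for arbitrary parameter values.

```python
import numpy as np

mu, sigma = 0.5, 0.8                     # arbitrary illustrative parameters
rng = np.random.default_rng(5)
x = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

# Closed-form mean, variance and median from Section 2.1
mean_cf = np.exp(mu + sigma**2 / 2)
var_cf = np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1)
median_cf = np.exp(mu)                   # F^{-1}(0.5), since Phi^{-1}(0.5) = 0

print(x.mean(), mean_cf)
print(x.var(), var_cf)
print(np.median(x), median_cf)
```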

Thanks to its relationship to the normal distribution, the likelihood function is given by

$$L(x_1, \dots, x_n | \mu, \sigma^2) = \left(\prod_{i=1}^n \frac{1}{x_i}\right) \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n e^{-\sum_{i=1}^n \frac{(\log x_i - \mu)^2}{2\sigma^2}},\tag{3}$$

and hence the log-likelihood function can be expressed as

$$l(x_1, \dots, x_n | \mu, \sigma^2) = -\sum_{i=1}^n \log x_i - \frac{n}{2} \log 2\pi - n \log \sigma - \sum_{i=1}^n \frac{\left(\log x_i - \mu\right)^2}{2\sigma^2}. \tag{4}$$

The maximum likelihood estimators for the location and the scale are given by μ̂ = (1/n) ∑<sub>i=1</sub><sup>n</sup> log x<sub>i</sub> and σ̂<sup>2</sup> = (1/n) ∑<sub>i=1</sub><sup>n</sup> (log x<sub>i</sub> − μ̂)<sup>2</sup>, respectively.
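Since both estimators are available in closed form, fitting the lognormal model amounts to two sample means on the log scale. A small sketch on a synthetic sample (arbitrary parameter values) follows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.5
x = rng.lognormal(mean=mu, sigma=sigma, size=10_000)   # synthetic LN(mu, sigma^2) sample

# Closed-form maximum likelihood estimators of Section 2.1
log_x = np.log(x)
mu_hat = log_x.mean()                        # (1/n) * sum of log x_i
sigma2_hat = ((log_x - mu_hat) ** 2).mean()  # (1/n) * sum of (log x_i - mu_hat)^2

print(mu_hat, np.sqrt(sigma2_hat))           # should be close to (1.0, 0.5)
```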

The lognormal distribution is widely used to describe natural phenomena. In finance, the Black-Scholes model, a mathematical model for pricing derivative instruments, assumes the underlying asset price to have a lognormal distribution [15]. In economics, income data are often modeled by a lognormal distribution [16], which can be explained as follows: a very low percentage of earners have very low income, average incomes are frequent, and elevated incomes are rare. In actuarial sciences, the law is assumed to fit some types of insurance losses well [17, 18]. In 1931, the French economist and engineer Robert Pierre Louis Gibrat stated that firm size follows a lognormal distribution, as its proportional growth rate is independent of its absolute size. Other applications can be found in biology [19, 20] or in linguistics to model the number of words in a sentence [21].

#### 2.2. The Pareto distribution

The Italian economist and engineer Vilfredo Pareto observed in 1896 that in many populations the power law cx<sup>−α</sup>, for some constant c > 0 and some exponent α > 0, was an appropriate approximation of the number of individuals with income exceeding a given threshold x<sub>0</sub> (see for instance [22, 23]). These power laws assume that small values of x are very frequent, while large occurrences are extremely rare. Their form implies that all power laws with a particular scaling exponent are equivalent up to constant factors since each is simply a scaled version of the others. This produces the linear relationship when logarithms are taken of both f(x) and x, which denotes the signature of power laws. Such distributions are known as Pareto-type distributions.

The p.d.f. of a r.v. X having a Pareto (type I) distribution with parameters α > 0 and x0 > 0 is given by

$$f(\mathbf{x}; \alpha, \mathbf{x}\_0) = \frac{\alpha}{\mathbf{x}\_0} \left(\frac{\mathbf{x}\_0}{\mathbf{x}}\right)^{\alpha + 1}, \text{ x} \ge \mathbf{x}\_0. \tag{5}$$

The location parameter x<sub>0</sub> represents the lower bound of the data set, and the shape parameter α is called the tail index (or Pareto index) and regulates the tail, as can be seen in Figure 3. Note that a decreasing value of α implies a heavier tail.

Figure 3. Density plots of the Pareto distribution with varying shape parameter α and fixed location parameter x<sup>0</sup> = 1.

The c.d.f. of the Pareto law is given by


$$F(x; \alpha, x_0) = 1 - \left(\frac{x_0}{x}\right)^{\alpha}, \; x \ge x_0. \tag{6}$$

For α > r, the r-th moment of the Pareto distribution is given by E(X<sup>r</sup>) = αx<sub>0</sub><sup>r</sup>/(α − r). The mean and the variance are then, respectively, E(X) = αx<sub>0</sub>/(α − 1) for α > 1 and V(X) = αx<sub>0</sub><sup>2</sup>/((α − 1)<sup>2</sup>(α − 2)) for α > 2. The quantile function is expressed as F<sup>−1</sup>(y) = x<sub>0</sub>(1 − y)<sup>−1/α</sup>, for 0 < y < 1. Being a unimodal law, the Pareto distribution reaches its peak at x<sub>mode</sub> = x<sub>0</sub>. As x<sub>0</sub> represents the minimum value of x, its estimation is straightforward: x̂<sub>0</sub> = min<sub>i=1,…,n</sub> x<sub>i</sub>. The likelihood function is given by

$$L(x_1, \dots, x_n | \alpha, x_0) = \alpha^n x_0^{n\alpha} \prod_{i=1}^n \left(\frac{1}{x_i}\right)^{\alpha+1},$$

and to estimate the parameter α, we maximize the following log-likelihood function

$$l(x_1, \dots, x_n | \alpha, x_0) = n \log \alpha + n\alpha \log x_0 - (\alpha + 1) \sum_{i=1}^n \log x_i, \tag{7}$$

which yields the maximum likelihood estimator α̂ = n / ∑<sub>i=1</sub><sup>n</sup> log(x<sub>i</sub>/x̂<sub>0</sub>). Let us note that the maximum likelihood estimator of the tail index α corresponds to the popular Hill estimator [24], which is an estimator for the extreme value index in extreme value theory. For a review on the Hill estimator, we refer the interested reader to reference [25]. Let us note that often the focus lies more on the power law probability distribution, which is a distribution whose density has approximately the form L(x)x<sup>−α</sup>, where α > 1 and L(x) is a slowly varying function. In many situations, it is convenient to assume a lower bound x<sub>0</sub> from which the law holds. Combining those two cases yields the Pareto-type distributions, also known in extreme value theory as distributions with regularly varying tails.
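Fitting the Pareto law is equally direct: x<sub>0</sub> is estimated by the sample minimum and α by the Hill-type estimator just described. A short sketch on synthetic data (arbitrary parameter values) is given below.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, x0 = 2.5, 1.0
u = rng.uniform(size=10_000)
x = x0 * (1 - u) ** (-1.0 / alpha)     # Pareto(alpha, x0) sample via the quantile function

# Maximum likelihood estimators of Section 2.2 (alpha_hat is the Hill estimator)
x0_hat = x.min()
alpha_hat = x.size / np.log(x / x0_hat).sum()

print(x0_hat, alpha_hat)               # should be close to (1.0, 2.5)
```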

A generalization of the Pareto law is the so-called generalized Pareto distribution, and it regroups the Pareto type I, II, III, and IV distributions. The Pareto type IV contains the other types as special cases, and hence also other size distributions belonging to the different types, such as the Lomax distribution [26]. This latter distribution belongs to the Pareto type II, and its p.d.f. is given by

$$f(\mathbf{x}; \alpha, k) = \frac{\alpha}{k} \left( 1 + \frac{\mathbf{x}}{k} \right)^{-(\alpha + 1)}, \mathbf{x} > 0,\tag{8}$$

with shape parameter α > 0 and scale parameter k > 0. It can be interpreted as a shifted Pareto type I distribution.

A generalization of the Pareto type I distribution is the Stoppa distribution [27], which comes from a power transformation of the Pareto c.d.f. and yields the following p.d.f.

$$f(x; \alpha, \delta, x_0) = \delta \alpha x_0^{\alpha} x^{-(\alpha+1)} \left( 1 - \left(\frac{x}{x_0}\right)^{-\alpha} \right)^{\delta - 1}, \; x > x_0,\tag{9}$$

with shape parameters α > 0, δ > 0 and location parameter x<sup>0</sup> > 0. If δ = 1, we get the Pareto type I distribution. However, if the shape parameter δ > 1, the Stoppa distribution presents a heavier tail than the classical Pareto law.

The Pareto distribution is often used to model fire losses in actuarial sciences [28, 29] as well as in reinsurance to approximate large losses. Originally, it was used to describe the income distribution and the allocation of wealth [22], but nowadays it is also used to model, for instance, areas burnt in forest fires or the file sizes of Internet traffic data [30]. Note that, in general, in empirical applications, the Pareto distribution does not fit for all the values but rather is used to fit their upper tail, i.e., large values. Hence, in order to fit a distribution to all the values, one often uses a composite model (see Section 3) which combines two distributions where one of both is the Pareto law.

#### 2.3. The generalized Lomax distribution

The generalized Lomax (GL) distribution, also known as the exponentiated Lomax distribution, was introduced by Abdul-Moniem and Abdel-Hameed in 2012 [31] by powering the c.d.f. of the Lomax distribution to a positive real number.

The p.d.f. of a r.v. X following a generalized Lomax distribution with parameters a > 0, b > 0, and k > 0 corresponds to

$$f(x; a, b, k) = \frac{ab}{k} \left( 1 + \frac{x}{k} \right)^{-(a+1)} \left( 1 - \left( 1 + \frac{x}{k} \right)^{-a} \right)^{b-1}, \; x > 0. \tag{10}$$

The shape parameter a regulates the heaviness of the tail, as can be seen in Figure 4 and the shape parameter b controls the skewness (see Figure 5). The parameter k is a scale parameter as depicted in Figure 6.

Figure 4. Density plots of the GL distribution with varying shape parameter a and fixed shape parameter b and scale parameter k.

Figure 5. Density plots of the GL distribution with varying shape parameter b and fixed shape parameter a and scale parameter k.

The c.d.f. is expressed as


$$F(\mathbf{x}; a, b, k) = \left( 1 - \left( 1 + \frac{\mathbf{x}}{k} \right)^{-a} \right)^{b}, \mathbf{x} > 0. \tag{11}$$

The moments of order r are given by E(X<sup>r</sup>) = bk<sup>r</sup> ∑<sub>i=0</sub><sup>r</sup> C(r, i)(−1)<sup>i</sup> B(1 − (r − i)/a, b), where C(r, i) denotes the binomial coefficient and B(·, ·) the Beta function, yielding E(X) = bk B(1 − 1/a, b) − k for the mean and V(X) = bk<sup>2</sup>[B(1 − 2/a, b) − b B(1 − 1/a, b)<sup>2</sup>] for the variance. The inverse c.d.f. is given by F<sup>−1</sup>(y) = k((1 − y<sup>1/b</sup>)<sup>−1/a</sup> − 1), for 0 < y < 1, and the unique mode is reached at x<sub>mode</sub> = k(((ab + 1)/(a + 1))<sup>1/a</sup> − 1).

Figure 6. Density plots of the GL distribution with varying scale parameter k and fixed shape parameters a and b.

The likelihood function is given by

$$L(\mathbf{x}\_1, \dots, \mathbf{x}\_n | a, b, k) = \left(\frac{ab}{k}\right)^n \prod\_{i=1}^n \left(1 + \frac{\mathbf{x}\_i}{k}\right)^{-(a+1)} \prod\_{i=1}^n \left(1 - \left(1 + \frac{\mathbf{x}\_i}{k}\right)^{-a}\right)^{b-1} \tag{12}$$

and hence the following log-likelihood function is obtained

$$l(x_1, \dots, x_n|a,b,k) = n \log \frac{ab}{k} - (a+1) \sum_{i=1}^n \log \left( 1 + \frac{x_i}{k} \right) + (b-1) \sum_{i=1}^n \log \left( 1 - \left( 1 + \frac{x_i}{k} \right)^{-a} \right). \tag{13}$$

The calculated score functions are expressed by

$$\frac{\partial l(x_1,\dots,x_n|a,b,k)}{\partial a} = \frac{n}{a} - \sum_{i=1}^n \log\left(1 + \frac{x_i}{k}\right) + (b-1) \sum_{i=1}^n \frac{\left(1 + \frac{x_i}{k}\right)^{-a} \log\left(1 + \frac{x_i}{k}\right)}{1 - \left(1 + \frac{x_i}{k}\right)^{-a}},$$

$$\frac{\partial l(\mathbf{x}\_1,\dots,\mathbf{x}\_n|a,b,k)}{\partial b} = \frac{n}{b} + \sum\_{i=1}^n \log\left(1 - \left(1 + \frac{\mathbf{x}\_i}{k}\right)^{-a}\right),\tag{14}$$

and

$$\frac{\partial l(\mathbf{x}\_1,\dots,\mathbf{x}\_n|a,b,k)}{\partial k} = -\frac{n}{k} + \frac{a+1}{k^2} \sum\_{i=1}^n \frac{\mathbf{x}\_i}{1 + \frac{\mathbf{x}\_i}{k}} - \frac{a(b-1)}{k^2} \sum\_{i=1}^n \frac{\left(1 + \frac{\mathbf{x}\_i}{k}\right)^{-(a+1)} \mathbf{x}\_i}{1 - \left(1 + \frac{\mathbf{x}\_i}{k}\right)^{-a}},\tag{15}$$

which have to be solved numerically by equating them to 0 in order to find the estimated parameters.
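In practice one rarely solves the score equations by hand; the log-likelihood (13) is simply maximized numerically. The sketch below does this with a general-purpose optimizer on a synthetic GL sample generated through the inverse c.d.f.; optimizing over the logarithms of a, b, and k is merely a convenience that keeps the parameters positive and is not part of the original formulation.

```python
import numpy as np
from scipy.optimize import minimize

def gl_negloglik(params, x):
    """Negative log-likelihood of the generalized Lomax model, Eq. (13)."""
    a, b, k = np.exp(params)               # optimize on the log scale so a, b, k > 0
    u = 1.0 + x / k
    return -(x.size * np.log(a * b / k)
             - (a + 1) * np.log(u).sum()
             + (b - 1) * np.log1p(-u**-a).sum())

rng = np.random.default_rng(4)
a_true, b_true, k_true = 2.0, 1.5, 1.0
y = rng.uniform(size=5_000)
x = k_true * ((1 - y**(1.0 / b_true)) ** (-1.0 / a_true) - 1)   # GL sample via the inverse c.d.f.

res = minimize(gl_negloglik, x0=np.log([1.0, 1.0, 1.0]), args=(x,), method="Nelder-Mead")
print(np.exp(res.x))       # should be roughly (2.0, 1.5, 1.0)
```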

Figure 7. Density plots of the GEV distribution with varying location parameter μ.

The Lomax distribution is used to model income data, wealth allocation, and actuarial claim sizes [10]. The GL distribution has been used to model the breaking stress of carbon fibers [32], the survival times of patients undergoing chemotherapy treatment [33], and the number of successive failures of the air-conditioning systems of airplanes [33].

#### 2.4. The generalized extreme value distribution

The generalized extreme value (GEV) distribution is well known in extreme value theory as it combines the Gumbel, Fréchet, and Weibull distributions, which are also known as the type I, II, and III extreme value distributions. The GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables, a result due to the extremal types theorem of Fisher and Tippett [34]. Therefore, the GEV distribution is also known as the Fisher-Tippett distribution in extreme value theory.

The GEV distribution has p.d.f.

$$f(\mathbf{x}; \mu, \sigma, k) = \frac{1}{\sigma} \left( 1 + k \left( \frac{\mathbf{x} - \mu}{\sigma} \right) \right)^{-\frac{1}{k}-1} e^{-\left( 1 + k \left( \frac{\mathbf{x} - \mu}{\sigma} \right) \right)^{-\frac{1}{k}}},\tag{16}$$

if $1 + k\left(\frac{x-\mu}{\sigma}\right) > 0$, with location parameter $\mu \in \mathbb{R}$ (see Figure 7), scale parameter $\sigma > 0$ (see Figure 8), and shape parameter $k \in \mathbb{R}$, which governs the shape and the heaviness of the tail of the distribution, as can be seen in Figure 9.

Its c.d.f. is given by


$$F(\mathbf{x}; \mu, \sigma, k) = e^{-\left(1 + k\left(\frac{\mathbf{x} - \mu}{\sigma}\right)\right)^{-\frac{1}{k}}},\tag{17}$$

if $1 + k\left(\frac{x-\mu}{\sigma}\right) > 0$. The mean is given by $E(X) = \mu - \frac{\sigma}{k} + \frac{\sigma}{k}\,\Gamma(1-k)$ and the variance by $V(X) = \frac{\sigma^2}{k^2}\left[\Gamma(1-2k) - \Gamma(1-k)^2\right]$, both expressed in terms of the Gamma function. The quantile function is given by $F^{-1}(y) = \mu + \frac{\sigma}{k}\left[(-\log y)^{-k} - 1\right]$, for $0 < y < 1$, and the unique mode is reached at $x_{\text{mode}} = \mu + \frac{\sigma}{k}\left[(1+k)^{-k} - 1\right]$. Depending on the value of the parameter k, the GEV reduces to one of the following special cases: if k = 0, we obtain the Gumbel distribution; if k > 0, we get the Fréchet distribution; and if k < 0, the Weibull distribution is obtained. The parameters of the GEV distribution are estimated using the maximum likelihood approach.

Figure 8. Density plots of the GEV distribution with varying scale parameter σ.

Figure 9. Density plots of the GEV distribution with varying shape parameter k.
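As a practical aside (ours, not the chapter's), the GEV family is available in SciPy as scipy.stats.genextreme; note that SciPy's shape parameter c corresponds to −k in the parametrization of Eqs. (16)–(17).

```python
import numpy as np
from scipy.stats import genextreme

# placeholder data: block maxima of a lognormal sample, just to have something to fit
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 50)).max(axis=1)

c_hat, mu_hat, sigma_hat = genextreme.fit(data)   # maximum likelihood fit
k_hat = -c_hat                                    # shape parameter in the chapter's notation
print(mu_hat, sigma_hat, k_hat)
```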

We will now focus more closely on one of the three GEV special cases, namely, the Weibull distribution, which got its name from the Swedish engineer and scientist Waloddi Weibull, who analyzed it in detail in 1951. We take a look at this law as it is often used for size-type data and is considered an alternative to the lognormal distribution for the construction of composite models (see Section 3). The Weibull distribution belongs to the class of power laws with an exponential cut-off; that is, it is a power law multiplied by an exponential function. In these distributions, the exponential decay term overpowers the power-law behavior for very large values.

The p.d.f. of the Weibull distribution is given by

$$f(\mathbf{x}; \sigma, \tau) = \frac{\tau}{\sigma} \left(\frac{\mathbf{x}}{\sigma}\right)^{\tau - 1} e^{-\left(\frac{\mathbf{x}}{\sigma}\right)^{\tau}}, \ x \ge 0,\tag{18}$$

with shape parameter τ > 0 governing the heaviness of the tail and scale parameter σ > 0. The distribution has c.d.f.

$$F(\mathbf{x}; \sigma, \tau) = 1 - e^{-\left(\frac{\mathbf{x}}{\sigma}\right)^{\tau}}, \ \mathbf{x} \ge 0. \tag{19}$$

The quantile function is given by $F^{-1}(y) = \sigma\left(-\log(1-y)\right)^{1/\tau}$, for $0 < y < 1$. The r-th moment is given by $E(X^r) = \sigma^r\, \Gamma\!\left(1 + \frac{r}{\tau}\right)$. Hence, the expectation and the variance are expressed as $E(X) = \sigma\, \Gamma\!\left(1 + \frac{1}{\tau}\right)$ and $V(X) = \sigma^2 \left[ \Gamma\!\left(1 + \frac{2}{\tau}\right) - \Gamma\!\left(1 + \frac{1}{\tau}\right)^2 \right]$, respectively. The Weibull distribution is unimodal; for $\tau > 1$ the mode is reached at $x_{\text{mode}} = \sigma\left(\frac{\tau-1}{\tau}\right)^{1/\tau}$ and for $\tau = 1$ the mode is reached at 0.

The parameters of the Weibull distribution are estimated via the maximum likelihood method. The corresponding likelihood and log-likelihood functions are given respectively by

$$L(\mathbf{x}\_1, \dots, \mathbf{x}\_n | \sigma, \tau) = \frac{\tau^n}{\sigma^n} \prod\_{i=1}^n \left(\frac{\mathbf{x}\_i}{\sigma}\right)^{\tau - 1} e^{-\sum\_{i=1}^n \left(\frac{\mathbf{x}\_i}{\sigma}\right)^{\tau}},\tag{20}$$

and


$$l(\mathbf{x}\_1, \dots, \mathbf{x}\_n | \sigma, \tau) = n \log \tau - n \tau \log \sigma + (\tau - 1) \sum\_{i=1}^n \log(\mathbf{x}\_i) - \sum\_{i=1}^n \left(\frac{\mathbf{x}\_i}{\sigma}\right)^{\tau}. \tag{21}$$

The maximum likelihood estimator for the scale parameter σ, given τ, is $\hat{\sigma}^{\tau} = \frac{1}{n}\sum_{i=1}^{n} x_i^{\tau}$, and the maximum likelihood estimator for the shape parameter τ is given by an implicit equation which has to be solved numerically: $\hat{\tau}^{-1} = \frac{\sum_{i=1}^{n} x_i^{\tau} \log(x_i)}{\sum_{i=1}^{n} x_i^{\tau}} - \frac{1}{n}\sum_{i=1}^{n} \log(x_i)$.
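This two-step computation is easy to carry out with a one-dimensional root finder. A small sketch of ours (function name and bracketing interval are arbitrary choices):

```python
import numpy as np
from scipy.optimize import brentq

def weibull_mle(x):
    """Solve the implicit ML equation for tau, then recover sigma from sigma^tau = mean(x^tau)."""
    logx = np.log(x)

    def implicit_eq(tau):
        xt = x ** tau
        return np.sum(xt * logx) / np.sum(xt) - 1.0 / tau - logx.mean()

    tau_hat = brentq(implicit_eq, 1e-3, 1e3)              # root of the implicit equation
    sigma_hat = np.mean(x ** tau_hat) ** (1.0 / tau_hat)  # closed-form scale estimate
    return sigma_hat, tau_hat

# usage: sigma_hat, tau_hat = weibull_mle(data) for a positive data vector `data`
```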

In risk management, finance, and insurance, the risk measure "Value at Risk" is assessed by considering the GEV distribution [35, 36]. GEV distributions are also used in hydrology [37, 38], telecommunications [39], and meteorology [40]. In materials science, the Weibull distribution is widely used thanks to its flexibility [41]. Other examples include wind speed distributions [42], forecasting technological change [43], the size of reinsurance claims [10], hydrology [9], and areas burnt in forest fires [44].

#### 3. Composite models

Given the wealth of distinct size distributions, as can be seen from the small sample of size distributions described in the previous section, the practitioner is often confronted with the following question: Which size distribution should he/she use in which situation? The variation in the shapes that size distributions can take, for instance between the Pareto distribution and the lognormal distribution, renders the choice very complicated in practice.

For example, insurance companies sometimes face losses that emerge from a combination of moderate and large claims. In order to model large losses, the Pareto distribution seems to be the size distribution favored by practitioners. However, when losses consist of smaller values with high frequencies and larger losses with low frequencies, the lognormal or the Weibull distribution is preferred [45]. Nevertheless, no classical size distribution provides an acceptable fit for both small and large losses. On the one hand, the Pareto fits the tail well; on the other hand, the lognormal and Weibull distributions produce a good overall fit but fit the tail badly.

A solution to this dilemma comes from the composite parametric models introduced in 2005 by Cooray and Ananda [46]. The idea of the composite models is to join together two weighted distributions at a given threshold value. In statistical terms, let X be a r.v. and denote by f1(.) the p.d.f. of the first distribution and by f2(.) the p.d.f. of the second distribution. Let F1(.) and F2(.) be the corresponding c.d.f., respectively. Scollnik [47] noticed that the p.d.f. of a composite model can then be expressed as

$$f(\mathbf{x}) = \begin{cases} c f\_1^\*(\mathbf{x}), & -\infty < \mathbf{x} \le \theta \\ (1 - c) f\_2^\*(\mathbf{x}), \theta < \mathbf{x} < \infty, \end{cases} \tag{22}$$

where c is a normalization constant in [0,1], θ represents the threshold value, $f_1^*(\mathbf{x}) = \frac{f_1(\mathbf{x})}{F_1(\theta)}$ for $-\infty < \mathbf{x} \le \theta$, and $f_2^*(\mathbf{x}) = \frac{f_2(\mathbf{x})}{1 - F_2(\theta)}$ for $\theta < \mathbf{x} < \infty$. In our setting, the considered composite models piece together two different size distributions with different shapes and tail-weights at a specific threshold. As size distributions are only defined for positive values, the p.d.f. of a composite model is rewritten as

$$f(\mathbf{x}) = \begin{cases} c f\_1^\*(\mathbf{x}), & 0 < \mathbf{x} \le \theta \\ (1 - c) f\_2^\*(\mathbf{x}), & \theta < \mathbf{x} < \infty, \end{cases} \tag{23}$$

where 0 ≤ c ≤ 1. The composite model can also be interpreted as a two-component mixture model with mixing weights c and (1 − c). Hence, it can be seen as a convex sum of two density functions, $f(\mathbf{x}) = c f_1^*(\mathbf{x}) + (1 - c) f_2^*(\mathbf{x})$, as noted in [47].

As we have a threshold that cuts the composite model distribution into two, from a mathematical point of view we need continuity and differentiability conditions at the threshold to yield a smooth density function. In order to make f(x) continuous, the condition $f(\theta^-) = f(\theta^+)$ is imposed, which yields

$$c = \frac{f\_2(\theta) F\_1(\theta)}{f\_2(\theta) F\_1(\theta) + f\_1(\theta) \left(1 - F\_2(\theta)\right)}.\tag{24}$$

The differentiability condition at the threshold value is given by $f'(\theta^-) = f'(\theta^+)$ and yields


$$c = \frac{f\_2'(\theta) F\_1(\theta)}{f\_2'(\theta) F\_1(\theta) + f\_1'(\theta) \left(1 - F\_2(\theta)\right)}. \tag{25}$$

If we combine the two results for the normalization constant c, we obtain the additional restriction for θ, i.e., $\frac{f_1(\theta)}{f_2(\theta)} = \frac{f_1'(\theta)}{f_2'(\theta)}$. Let us remark that reference [48] uses a mode-matching procedure instead and states that it gives a much simpler derivation of the model and allows for an easier implementation with any distribution whose mode has a closed-form expression. Instead of the threshold value θ, they use the modal value $x_m$. Denote by $x_{m_1}$ and $x_{m_2}$ the modes of the distributions used as the first and second components of the composite model; the mode-matching conditions are then $x_{m_1} = x_{m_2}$ and $f_1^*(x_{m_1}) = f_2^*(x_{m_2})$. The latter implies the continuity condition, and the former equality allows dropping the labels 1 and 2, which yields the following condition

$$c = \frac{f\_2(x\_m) F\_1(x\_m)}{f\_2(x\_m) F\_1(x\_m) + f\_1(x\_m) \left(1 - F\_2(x\_m)\right)}. \tag{26}$$

Remark that the derivative at the mode is 0, hence the differentiability condition is satisfied.
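To make the construction concrete, the following sketch (ours; the helper names and the example components are our assumptions, not the authors' code) computes the weight c from the continuity condition (24) and evaluates the composite density (23) for two given size distributions.

```python
import numpy as np
from scipy import stats

def composite_weight(f1, F1, f2, F2, theta):
    """Normalization constant c from the continuity condition, Eq. (24)."""
    num = f2(theta) * F1(theta)
    return num / (num + f1(theta) * (1.0 - F2(theta)))

def composite_pdf(x, f1, F1, f2, F2, theta):
    """Composite density of Eq. (23): re-weighted f1 below theta, re-weighted f2 above."""
    c = composite_weight(f1, F1, f2, F2, theta)
    x = np.asarray(x, dtype=float)
    body = c * f1(x) / F1(theta)
    tail = (1.0 - c) * f2(x) / (1.0 - F2(theta))
    return np.where(x <= theta, body, tail)

# example: lognormal body with a Lomax (Pareto type II) tail above theta = 2
ln, lx = stats.lognorm(s=0.8), stats.lomax(c=1.5)
print(composite_pdf([0.5, 1.0, 3.0], ln.pdf, ln.cdf, lx.pdf, lx.cdf, theta=2.0))
```

Here θ is treated as given; in the chapter it is tied to the remaining parameters through the differentiability restriction or the mode-matching condition (26).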

The c.d.f. of a composite model of size distributions is given by


$$F(\mathbf{x}) = \begin{cases} c \dfrac{F\_1(\mathbf{x})}{F\_1(\theta)}, & 0 < \mathbf{x} \le \theta \\[2ex] c + (1 - c) \dfrac{F\_2(\mathbf{x}) - F\_2(\theta)}{1 - F\_2(\theta)}, & \theta < \mathbf{x} < \infty. \end{cases} \tag{27}$$

The moments of the r-th order can be expressed using the formula $E_r(f) = c\, E_r(f_1^*) + (1 - c)\, E_r(f_2^*)$.

Statistical inference for composite models is done using the classical maximum likelihood (ML) estimation approach. The ML estimation for composite models was first presented in [46] and as well in [49]. In order to apply the ML approach, we have to know the integer value m such that the unknown threshold parameter θ lies between the m-th and the (m+1)-th observation. If we knew the value of the integer m, we would be able to write out the likelihood function explicitly. Unfortunately, however, we do not know the exact value of m, and as m changes, the ML estimation changes. Therefore, the following ML estimation algorithm was proposed, where the model has s parameters $\rho_i$ for i = 1, …, s. In a first step, for each integer m = 1, …, n − 1 we estimate the parameters as the solution of the following ML system

$$\begin{cases} \frac{\partial \log L}{\partial \rho\_i} = 0, \ i = 1, \ldots, s, \\\\ \frac{\partial \log L}{\partial \theta} = 0. \end{cases} \tag{28}$$

If the inequality $x_m \le \hat{\theta} \le x_{m+1}$ holds, then the ML estimators can be denoted as $\hat{\theta}$ and $\hat{\rho}_i$ for i = 1, …, s. However, a second step is needed in case the first step does not provide any satisfying result, meaning that we are in one of the following two settings: m = n or m = 0. This implies that the use of $f_1$ or $f_2$ alone is recommended for the likelihood function, respectively. For the ML procedure, one needs to check n − 1 intervals; thus, the computing time strongly depends on the magnitude of n. For large n this leads to a complex system of equations that must be solved numerically.
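A rough sketch of the first step (ours; `negloglik` stands for a user-supplied negative log-likelihood of the chosen composite model and is not defined in the chapter) profiles the likelihood over the n − 1 candidate intervals:

```python
import numpy as np
from scipy.optimize import minimize

def fit_composite_ml(x, negloglik, init_params):
    """Step 1: for each interval [x_(m), x_(m+1)] maximize the likelihood and keep the best
    solution whose estimated threshold falls inside the interval.

    negloglik(params, theta, x) is assumed to return the negative log-likelihood and
    init_params a starting value for the remaining parameters rho_1, ..., rho_s."""
    xs = np.sort(x)
    best = None
    for m in range(1, len(xs)):                  # candidate intervals, m = 1, ..., n - 1
        lo, hi = xs[m - 1], xs[m]
        obj = lambda p: negloglik(p[:-1], np.clip(p[-1], lo, hi), x)
        res = minimize(obj, np.r_[init_params, 0.5 * (lo + hi)], method="Nelder-Mead")
        theta_hat = np.clip(res.x[-1], lo, hi)
        if lo <= theta_hat <= hi and (best is None or res.fun < best[0]):
            best = (res.fun, res.x[:-1], theta_hat)
    return best    # (negative log-likelihood, rho_hat, theta_hat), or None if step 2 is needed
```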

In reference [50], the authors propose an alternative algorithm based on quantiles and a moment matching approach. In a first step, let us denote by $q_1$ and $q_3$ the first and third empirical quartiles of the data sample. We assume that $q_1 \le \theta \le q_3$. Then we use the method of moments to match the first s − 1 empirical moments with their theoretical counterparts, and we add two more equations from matching two quartiles

$$\begin{cases} c \frac{F\_1(q\_1)}{F\_1(\theta)} = 0.25, \\\\ c + (1 - c) \frac{F\_2(q\_3) - F\_2(\theta)}{1 - F\_2(\theta)} = 0.75. \end{cases} \tag{29}$$

If no result is obtained, we move to a second step where we assume that the first and third quartiles are smaller than the threshold θ, and proceed like in the first step except using now the following two quartiles' equations

$$\begin{cases} c \frac{F\_1(q\_1)}{F\_1(\theta)} = 0.25, \\\\ c \frac{F\_1(q\_3)}{F\_1(\theta)} = 0.75. \end{cases} \tag{30}$$

If we still have no solution, we finally assume that the first and third quartiles are greater than θ and proceed again in a similar fashion as in the first step with the two equations

$$\begin{cases} c + (1 - c) \frac{F\_2(q\_1) - F\_2(\theta)}{1 - F\_2(\theta)} = 0.25, \\\\ c + (1 - c) \frac{F\_2(q\_3) - F\_2(\theta)}{1 - F\_2(\theta)} = 0.75. \end{cases} \tag{31}$$

Let us remark that those equations have to be solved numerically. Note that once we have a solution from this quantile and moment matching procedure, we can use the ML approach explained above to improve the result as now we have some a priori information on the parameter θ and hence on the integer m.
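As a minimal illustration of the first step of this matching procedure (ours; the component distributions are treated as already known and the moment equations are omitted), the two quartile equations (29) can be handed to a standard root finder:

```python
import numpy as np
from scipy import stats
from scipy.optimize import fsolve

ln, lx = stats.lognorm(s=0.8), stats.lomax(c=1.5)   # assumed, already-fitted components
q1, q3 = 1.1, 4.2                                   # empirical quartiles (placeholder values)

def quartile_eqs(z):
    c, theta = z
    eq1 = c * ln.cdf(q1) / ln.cdf(theta) - 0.25                                    # Eq. (29), first line
    eq2 = c + (1 - c) * (lx.cdf(q3) - lx.cdf(theta)) / (1 - lx.cdf(theta)) - 0.75  # Eq. (29), second line
    return [eq1, eq2]

c_hat, theta_hat = fsolve(quartile_eqs, x0=[0.5, 0.5 * (q1 + q3)])
print(c_hat, theta_hat)
```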

In general, in the area of size distributions, composite models comprise a lognormal or Weibull distribution up to a given threshold value and some form of the Pareto distribution thereafter. The obtained models are close in shape to the lognormal or Weibull law but with a thicker tail due to the Pareto distribution, see Figure 10 and Figure 11.

Figure 10. Density plot of the composite lognormal-Pareto model with θ = 0.55 and α = 0.5.


Figure 11. Density plot of the composite Weibull-Pareto model with θ = 0.55 and τ = 1.42867.

This research area for size distributions was initiated by Cooray and Ananda in 2005 [46], who proposed the composite lognormal-Pareto model. They suggested that this composite model may be better suited for insurers confronted with smaller losses with high frequencies as well as larger values with lower frequencies. The lognormal-Pareto composite model introduced in reference [46] was further enhanced by Scollnik [47]. In that paper, the author noticed that the two-component composite model is very restrictive since it has fixed and a priori known mixing weights. Hence, he improved the model by using unrestricted mixing weights as coefficients of each component. In a similar way, the article [51] improves the composite Weibull-Pareto model proposed in reference [52]. Those are the composite models that will be described in more detail in the sequel. The papers [47] and [51] consider, besides the classical Pareto distribution, the Pareto type II distribution, known also as the Lomax distribution, as an alternative above the threshold value. In 2013, Teodorescu and Vernic [50] replaced the lognormal distribution by an arbitrary continuous distribution, analyzed in detail the composite Weibull-Pareto and the composite Gamma-Pareto models, and also used the Lomax distribution as an alternative to the Pareto distribution above the threshold point. The same authors had already suggested the composite exponential-Pareto model [50]. More recently, reference [48] proposes a composite model based on the Stoppa distribution [27], which is a generalization of the Pareto law. More precisely, they propose the lognormal-Stoppa and Weibull-Stoppa composite models.

Let us now take a closer look at the composite lognormal-Pareto and Weibull-Pareto models. Given the general formulas above we can write the density for the composite lognormal-Pareto as

$$f(\mathbf{x}) = \begin{cases} c\, \dfrac{\frac{1}{\mathbf{x}\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\log \mathbf{x} - \mu)^2}{2\sigma^2}}}{\Phi\left(\frac{\log\theta - \mu}{\sigma}\right)}, & 0 < \mathbf{x} \le \theta \\[2ex] (1 - c)\, \dfrac{\alpha}{\theta} \left(\dfrac{\theta}{\mathbf{x}}\right)^{\alpha + 1}, & \theta < \mathbf{x} < \infty, \end{cases} \tag{32}$$

with 0 ≤ c ≤ 1 and Φ(.) denoting the c.d.f. of a standard normal distribution. In a similar way, the p.d.f. for the composite Weibull-Pareto can be written as

$$f(\mathbf{x}) = \begin{cases} c\, \dfrac{\frac{\tau}{\sigma} \left(\frac{\mathbf{x}}{\sigma}\right)^{\tau - 1} e^{-\left(\frac{\mathbf{x}}{\sigma}\right)^{\tau}}}{1 - e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}}, & 0 < \mathbf{x} \le \theta \\[2ex] (1 - c)\, \dfrac{\alpha}{\theta} \left(\dfrac{\theta}{\mathbf{x}}\right)^{\alpha + 1}, & \theta < \mathbf{x} < \infty, \end{cases} \tag{33}$$

with 0 ≤ c ≤ 1.

By verifying the continuity and differentiability conditions at the threshold point θ, we obtain for the composite lognormal-Pareto model:

$$c = \frac{\frac{\alpha}{\theta}\, \Phi\left(\frac{\log\theta - \mu}{\sigma}\right)}{\frac{\alpha}{\theta}\, \Phi\left(\frac{\log\theta - \mu}{\sigma}\right) + \frac{1}{\theta\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\log\theta - \mu)^2}{2\sigma^2}}}\tag{34}$$

and

$$
\alpha \sigma = \frac{\log \theta - \mu}{\sigma}.\tag{35}
$$

These conditions guarantee that the p.d.f. of the composite lognormal-Pareto is continuous and smooth at the threshold value θ. The continuity and differentiability conditions at θ, for the composite Weibull-Pareto, yield:
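A direct implementation helps to fix ideas. The sketch below (ours; σ is an arbitrary choice, while θ = 0.55 and α = 0.5 are the values used in Figure 10) evaluates the composite lognormal-Pareto density (32), with c obtained from the continuity condition (34) and μ from the smoothness restriction (35).

```python
import numpy as np
from scipy.stats import norm

def lognormal_pareto_pdf(x, theta, alpha, sigma):
    """Composite lognormal-Pareto density, Eq. (32), with mu and c fixed by Eqs. (35) and (34)."""
    mu = np.log(theta) - alpha * sigma ** 2                  # smoothness condition (35)
    z = (np.log(theta) - mu) / sigma                         # equals alpha * sigma
    f1_theta = np.exp(-0.5 * z ** 2) / (theta * np.sqrt(2 * np.pi) * sigma)
    c = (alpha / theta) * norm.cdf(z) / ((alpha / theta) * norm.cdf(z) + f1_theta)  # Eq. (34)

    x = np.asarray(x, dtype=float)
    body = c * norm.pdf((np.log(x) - mu) / sigma) / (x * sigma * norm.cdf(z))
    tail = (1.0 - c) * (alpha / theta) * (theta / x) ** (alpha + 1)
    return np.where(x <= theta, body, tail)

print(lognormal_pareto_pdf([0.3, 0.55, 1.5], theta=0.55, alpha=0.5, sigma=0.7))
```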


$$c = \frac{\frac{\alpha}{\theta} \left(1 - e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}\right)}{\frac{\alpha}{\theta} \left(1 - e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}\right) + \frac{\tau}{\sigma} \left(\frac{\theta}{\sigma}\right)^{\tau-1} e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}} \tag{36}$$

and


$$\left(\frac{\theta}{\sigma}\right)^{\tau} = \frac{\alpha}{\tau} + 1.\tag{37}$$

These conditions guarantee the continuity and smoothness of the p.d.f. of the composite Weibull-Pareto at the threshold point θ.

The c.d.f. of the composite lognormal-Pareto and Weibull-Pareto is given, respectively, by

$$F(\mathbf{x}) = \begin{cases} c\, \dfrac{\Phi\left(\frac{\log \mathbf{x} - \mu}{\sigma}\right)}{\Phi\left(\frac{\log\theta - \mu}{\sigma}\right)}, & 0 < \mathbf{x} \le \theta \\[2ex] c + (1 - c) \left(1 - \left(\dfrac{\theta}{\mathbf{x}}\right)^{\alpha}\right), & \theta < \mathbf{x} < \infty, \end{cases} \tag{38}$$

and

$$F(\mathbf{x}) = \begin{cases} c\, \dfrac{1 - e^{-\left(\frac{\mathbf{x}}{\sigma}\right)^{\tau}}}{1 - e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}}, & 0 < \mathbf{x} \le \theta \\[2ex] c + (1 - c) \left(1 - \left(\dfrac{\theta}{\mathbf{x}}\right)^{\alpha}\right), & \theta < \mathbf{x} < \infty. \end{cases} \tag{39}$$

Finally, the moments of order r of the composite lognormal-Pareto and Weibull-Pareto are given by

$$E(X^r) = c\, \frac{e^{r\mu + \frac{r^2\sigma^2}{2}}\, \Phi\left(\frac{\log\theta - \mu - r\sigma^2}{\sigma}\right)}{\Phi\left(\frac{\log\theta - \mu}{\sigma}\right)} + (1 - c)\, \frac{\alpha\theta^r}{\alpha - r} \tag{40}$$

and

$$E(X^r) = c\, \frac{\sigma^r\, \gamma\!\left(\frac{r}{\tau} + 1,\; \left(\frac{\theta}{\sigma}\right)^{\tau}\right)}{1 - e^{-\left(\frac{\theta}{\sigma}\right)^{\tau}}} + (1 - c)\, \frac{\alpha\theta^r}{\alpha - r} \tag{41}$$

for α > r, respectively, where γ(·,·) denotes the lower incomplete Gamma function.

To estimate the composite lognormal-Pareto and the composite Weibull-Pareto models, the algorithms described above are used.

## 4. Applications

In this section, we focus on two applications to real data sets, one from actuarial sciences, dealing with fire losses and one on Internet traffic data. We will analyze these two data sets with the size distributions seen in Section 2 and the two composite models, namely, the lognormal-Pareto and the Weibull-Pareto, seen in Section 3. In order to compare the distributions, we used the following three criteria:


1. The maximum log-likelihood (MLL) value: the larger the value, the better the fit of the distribution to the data set.

2. The Akaike information criterion (AIC):

$$\text{AIC} = 2p - 2\,\text{MLL},$$

where p represents the number of parameters to estimate. This criterion represents a measure of the relative quality of a distribution given a set of laws. The distribution with the lowest AIC value is preferred.

3. The Bayesian information criterion (BIC):

$$\text{BIC} = p \log n - 2\,\text{MLL},$$

where n represents the length of the data set and p the number of parameters to estimate. This criterion is used to choose a distribution among a finite set of laws. The distribution with the lowest BIC is preferred.

The AIC and BIC give a trade-off between a reward for a good goodness-of-fit performance and a penalty for an increasing number of parameters to estimate. The BIC tends to favor more parsimonious models than does the AIC.

We carried out the calculations with Wolfram Mathematica 10. To calculate the MLL, AIC, and BIC values for the size distributions of Section 2, we used the function NMaximize with the numerical maximization method RandomSearch, enhanced with the InteriorPoint option. For the composite lognormal-Pareto and the composite Weibull-Pareto models, we used the estimation algorithms described in Section 3.
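The same comparison can be sketched with open-source tools. The snippet below is our own illustration (not the chapter's Mathematica code): it fits a set of candidate SciPy distributions by maximum likelihood and ranks them by AIC and BIC. Note that SciPy adds location/scale parameters, so the parameter counts need not coincide with the p values reported in Tables 1 and 3.

```python
import numpy as np
from scipy import stats

def compare_fits(x, candidates):
    """Fit each candidate scipy.stats distribution by ML and report MLL, AIC and BIC."""
    n, rows = len(x), []
    for name, dist in candidates.items():
        params = dist.fit(x)                          # maximum likelihood estimates
        mll = np.sum(dist.logpdf(x, *params))         # maximum log-likelihood
        p = len(params)
        rows.append((name, p, mll, 2 * p - 2 * mll, p * np.log(n) - 2 * mll))
    return sorted(rows, key=lambda r: r[3])           # sorted by AIC, smaller is better

candidates = {"lognormal": stats.lognorm, "pareto": stats.pareto,
              "lomax": stats.lomax, "gev": stats.genextreme, "weibull": stats.weibull_min}
# usage: for a vector of losses, e.g. table = compare_fits(losses, candidates)
```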

## 4.1. Danish fire losses

In this example, we analyze a classical insurance data set. This is the set of Danish data on 2492 fire insurance losses in Danish Krone (DK) from the years 1980 to 1990 inclusive. The data set can be found in the "SMPracticals" add-on package for R, available from the CRAN website cran.r-project.org.

The comparison of the considered distributions using the three criteria explained above is presented in Table 1. The estimated values for the fitted distributions are given in Table 2.



Table 1. MLL, AIC, and BIC values for the Danish fire data set.


Table 2. Estimated values for the fitted distributions for the Danish fire data set.

Figure 12. Histogram of the Danish fire data with the fitted density of the composite lognormal-Pareto model.



With a MLL value of -3877.84 and only two parameters, yielding the values AIC = 7759.68 and BIC = 7771.32, the lognormal-Pareto model provides a better fit than the other models for the given data set. A visual conclusion of the fit can be seen in Figure 12.

As the data present a humped shape for the lower values and heavy-tail behavior for the upper values, this example justifies the use and the necessity of the composite lognormal-Pareto model.

This data set has also been analyzed in reference [46], where the composite lognormal-Pareto model was introduced, and reference [51] also applied the Weibull-Pareto model to this data set. The results we obtain above coincide with theirs.

#### 4.2. Internet traffic data

In the second empirical illustration, we analyze Internet traffic data, which have already been analyzed from a Bayesian point of view in references [53] and [54]. This data set consists of 3143 observations of transferred bytes per second, recorded over consecutive seconds.

Based on the MLL, AIC, and BIC values reported in Table 3, we conclude that, among the considered laws, the lognormal distribution provides the best fit, closely followed by the GL and the GEV distributions. The two considered composite models do not provide good fits for this example. The estimated values for the fitted densities are given in Table 4.


Figure 13 provides visual confirmation of the good fit of the lognormal distribution.

Table 3. MLL, AIC, and BIC values for the Internet traffic data.

| Distribution | p | MLL | AIC | BIC |
|---|---|---|---|---|
| Lognormal | 2 | –39582.2 | 79168.4 | 79180.5 |
| Pareto | 2 | –43031.7 | 86067.4 | 86079.5 |
| GL | 3 | –39581.7 | 79169.4 | 79187.6 |
| GEV | 3 | –39608.4 | 79222.8 | 79241.0 |
| Lognormal-Pareto | 2 | –40098.4 | 80200.8 | 80212.9 |
| Weibull-Pareto | 2 | –42823.9 | 85651.8 | 85663.9 |

Table 4. Estimated values for the fitted distributions for the Internet traffic data.

| Distribution | Estimated parameters |
|---|---|
| Lognormal | μ̂ = 11.6518, σ̂ = 0.62067 |
| Pareto | α̂ = 0.353628, x̂₀ = 6795 |
| GL | â = 13.6735, b̂ = 4.08831, k̂ = 808429 |
| GEV | μ̂ = 94465, σ̂ = 54467.5, k̂ = 0.204602 |
| Lognormal-Pareto | α̂ = 1.05077, θ̂ = 85064.5 |
| Weibull-Pareto | τ̂ = 1.12043, θ̂ = 79366.3 |

Figure 13. Histogram of the Internet traffic data with the fitted lognormal density.

## 5. Conclusion


To sum up, we review in this chapter the notion of size distributions by presenting the best known and most used ones. We further describe the general concept of composite models based on size distributions and present in more detail the composite lognormal-Pareto and the composite Weibull-Pareto models. Besides providing their main statistical properties, we illustrate the size distributions and composite models by applying them to two real data examples to emphasize their use in practice. We compare the goodness-of-fit of the considered distributions using the MLL, AIC, and BIC as criteria. For the first data set, dealing with fire losses, we find that the composite lognormal-Pareto model performs best, hinting at the usefulness of composite models in this research area. However, for the second data set, on Internet traffic, the simple lognormal distribution outperforms the other size distributions and the composite models. This shows how delicate the choice is for a practitioner confronted with the question of which distribution or model to use on a given data set.

The composite models are already quite flexible, but given the different shapes a data set can take, there is a quest for even more flexible distributions. In the literature, some families of distributions have been proposed which contain many of the classical size distributions and hence can model very diverse behaviors. The most popular one is the generalized beta distribution presented in reference [55], and very recently reference [56] introduced a new flexible distribution called the interpolating family of size distributions. Those distributions are quite flexible as they make it possible to model very distinct shapes and probably constitute the future avenue of research in the area of size distributions.

## Author details

Yves Dominicy¹* and Corinne Sinner²*

*Address all correspondence to: yves.dominicy@ulb.ac.be and corinne.sinner@ulb.ac.be

1 Université libre de Bruxelles, SBSEM, ECARES, Brussels, Belgium

2 Université libre de Bruxelles, Département de Mathématique, Brussels, Belgium

## References

[1] Jones M.C. On families of distributions with shape parameters. International Statistical Review. 2015;83(2):175–192.

[2] Ley C. Flexible modelling in statistics: past, present and future. Journal de la Société Française de Statistique. 2015;156:76–96.

[3] Lawless J. Statistical Models and Methods for Lifetime Data. Wiley Series in Probability and Statistics. New York, USA: Wiley; 2003.

[4] Lee E., Wang J. Statistical Methods for Survival Data Analysis. Wiley Series in Probability and Statistics. New York, USA: Wiley; 2003.

[5] Marchenko Y., Genton M. Multivariate log-skew-elliptical distributions with applications to precipitation data. Environmetrics. 2010;21:318–340.

[6] Mitzenmacher M. A brief history of generative models for power law and lognormal distributions. Internet Mathematics. 2004;1(2):226–251.

[7] Eeckhout J. Gibrat's law for (all) cities. American Economic Review. 2004;94:1429–1451.

[8] Gabaix X. Power laws in economics: An introduction. Journal of Economic Perspectives. 2016;30:185–206.

[9] Clarke R. Estimating trends in data from the Weibull and a generalized extreme value distribution. Water Resources Research. 2002;38:25-1–25-10.

[10] Kleiber C., Kotz S. Statistical Size Distributions in Economics and Actuarial Sciences. New York: John Wiley & Sons; 2003.

[11] Asgharzadeh A., Nadarajah S., Sharafi F. Generalized inverse Lindley distribution with application to Danish fire insurance data. Communications in Statistics: Theory and Methods. Forthcoming.

[12] Ortega E.M.M., Lemonte A.J., Silva G.O., Cordeiro G.M. New flexible models generated by gamma random variables for lifetime modeling. Journal of Applied Statistics. 2015;42(10):2159–2179.

[31] Abdul-Moniem I.B., Abdel-Hameed H.F. On exponentiated Lomax distribution. International Journal of Mathematical Archive. 2012;3(5):2144–2150.

[32] Shams T.M. The Kumaraswamy-generalized Lomax distribution. Middle-East Journal of Scientific Research. 2013;17(5):641–646.

[33] Tahir M.H., Hussain M.A., Cordeiro G.M., Hamedani G.G., Mansoor M., Zubair M. The Gumbel-Lomax distribution: properties and applications. Journal of Statistical Theory and Applications. 2016;15(1):61–79.

[34] Fisher R.A., Tippett L.H.C. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Proceedings of the Cambridge Philosophical Society. 1928;24:180–290.

[35] Embrechts P., Klüppelberg C., Mikosch T. Modelling Extremal Events for Insurance and Finance. Berlin: Springer-Verlag; 1997.

[36] Guégan D., Hassani B.K. A mathematical resurgence of risk management: An extreme modeling of expert opinions. Frontiers in Finance and Economics. 2014;11(1):25–45.

[37] Burke E.J., Perry R.H.J., Brown S.J. An extreme value analysis of UK drought and projections of change in the future. Journal of Hydrology. 2010;388:131–143.

[38] Coles S. An Introduction to Statistical Modeling of Extreme Values. Berlin: Springer-Verlag; 2001.

[39] Finkenstädt B., Rootzén H. Extreme Values in Finance, Telecommunications and the Environment. London: Chapman & Hall/CRC; 2004.

[40] Jenkinson A.F. The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Quarterly Journal of the Royal Meteorological Society. 1955;81:158–171.

[41] Lindquist E.S. Strength of materials and the Weibull distribution. Probabilistic Engineering Mechanics. 1994;9(3):191–194.

[42] Manwell J.F., McGowan J.G., Rogers A.L. Wind Energy Explained: Theory, Design and Application. New York, USA: Wiley; 2009.

[43] Sharif N., Islam N. The Weibull distribution as a general model for forecasting technological change. Technological Forecasting and Social Change. 1980;18(3):247–256.

[44] Alvarado-Celestino E. Large forest fires: An analysis using extreme value theory and robust statistics [thesis]. University of Washington, USA; 1992.

[45] Klugman S.A., Panjer H.H., Willmot G. Loss Models: From Data to Decisions. New York: Wiley; 2008.

[46] Cooray K., Ananda M.M.A. Modeling actuarial data with composite lognormal-Pareto model. Scandinavian Actuarial Journal. 2005;5:321–334.

[47] Scollnik D.P.M. On composite lognormal-Pareto models. Scandinavian Actuarial Journal. 2007;1:20–33.


**Applications of Data Analysis in Finance and Economics**

#### **Modelling Limit Order Book Volume Covariance Structures**

#### Andrija Mihoci

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66152

## **Abstract**

Limit order volume data are analysed here using key multivariate techniques: principal components, factor, and discriminant analysis. The focus lies on understanding the covariance structure of the posted quantities of an asset to be potentially sold or bought at the market. Applying the methods to data on 20 blue-chip companies traded at the NASDAQ stock market in June 2016, one observes that two principal components account for approximately 85–95% of order book variation. The demand side (variability) has furthermore been the most important factor related to order book data variation. The order book data variation, moreover, successfully classifies stock price movements. Potential applications include improving order execution strategies, designing trading algorithms, and understanding price formation.

**Keywords:** limit order book, multivariate techniques, principal components analysis, factor analysis, discriminant analysis

## **1. Introduction**

The limit order book (LOB) trading mechanism has become the dominant way to trade assets on financial markets. Since the limit order book represents the liquidity supply of assets on a market, it essentially reflects the demand for as well as the supply of assets above the equilibrium price-volume point. Its variation affects the liquidity and price dynamics of an asset, and thus the goal of this study is to conduct a comprehensive multivariate analysis of the limit order book (variation) data.

Here we model the covariance structures of order book data of several assets by employing key multivariate methods. Theodore W. Anderson synthesized various subareas of the subject


and has influenced the direction of recent and current research in theoretical multivariate analysis [1]. The principal components, factor and discriminant analysis remain quite popular dimension-reduction and classification techniques that are applied in many research fields.

Multivariate techniques have, for example, recently been used in the financial econometrics of limit order book markets. Principal component analysis is performed in studies about commonalities in liquidity (measures), see, for example, [2, 3], or while analysing price impact data [4]. The dynamics of liquidity supply curves is captured by the so-called dynamic semiparametric factor model in [5], whereas [6] characterize traders' behaviour using discriminant analysis.

Our focus lies on understanding the variability of the posted quantities of the asset to be potentially sold or bought at the market. The volume (variation) at every order book level is analysed as a random variable, and thus we do not suppress the order book information through, for example, liquidity measures or reward functions. In this chapter, we consider the (full) structure of the covariance matrices. Potential applications thus include improving order execution strategies, understanding price formation and liquidity commonalities, and designing trading algorithms.

This study is organised as follows: after the limit order book data have been described in Section 2, the statistical methods are presented in Section 3. Empirical results are provided in Section 4, and Section 5 concludes.

## **2. Limit order book data**

The limit order book of an asset lists the volume of pending buying or selling orders at given prices for the asset under consideration and here we analyse its variance-covariance structure. At a fixed time point, the order book essentially represents a snapshot of the asset's demand and supply curves above the market equilibrium quantity level. The volume to be potentially bought forms the asset's demand (bid) side, whereas the volume to be potentially sold depicts the asset's supply (ask) side. To be more precise, the order book bid and ask curves represent liquidity supply, thus quantities above the equilibrium volume level, as orders below the equilibrium (would) have been traded at the market.

## **2.1. NASDAQ market data and descriptive statistics**

At the NASDAQ stock market, one of the world's largest securities exchanges, orders are posted nearly instantaneously and limit orders are executed in the order received. To visualize a limit order book, consider the data of Intel Corp. (INTC) on 30 June 2016, obtained from the data provider LOBSTER (lobsterdata.com). The numbers of shares to be potentially bought or sold at different prices at 10:00 and 11:00 are depicted in **Figure 1**. For example, at 10:00, at prices 32.14 (fifth best bid price) and 32.18 (best bid price), there are 16,834 and 2927 stocks demanded, respectively. At the same time, the numbers of offered shares at prices 32.19 (best ask price) and 32.23 (fifth best ask price) equal 1700 and 15,355, respectively. At 11:00, one furthermore observes that the order book has shifted in the direction of higher prices. We attribute this movement to the (observed) increased demand pressure.

and has influenced the direction of recent and current research in theoretical multivariate analysis [1]. The principal components, factor and discriminant analysis remain quite popular dimension-reduction and classification techniques that are applied in many research fields. Multivariate techniques are, for example, recently used in financial econometrics of limit order book markets. The principal component analysis is performed in the studies about commonalities in liquidity (measures), see, for example [2, 3], or while analysing price impact data [4]. The dynamics of liquidity supply curves is captured by the so-called dynamic semiparametric factor model in [5], whereas [6] characterize traders' behaviour using discriminant analysis. Our focus lies on understanding of the variability of posted quantities of the asset, to be potentially sold or bought at the market. The volume (variation) at every order book level is analysed as a random variable, and thus we do not suppress the order book information through, for example, liquidity measures or reward functions. In this chapter, we consider the (full) structure of the covariance matrices. Potential applications thus include improving order execution strategies, understanding price formation and liquidity commonalities, designing

188 Advances in Statistical Methodologies and Their Application to Real Problems

This study is organised as follows: after the limit order book data have been described in Section 2, the statistical methods are presented in Section 3. Empirical results are provided in

The limit order book of an asset lists the volume of pending buying or selling orders at given prices for the asset under consideration and here we analyse its variance-covariance structure. At a fixed time point, the order book essentially represents a snapshot of the asset's demand and supply curves above the market equilibrium quantity level. The volume to be potentially bought forms the asset's demand (bid) side, whereas the volume to be potentially sold depicts the asset's supply (ask) side. To be more precise, the order book bid and ask curves represent liquidity supply, thus quantities above the equilibrium volume level, as orders below the

At the NASDAQ stock market, one of the world's largest securities exchange, the orders are posted nearly instantaneously and the limit orders are executed in the received order. To visualize a limit order book, consider the data of Intel Corp. (INTC) on 30 June 2016, obtained from the data provider LOBSTER (lobsterdata.com). The number of shares to be potentially bought or sold at different prices at 10:00 and 11:00 are depicted in **Figure 1**. For example, at 10:00 at prices 32.14 (fifth best bid price) and 32.18 (best bid price), there are 16,834 and 2927 stocks demanded, respectively. At the same time, the number of offered shares at prices 32.19 (best ask price) and 32.23 (fifth best ask price) similarly equals 1700 and 15,355, respectively. At 11:00, one furthermore observes that the order book shifted to the direction of higher prices.

**Figure 1.** Observed limit order book for Intel Corp. (INTC) on 30 June 2016 at 10:00 and 11:00. The monotonically decreasing (increasing) functions represent the demand (supply) curves.

At the NASDAQ order book driven securities exchange, there are several event types that influence the bid and ask curves, namely submissions of new limit orders, cancellations, deletions and executions (lobsterdata.com). Our data set thus allows us to reconstruct all order book activities of a particular company over the course of a trading day. For a description of trading that is common to most limit order book markets, see, for example, [7].

The order book volume at a given price level is represented here by a *p*-dimensional random variable. Denote by $S = \left(S_1, \ldots, S_p\right)^{\top}$, $S_1 < \cdots < S_p$, the price vector and by $X = \left(X_1, \ldots, X_p\right)^{\top}$ the associated volume vector. The limit order book of an asset is given by the pairs

$$\{\{S\_1, X\_1\}, \ldots, \{S\_p, X\_p\}\}.\tag{1}$$

The expected volume vector is denoted by $\mu = \mathrm{E}\left[X\right]$ and the object of our interest, the limit order book volume variance-covariance matrix, by

$$\mathsf{Var}\left(X\right) = \Sigma = \mathsf{E}\left[\left(X - \mathsf{E}\left[X\right]\right)\left(X - \mathsf{E}\left[X\right]\right)^{\top}\right],\tag{2}$$

where $\Sigma$ is a symmetric $p \times p$ matrix whose main diagonal elements depict the variances of the pending volume at the fixed price levels $S_1, \ldots, S_p$.

Limit order book data of the 20 largest stocks traded at the NASDAQ stock market have been collected for the purpose of our analysis. In modelling the high-dimensional covariance structures of this object, we set $p = 10$. The volume at the demand side is thus represented by the variables $X_1, \ldots, X_5$, and the variables $X_6, \ldots, X_{10}$ form the supply regime. Since the "Brexit" referendum results had a significant influence on stock market movements, we correspondingly focus on the order book activities on 27 June 2016 (S&P 500 at its lowest level after the vote) and 30 June 2016 (upward movement of the S&P 500 series).


| Company | Ticker | 2016-06-27 | 2016-06-30 | % Decr. |
|---|---|---:|---:|---:|
| Apple Inc. | AAPL | 1,805,688 | 1,124,082 | 37.7 |
| Alphabet Inc. | GOOGL | 236,436 | 178,569 | 24.5 |
| Alphabet Inc. | GOOG | 202,449 | 152,442 | 24.7 |
| Microsoft Corporation | MSFT | 1,778,587 | 777,538 | 56.3 |
| Amazon.com, Inc. | AMZN | 212,951 | 245,500 | −15.3 |
| Facebook, Inc. | FB | 863,979 | 521,138 | 39.7 |
| Comcast Corporation | CMCSA | 694,958 | 367,544 | 47.1 |
| Intel Corporation | INTC | 1,260,947 | 603,475 | 52.1 |
| Cisco Systems, Inc. | CSCO | 870,147 | 477,008 | 45.2 |
| Amgen Inc. | AMGN | 171,111 | 135,631 | 20.7 |
| Gilead Sciences, Inc. | GILD | 635,498 | 443,749 | 30.2 |
| The Kraft Heinz Company | KHC | 133,353 | 166,864 | −25.1 |
| Walgreens Boots Alliance, Inc. | WBA | 278,448 | 336,216 | −20.7 |
| Starbucks Corporation | SBUX | 804,742 | 410,650 | 49.0 |
| Celgene Corporation | CELG | 338,872 | 304,187 | 10.2 |
| QUALCOMM Incorporated | QCOM | 709,635 | 419,285 | 40.9 |
| Costco Wholesale Corporation | COST | 150,007 | 141,545 | 5.6 |
| Mondelez International, Inc. | MDLZ | 414,248 | 699,600 | −68.9 |
| The Priceline Group Inc. | PCLN | 85,459 | 59,628 | 30.2 |
| Texas Instruments Incorporated | TXN | 798,510 | 475,992 | 40.4 |

**Table 1.** Number of limit order book observations on 27 and 30 June 2016 and the change (decrease) in % for the largest 20 stocks at NASDAQ.

The number of daily order book changes varies considerably across the investigated stocks, that is, between 59,628 and 1,805,688, see **Table 1**. Directly after the referendum results, there were many more order book changes than during the trading activities on 30 June 2016; for almost all stocks, the number of changes then decreased quite substantially.

Interestingly, the majority of the companies had, on average, more shares listed at the given price levels of the order book on 30 June 2016 than on 27 June 2016, see **Figure 2**.

**Figure 2.** Estimated average volume of the order book data for selected stocks on 27 June 2016 (solid) and 30 June 2016 (dashed).

For convenience, denote the observed $n \times p$ volume data matrix by $\mathcal{X}$. The expected value of $X$ is estimated by

$$
\hat{\mu} = n^{-1} \mathcal{X}^{\top} \mathbf{1}\_n \tag{3}
$$

with the $n \times 1$ vector of ones denoted by $\mathbf{1}_n$. The average posted quantities moreover exhibit a symmetric pattern when comparing the estimated volume at the bid and ask sides.

#### **2.2. Covariance structure estimation**

The results above indicate that the order book change count as well as the estimated average volume vector changed (substantially) on 30 June 2016 as compared to the market situation on 27 June 2016. Having estimated the mean vector, we are ready to focus on the (potential) changes in the variance-covariance matrices, that is, the covariance structures of the order book data. The covariance matrix of the order book volume is estimated by

$$
\hat{\Sigma} = n^{-1} \mathcal{X}^{\top} \mathcal{H} \mathcal{X} \tag{4}
$$

where $\mathcal{H} = \mathcal{I}_n - n^{-1} \mathbf{1}_n \mathbf{1}_n^{\top}$, with identity matrix $\mathcal{I}_n$, denotes the centring matrix, and $\mathbf{1}_n$ represents an $n \times 1$ vector of ones [8]. The empirical results are displayed in **Figures 3** and **4**, for the mega-cap and large-cap stocks, respectively. Since the analysed order book volume vector is a 10-dimensional object, $p = 10$, the axes of every graphical display represent the index of the random variable(s) under consideration. In total, there are 100 estimated covariance values displayed in each graph, that is, all values of the $10 \times 10$ matrix $\hat{\Sigma}$. For example, the upper left square of every graph denotes the estimated covariance between $X_1$ and $X_1$ (which equals the estimated variance of $X_1$); the lower left square represents the estimated covariance between $X_1$ and $X_{10}$, etc. The MATLAB function 'pcolor' has been used for generating **Figures 3** and **4**. The matrix values are used to define the vertex colours by scaling the values to map to the full range of the 'colourmap', see the MATLAB documentation for more details. Note that a darker (blue) colour shows a larger value of the estimated covariance between the random variables and vice versa.
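
A minimal sketch of the estimators in Eqs. (3) and (4), together with a pcolor-style display of the estimated covariance matrix, is given below. This is not the author's code: the LOBSTER data are not reproduced here, so the $n \times p$ volume matrix is simulated as a placeholder, and matplotlib's `pcolormesh` stands in for MATLAB's 'pcolor'.

```python
# Sketch of Eqs. (3)-(4): sample mean and covariance of an n x p order book volume matrix.
# The simulated volumes below are placeholders for one stock's observed LOBSTER data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, p = 1000, 10                                       # p = 10: bid levels X1..X5, ask levels X6..X10
X = rng.lognormal(mean=9.0, sigma=0.4, size=(n, p))   # placeholder volume observations (shares)

ones = np.ones((n, 1))
mu_hat = (X.T @ ones / n).ravel()                     # Eq. (3): estimated mean volume vector
H = np.eye(n) - np.ones((n, n)) / n                   # centring matrix H = I_n - n^{-1} 1_n 1_n'
Sigma_hat = X.T @ H @ X / n                           # Eq. (4): estimated covariance matrix
# (numerically identical to np.cov(X, rowvar=False, bias=True), which avoids forming H)

plt.pcolormesh(Sigma_hat)                             # analogue of the MATLAB 'pcolor' display
plt.colorbar(label="estimated covariance")
plt.xlabel("variable index")
plt.ylabel("variable index")
plt.show()
```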

**Figure 3.** Estimated covariance structure of order book data: mega-cap stocks on 27 June 2016 (upper panel for each stock) and 30 June 2016 (lower panel for each stock).

Our empirical results indicate several interesting findings. One observes relatively stronger variation in the individual volume variables than in the covariance levels across all stocks. We aim at identifying the linear combination that is responsible for the largest proportion of the data variation. There are furthermore relatively larger covariance levels between the bid and ask sides on 30 June 2016 in comparison with the levels on 27 June 2016, indicating a stronger impact of one market side on order book variation immediately after the referendum results. Our analysis aims particularly at selecting the most important factor associated with this variation.

**Figure 4.** Estimated covariance structures of order book data: large-cap stocks on 27 June 2016 (upper panel for each stock) and 30 June 2016 (lower panel for each stock).

## **3. Statistical modelling**

## **3.1. Modelling framework**

Recall that we model the limit order book volume as a *p*-dimensional random vector $X$ and denote its expected value by $\mu$, a $p \times 1$ vector, and the covariance matrix by $\operatorname{Var}(X) = \Sigma$, a $p \times p$ matrix. After observing $n$ realizations of $X$, that is, after obtaining the $n \times p$ order book volume matrix $\mathcal{X}$, the parameters $\mu$ and $\Sigma$ are estimated by expressions (3) and (4), respectively.

Among the multivariate techniques that deal with dimension reduction of high-dimensional random vectors, we focus in volume covariance structure modelling on principal components, factor and discriminant analysis. Multivariate techniques deal with simultaneous relationships among variables and differ from univariate and bivariate analyses in that they direct attention away from the analysis of the mean and variance of a single variable, or from the pairwise relationship between two variables, towards the analysis of the covariances and correlations among three or more variables [9].

#### **3.2. Principal components analysis**

Principal component analysis focuses on standardized principal components of a high-dimensional random variable. It was first introduced by Karl Pearson for nonstochastic variables and by Harold Hotelling for random vectors [10]. The low-dimensional representation enables us to study the correlation between the principal components and the original data; here our goal is to find the standardized linear combination of the order book volume vector that is associated with the largest order book variation. The technique is based on a very useful theorem [11], the spectral decomposition theorem. General results about eigenvalues and eigenvectors for square matrices and those for symmetric matrices are provided in [12].

The standardized linear combination of a *p*-dimensional variable $X = \left(X_1, \ldots, X_p\right)^{\top}$ that maximizes the order book variation uses the first eigenvector, associated with the first (largest) eigenvalue, of the spectral decomposition $\Sigma = \Gamma \Lambda \Gamma^{\top}$, with $\Lambda = \operatorname{diag}\left(\lambda_1, \ldots, \lambda_p\right)$, $\lambda_1 \geq \cdots \geq \lambda_p$, being the $p \times p$ diagonal matrix of eigenvalues and $\Gamma$ the $p \times p$ matrix of associated eigenvectors. The second largest variance proportion is explained by the linear combination using the second eigenvector, etc. The principal components are given by $Y = \Gamma^{\top}\left(X - \mu\right)$.

In modelling order book data, we estimate the principal components by

$$\mathcal{Y} = \left(\mathcal{X} - \mathbf{1}\_n \hat{\boldsymbol{\mu}}^{\top}\right) \hat{\Gamma} \tag{5}$$

with $\hat{\Gamma}$ the estimated matrix of eigenvectors from the spectral decomposition $\hat{\Sigma} = \hat{\Gamma} \hat{\Lambda} \hat{\Gamma}^{\top}$ and $\hat{\Lambda}$ the estimated $p \times p$ diagonal matrix of eigenvalues. For illustrative purposes, it often suffices to consider only the first two principal components, that is, the first two columns of the $n \times p$ matrix $\mathcal{Y}$.
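
A short sketch of Eq. (5) follows. It is an assumed illustration rather than the chapter's own code, and it again works on a simulated placeholder volume matrix.

```python
# Principal components from the spectral decomposition of the estimated covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.lognormal(mean=9.0, sigma=0.4, size=(1000, 10))   # placeholder n x p volume matrix
mu_hat = X.mean(axis=0)                                   # Eq. (3)
Sigma_hat = np.cov(X, rowvar=False, bias=True)            # Eq. (4)

eigval, eigvec = np.linalg.eigh(Sigma_hat)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigval)[::-1]              # re-order so the first column explains most variance
Gamma_hat = eigvec[:, order]

Y = (X - mu_hat) @ Gamma_hat                  # Eq. (5): n x p matrix of principal components
explained = eigval[order] / eigval.sum()
print("proportion of variance explained by the first two PCs:", explained[:2].sum())
```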

#### **3.3. Factor analysis**

In factor analysis, the random vector is modelled as a linear combination of a few common factors. The concept of latent factors seems to have been suggested by Francis Galton; the formulation and early development of factor analysis have their genesis in psychology and are generally attributed to Charles Edward Spearman [10]. Factor analysis aims to discover independent variables that describe the variation of a high-dimensional random variable with high explanatory power [13]. Formally, we consider a *k*-factor model

$$X = QF + U + \mu \tag{6}$$

where $F$ and $U$ denote the $k$- and $p$-dimensional common and specific factors, respectively [8]. The $p \times k$ matrix of factor loadings is denoted by $Q$. It is furthermore assumed that $\mathrm{E}\left[F\right] = 0$, $\operatorname{Var}\left(F\right) = \mathcal{I}_k$, $\mathrm{E}\left[U\right] = 0$, that $F$ and $U$ are uncorrelated, and that the covariance matrix $\Omega$ of the specific factors is diagonal.

The associated factor loadings represent the combinations which reflect the common variance part, and the remaining variation is quantified through the covariance matrix of the specific factors. In practice, we are consequently interested in estimating the matrix of common factor loadings $Q$ and the covariance matrix of the specific factors $\Omega$. Here we utilise the maximum likelihood method: assuming that the volume is multivariate normally distributed [8], the estimates are given by maximising the log-likelihood function, namely

$$\left(\hat{Q},\hat{\Omega}\right) = \arg\max\_{Q,\Omega} \left[ -\frac{n}{2}\log\left\{ \left| 2\pi \left( QQ^{\top} + \Omega \right) \right| \right\} - \frac{n}{2}\operatorname{tr}\left\{ \left( QQ^{\top} + \Omega \right)^{-1}\hat{\Sigma} \right\} \right] \tag{7}$$

where $n$ denotes the sample size and $\hat{\Sigma}$ the estimated covariance matrix, see Eq. (4).
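
As a hedged illustration of the one-factor estimation in Eqs. (6) and (7), the sketch below fits a $k = 1$ model with scikit-learn's maximum-likelihood factor analysis; it is not the author's implementation, and the volume matrix is again a simulated placeholder.

```python
# One-factor model for the order book volume: estimate loadings Q and specific variances Omega.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.lognormal(mean=9.0, sigma=0.4, size=(1000, 10))  # placeholder n x p volume matrix

fa = FactorAnalysis(n_components=1)          # k = 1 common factor, Gaussian maximum likelihood
fa.fit(X)

Q_hat = fa.components_.T                     # p x 1 matrix of estimated factor loadings
Omega_hat = np.diag(fa.noise_variance_)      # diagonal covariance matrix of the specific factors
level = int(np.argmax(np.abs(Q_hat))) + 1    # order book level loading most heavily on the factor
print(f"largest absolute loading at X_{level}")
```

Inspecting whether the largest loadings sit on the bid levels ($X_1,\ldots,X_5$) or the ask levels ($X_6,\ldots,X_{10}$) mirrors the way the driving factor (demand or supply) is identified in Section 4.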

#### **3.4. Discriminant analysis**

In discriminant analysis, multivariate data observations are classified into two or more known groups. A modern treatment of discriminant analysis, together with a brief history, is included in [10]. In the analysis of group differences, the authors of [13], for example, state two questions: (i) does a significant difference exist between the groups (variation) and (ii) which variables are responsible in this respect? In practice, a discriminant rule is used to classify existing and new observations, and the number of correctly classified observations reflects the quality of the approach. Here we are interested in the classification accuracy: to what extent a price change can be expected (or not) at each order book entry based exclusively on observed volume data.

Fisher's linear discriminant rule is based on a linear combination of the data, say $a^{\top} X$, with $a$ denoting a $p \times 1$ vector, and the idea is to find the $a$ that achieves a good separation [8, 14]. When the method is applied to two groups, one assumes that the data matrix $\mathcal{X}$ is split into two groups, say $\mathcal{X}_1$ and $\mathcal{X}_2$. Denote the sample sizes of these matrices by $n_1$ and $n_2$, the estimated mean vectors by $\hat{\mu}_1$ and $\hat{\mu}_2$, the estimated covariance matrices by $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$, and the centring matrices by $\mathcal{H}_1$ and $\mathcal{H}_2$. The linear combination that maximizes the ratio of the between-group sum of squares to the within-group sum of squares is given by

$$
\hat{a} = \mathcal{W}^{-1} \left( \hat{\mu}\_1 - \hat{\mu}\_2 \right) \tag{8}
$$

where the $p \times p$ matrix $\mathcal{W} = \mathcal{X}_1^{\top} \mathcal{H}_1 \mathcal{X}_1 + \mathcal{X}_2^{\top} \mathcal{H}_2 \mathcal{X}_2$ denotes the within-group sum of squares [8, 14].
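
A compact sketch of this rule is given below. It assumes the textbook two-group Fisher rule with a midpoint threshold (equal priors), and the group matrices are generic inputs rather than the study's data; the split into entries with and without a mid-quote price change follows Section 4.

```python
# Fisher's linear discriminant rule, Eq. (8), for two groups of order book volume observations.
import numpy as np

def fisher_rule(X1, X2):
    """X1, X2: n1 x p and n2 x p volume matrices of the two groups."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-group sum of squares W = X1' H1 X1 + X2' H2 X2
    W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    a = np.linalg.solve(W, mu1 - mu2)          # Eq. (8): a = W^{-1} (mu1 - mu2)
    threshold = a @ (mu1 + mu2) / 2            # assign x to group 1 whenever a'x > threshold
    return a, threshold

def correct_classification_rate(X1, X2):
    a, c = fisher_rule(X1, X2)
    correct = np.sum(X1 @ a > c) + np.sum(X2 @ a <= c)
    return correct / (len(X1) + len(X2))
```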

## **4. Empirical results**

An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result [15]. Consider, for example, the proportion of order book variance explained by two principal components in **Table 2**. Two principal components are sufficient to describe the order book variation, since the explained proportions range between 0.81 and 0.96 (27 June 2016) and between 0.78 and 0.97 (30 June 2016).

For most companies, the limit order book variation is clearly better explained on 30 June 2016 than on 27 June 2016. The largest increase in the explained proportion is evident for the smaller stocks, especially for SBUX, CELG, QCOM, COST and PCLN. Looking only at the descriptive results reported in **Table 1**, one would conclude that the number of changes behaves apparently similarly across all stocks. It is now evident that the demand and supply curves of the smaller stocks change relatively more strongly during turbulent times (here during a downward price movement). We attribute this to the relatively lower liquidity of the large-cap stocks as compared to the highly liquid mega-cap stocks.


| Stock | 2016-06-27 | 2016-06-30 | Stock | 2016-06-27 | 2016-06-30 |
|---|---:|---:|---|---:|---:|
| AAPL | 0.87 | 0.87 | FB | 0.84 | 0.85 |
| GOOGL | 0.85 | 0.87 | CMCSA | 0.94 | 0.89 |
| GOOG | 0.87 | 0.87 | INTC | 0.93 | 0.95 |
| MSFT | 0.90 | 0.95 | CSCO | 0.96 | 0.97 |
| AMZN | 0.85 | 0.78 | AMGN | 0.88 | 0.85 |
| GILD | 0.83 | 0.89 | QCOM | 0.90 | 0.95 |
| KHC | 0.93 | 0.88 | COST | 0.84 | 0.88 |
| WBA | 0.93 | 0.92 | MDLZ | 0.96 | 0.96 |
| SBUX | 0.83 | 0.91 | PCLN | 0.81 | 0.89 |
| CELG | 0.88 | 0.94 | TXN | 0.92 | 0.94 |

**Table 2.** Estimated proportion of explained order book volume variance by the first two principal components.

Factor analysis can be considered as an extension of principal component analysis, although both techniques can be viewed as attempts to approximate the covariance matrix; the approximation based on the factor analysis model, however, is more elaborate [15]. In the sequel, we choose a $k = 1$ factor model since we are interested in selecting the driving factor of order book variation. The results are depicted in **Tables 3** and **4** for the mega-cap and large-cap companies, respectively, based on the estimated values of the factor loadings $\hat{Q}$.


**Table 3.** Identified common factors based on the estimated factor loadings for investigated mega-cap and largest large-cap stocks.


**Table 4.** Identified common factors based on the estimated factor loadings for investigated large-cap stocks.

Across all stocks, demand is selected as the most important factor on 30 June 2016. The prices of the companies indeed reacted positively during this day. For most of the relatively illiquid large-cap stocks, interestingly, the same factor has been identified on both days. Its magnitude changed, as is evident from the principal components analysis.

Discriminant analysis cannot usually provide an error-free method of assignment of data, because there may not be a clear distinction between the measured characteristics of the populations, that is, the groups may overlap [15]. We report the proportions of correctly classified price changes based only on volume data in **Tables 5** and **6** for the selected mega-cap and largest large-cap stocks, and for the large-cap stocks, respectively.


**Table 5.** Estimated proportion of correctly classified price changes based on volume data for investigated mega-cap and largest large-cap stocks.


| Stock | 2016-06-27 | 2016-06-30 | Stock | 2016-06-27 | 2016-06-30 |
|---|---:|---:|---|---:|---:|
| GILD | 0.49 | 0.52 | QCOM | 0.50 | 0.64 |
| KHC | 0.47 | 0.50 | COST | 0.50 | 0.50 |
| WBA | 0.47 | 0.50 | MDLZ | 0.54 | 0.53 |
| SBUX | 0.45 | 0.55 | PCLN | 0.52 | 0.57 |
| CELG | 0.53 | 0.51 | TXN | 0.55 | 0.60 |

**Table 6.** Estimated proportion of correctly classified price changes based on volume data for investigated large-cap stocks.

The empirical findings suggest that limit order book volume data successfully classify price changes, especially on 30 June 2016, a day with a relatively low number of order book entries. Here the first group contains entries with a change of the mid-quote price $\left(S_5 + S_6\right)/2$, and the second group entries without a change. Our results show that the classification rates improved quite significantly for the extremely large and the smallest investigated stocks. The latter ones, as discussed above, exhibit a relatively well understood covariance structure on 30 June 2016.

## **5. Conclusions**

Limit order book data of 20 highly traded stocks at the NASDAQ market in June 2016 have been analysed. We select 2 days after the 'Brexit' referendum, namely, 27 June (lowest S&P 500 level) and 30 June (recovery day). The variable of interest is the 10-dimensional order book volume data vector, that is, the quantities pending at the five best levels of the demand side and at the five best supply side levels.

Two principal components account for approximately 85–95% of the order book data variation. The results of a one-factor model identify the demand (variation) as the most important factor explaining the order book covariance structure. The limit order book volume data variation is quite informative in predicting the price evolution (change or no change in the mid-quote) across all stocks and during the analysed trading activities. The mega-cap and the smallest investigated large-cap companies share almost the same classification performance. Finally, multivariate statistical techniques are successfully employed in covariance modelling of order book data.

## **Author details**

Andrija Mihoci

Address all correspondence to: Andrija.Mihoci@b-tu.de

Brandenburg University of Technology Cottbus-Senftenberg, Cottbus, Germany

## **References**


## **A Practical Approach to Evaluating the Economic and Technical Feasibility of LED Luminaires**

Sean Schmidt and Suzanna Long

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66216

#### **Abstract**

LED roadway luminaires are currently under consideration for widespread implementation by departments of transportation, facilities managers, and city planners. This research focuses on a case study in Missouri and presents relevant research findings calculated by the authors as part of a project funded by the Missouri Department of Transportation. Although high-pressure sodium (HPS) luminaires have been the standard product for roadway illumination, advances in LED technologies have led many departments of transportation to consider them as viable options along state routes. For this case study, pilot sites were developed across the state of Missouri at locations assessed as moderately busy, medium pedestrian conflict zones. These zones were along roadways with an R3 pavement classification. This case study details the economic feasibility findings from the study; a life cycle cost approach was used. In addition, a technical feasibility analysis was conducted to determine fit with Illumination Engineering Society (IES) standards for the traffic pattern and pavement classification at study sites. Key findings reveal that LED roadway luminaires fail to outperform HPS in their current design, but may become technically and economically feasible in the future.

**Keywords:** LED roadway luminaires, life cycle cost evaluation, field data, energy consumption, environmental impacts

## **1. Introduction**

As high-pressure sodium (HPS) roadway luminaires reach the end of their product life cycle, many state and local agencies, as well as city planners and utilities, are considering LED roadway luminaires as a replacement product [1]. Manufacturers of LED luminaires promote their benefits as longer useful life, reduced operations and maintenance costs, reduced environmental impact, and reduced energy cost. This case study presents a quantitative method for assessing the technical and economic feasibility of LED roadway luminaires for those considering the product for widespread implementation.

Previously, research has been completed on LED luminaires in field case studies sponsored by the Department of Energy's Energy Efficiency and Renewable Energy (EERE) program [2]. Research has also shown that the life cycle costs of LEDs may be higher than those of high-pressure sodium luminaires for both collector and local roads [3, 4]. The Illumination Engineering Society (IES) of North America standards for roadway illumination vary depending on three factors: classification of roadway, pedestrian conflict potential, and pavement classification. For example, if a roadway was classified as a moderately traveled major route with a low pedestrian conflict potential and an R3 pavement classification, then the minimum maintained average illuminance would be 9.0 lux [5]. If the same roadway's pedestrian conflict potential was reclassified to be a high potential of pedestrian conflict, then the minimum maintained average illuminance increases to 17.0 lux. The Illinois Center for Transportation produced a report providing background information on LED luminaires [6]. This report covers optics of LED lighting, advantages and drawbacks of LED lighting, and summarizes work previously completed by several GATEWAY demonstrations.

Manufacturers, government agencies, and utilities are collaborating to produce effective LED roadway luminaires. One such effort is the Department of Energy's Solid-State lighting GATEWAY Demonstration programs, which investigate real-world application of solid-state lighting technologies in various fields, such as roadway illumination, sidewalk illumination, and parking lot illumination. These programs have performed feasibility analyses on several types of LED luminaires across several uses. Thus far, the program has published reports on the use of LED lighting in parking lot [7] and minor roadway lighting [8]. Research has previously been performed on combining an economic analysis with a product performance analysis to develop street lighting standards [9–12]. In addition, the benefits of LEDs have been investigated in a previous research, such as reduced concern of the power factor of electricity loads [13]. Another potential benefit is the ability to rapidly start luminaires without harmful impact on the luminaire's lifetime, which can allow for "smarter" usage of lighting systems [14].

The Missouri Department of Transportation commissioned research into the feasibility of replacing existing high-intensity-discharge roadway lighting with LED lighting luminaires. The goals of this research include a technical analysis on the ability of LED luminaires to meet the minimum performance standards set by the Illumination Engineering Society and an economic analysis to compare the life cycle costs of replacing existing lighting luminaires with LED luminaires.

## **2. Evaluation of LED roadway luminaires**

## **2.1. LED luminaire data collection methodology**

Illumination readings were collected from LED luminaire testing sites throughout the state of Missouri. The luminaires studied are currently used on roadways throughout Missouri. These readings were collected for LEDs produced by several manufacturers at three HPS equivalent power ratings: 150, 250, and 400 W. A total of eight unique manufacturer's LED luminaires were studied in this research.

Data collection points are based on a function of the pole spacing between luminaires and the width of the traffic lane at the location of the luminaire. Using intervals of one quarter of the distance between the target pole and adjacent poles minimizes interference caused by nearby streetlights. The pole spacing, roadway width, distance between the pole and the outer lane, and location of the luminaire were measured, in feet, for each luminaire using a perambulator. In order to minimize the impact of nearby sources of light, illuminance readings were collected such that the readings were directed toward the target luminaire. An illuminance meter was used to measure the lux at each field data location. The illuminance meter is greatly affected by the direction in which the eyelet of the device points; therefore, in order to minimize error, the maximum reading was recorded for each data point. Data were collected at intervals of ¼ of the pole spacing, that is, the distance between two luminaires; pole spacing varied between data collection sites. Perpendicular data collection intervals along the road were equal to one lane of traffic, which in most cases was approximately 12 feet (3.66 m).

Key characteristics for the field study are repeated from the final report of the funded MoDOT project [15]. For each luminaire, 31 readings, including 15 readings at the ground level and 15 readings elevated 18 inches above the ground level, were collected. In addition, field data include one ambient reading collected from a nonilluminated area. Ambient illuminance readings were collected approximately 20 feet behind the luminaire in order to be outside of the illuminated area. In order to determine the role of naturally occurring light sources, ambient readings were collected. This includes ambient lighting from nearby outdoor area lighting. To calculate adjusted field readings, ambient readings were subtracted from the field readings. This was then compared to the .ies file data for each studied luminaire. **Figure 1** indicates the locations used for data collection points as well as the direction of the illuminance meter.

**Figure 1.** LED field testing methodology.

**Table 1.** LED field data and manufacturer claims.

Once the field data collection phase ended, the manufacturer's .ies file for each luminaire was compared with field results to validate the manufacturer's claims. The variation between the field data and each manufacturer's claim was analyzed and is shown in figures within the field data evaluation and assessment section. Standards were created by the Illumination Engineering Society and are set in RP-08 [5]. These IES standards set a minimum of 13.0 lux for moderately busy, medium pedestrian conflict roads with an R3 pavement classification. The desired average:minimum uniformity ratio for such a road is 3.0. Using the previously mentioned methodology, field data were collected for eight different LED luminaires across three HPS equivalent power ratings (150, 250, and 400 W). Five 150, two 250, and one 400 W equivalent luminaires were studied in this research. All field data collected were then compared to the IES standards provided by the manufacturer, and the average to minimum uniformity ratios were calculated in accordance with the IES RP-08 publication [5]. All tested luminaires were installed within 9 months of data collection; therefore, light loss factors were not applied to collected illuminance values.
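
The check against the RP-08 criteria amounts to comparing the average of the adjusted readings with the 13.0 lux minimum and the average-to-minimum ratio with 3.0. The sketch below illustrates this with made-up reading values; real input would be a luminaire's adjusted field readings (field minus ambient).

```python
# Illustrative RP-08-style feasibility check for one luminaire (reading values are placeholders).
import numpy as np

readings = np.array([14.2, 9.1, 21.5, 6.8, 12.4, 18.0, 7.9, 15.3, 11.2, 20.7,
                     8.5, 13.6, 16.9, 10.4, 19.2])        # adjusted illuminance readings (lux)

avg, mn = readings.mean(), readings.min()
meets_level = avg >= 13.0            # minimum maintained average illuminance for the study sites
meets_uniformity = avg / mn <= 3.0   # desired average:minimum uniformity ratio
print(f"avg = {avg:.1f} lux, avg/min = {avg / mn:.1f}, "
      f"technically feasible: {meets_level and meets_uniformity}")
```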

Four out of the eight luminaires met the minimum average illuminance criterion of 13.0 lux at 30-foot mounting heights for both the field readings and the manufacturer's claims. The field readings for LED B fulfilled the criteria for minimum average illuminance and average to minimum ratio; however, the manufacturer's provided .ies file did not meet this claim, and LED B is therefore deemed technically infeasible. LEDs A, C, and D were deemed technically infeasible due to a combination of field readings and/or manufacturer's claims. LEDs F, G, and H were deemed technically infeasible due to their average to minimum ratios exceeding the 3.0 recommendation set by IES. The technical feasibility of HPS luminaires was not evaluated due to their current use in the transportation lighting field. LED fixtures were tested at mounting heights of 30 feet or less, since a significant portion of luminaires in Missouri is installed at these heights (**Table 1**).

## **3. Economic feasibility analysis**


In order to conduct a thorough economic feasibility analysis of LED luminaires, several factors must be considered. These factors were originally reported in the MoDOT report and are repeated here [15]. Equivalencies are determined by grouping luminaires for comparison with the most appropriate high-pressure sodium luminaire. Manufacturers have worked to produce LED luminaires that are specifically designed as equivalent replacements for traditional high-intensity-discharge (HID) lamps. This allows transportation organizations the option of directly replacing traditional luminaires with LED luminaires, but other factors must be considered as well.

Second, the fiscal feasibility of LED luminaires relies heavily on the assumptions made pertaining to lifetime, labor hour cost, overhead, equipment costs, repair costs, discounts for ordering in large quantities, and electricity efficiency. The assumptions in this economic analysis include: HPS luminaires are replaced after 3 years, LED luminaires remain in operation for 12 years, the labor cost for relamping or retrofitting luminaires is \$60, and the costs for replacing high-pressure sodium lamps for 150, 250, and 400 W lamps are \$100, \$130, and \$160, respectively [15].

The economic analysis assumes high-pressure sodium luminaires are replaced every 3 years. This assumption can easily change to reflect a transportation agency's views of scheduling HPS replacements. The assumption of 3 years accounts for the reduction in luminaire lifetime due to vibration and shock, which is prevalent along bridges and overpasses, and spot replacement of HPS luminaires. In contrast, some transportation agencies wait until the HPS lamp fails catastrophically, which maximizes the lifetime of each luminaire.

Another key assumption is that LED luminaires will remain in operation for a 12-year life expectancy. Many manufacturers claim their luminaires will operate beyond 50,000 hours (approximately 12 years at an annual usage of approximately 4000 hours); however, the most common claim is a 12-year lifetime, and 12 years is a conservative lifetime overall for LED luminaires. Therefore, 12 years was used for the LED luminaire lifetime in the economic analysis.

Perspective on labor costs significantly affects the outcome of the economic analysis. Organizations that do not consider maintenance savings as a large factor to their organization will not likely find LED luminaires beneficial. For example, City Utilities in Springfield, MO, replaces traditional street lighting technology on the downtime of their line workers. City Utility policy states that there must be line workers on duty 24 hours per day, 7 days per week in order to respond to outages and emergencies. Therefore, when City Utilities economically analyzed LED luminaires, the results did not favor LED luminaires because the avoided maintenance costs were not included in economic analysis. It is essential for each agency to consider their perspective on replacing or repairing luminaires when performing an economic analysis.

Labor cost to retrofit or relamp a light pole with an LED or an HPS luminaire was assumed to be \$60 per luminaire. With lighting labor costs around \$25–\$35 per hour, the labor cost was averaged and doubled to \$60 in order to account for overhead, equipment cost, setup, and travel time to estimate a conservative labor cost.

The costs for replacing high-pressure sodium luminaires vary by the wattage of the lamp being replaced. For the lowest wattage bulb, a \$100 cost is used which is based on related LED luminaire analyses. The costs of 250 and 400 W bulbs were estimated to be \$130 and \$160, respectively. The costs are based on the cost of the lamp being replaced, the cost of labor repairing the lamp's ballast, and the cost of vehicles and equipment to travel to and reach the luminaire.

As previously mentioned, costs may be reduced once roadway lighting demand shifts its focus solely toward LEDs. Economies of scale will then be realized, which similarly occurred in LED traffic signal indicators, and prices of LED luminaires will decrease significantly.

#### **3.1. Life cycle analysis**

To determine the economic feasibility of LEDs, all costs to install, operate, and dispose of the luminaire are included in the analysis. The installation and disposal costs are accounted for in the retrofitting and relamping labor cost. In addition, the cost of powering the luminaire was calculated based on a sample of actual energy consumption. The actual energy consumption was then extrapolated to other luminaires based on the relative wattages between the luminaires whose energy consumption was known and the remaining luminaires. Energy consumption for HPS luminaires was calculated using system wattages.

In order to make a fair comparison between HPS luminaires with assumed lifetimes of 3 years and LED luminaires with expected lifetimes of 12 years, the total cost to install and operate a luminaire was annualized. This allows for a fair economic comparison between products with varying lifetimes. An expected project return of 3% was used to annualize costs.
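
The annualization described above corresponds to spreading each luminaire's initial cost over its lifetime with a capital recovery factor at the 3% project return and adding the annual electricity cost. The sketch below is an illustration of this calculation rather than the project's own worksheet; the 250 W HPS and LED F inputs are the values reported in **Table 3**, and the printed results match that table's annualized costs.

```python
# Annualized life cycle cost: initial cost spread over the lifetime at a 3% return,
# plus the annual electricity cost.
def annualized_cost(initial_cost, lifetime_years, annual_electricity, rate=0.03):
    crf = rate * (1 + rate) ** lifetime_years / ((1 + rate) ** lifetime_years - 1)
    return initial_cost * crf + annual_electricity

print(annualized_cost(190.00, 3, 48.80))    # 250 W HPS (Table 3) -> about 115.97 per year
print(annualized_cost(760.00, 12, 40.26))   # LED F     (Table 3) -> about 116.61 per year
```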

Using the information in **Tables 2**–**4**, the annualized costs of LED luminaires are equivalent to, or approaching equivalency with, those of HPS lamps. This evaluation of the luminaires was based on pricing for small purchase orders, except for the manufacturer of LED E, which quoted a discounted price for orders of 1000 or more luminaires.

## **3.2. Replacement period analysis**

A potential methodology to level the roadway lighting expenditures while transitioning from HPS luminaires to LED luminaires would be to slowly phase in LED luminaires. By transitioning to LEDs at a rate of the inverse of the expected lifetime of LED luminaires, the annual investment in LEDs is uniform. For example, if LEDs are rated to last for 12 years of use, then 1/12 of lamps should be replaced with LEDs every year. This allows for approximately constant replacement of LED luminaires once the transition from HPS is completed because the failure rate of the LED luminaires will be evenly distributed throughout 12 years.
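
A small illustration of the phase-in rule described above follows; the fleet size and unit price are hypothetical placeholders, not values from the MoDOT report.

```python
# Level annual LED investment: replace 1/lifetime of the inventory each year.
fleet_size = 1200            # hypothetical number of luminaires managed by an agency
led_lifetime_years = 12
unit_price = 700.00          # hypothetical installed LED price per luminaire

per_year = fleet_size / led_lifetime_years
print(f"replace about {per_year:.0f} luminaires per year "
      f"(roughly ${per_year * unit_price:,.0f} in LED purchases annually)")
```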

It is recommended to replace the LED luminaires in large, continuous sections. This will allow for more consistency in overhead street lighting for long sections of road. This will prevent the need to change between the high-pressure sodium and LED luminaires.


**Table 2.** Economic analysis of 150 W equivalent luminaires [15].


| Luminaire | 250 W HPS | LED F | LED G |
|---|---:|---:|---:|
| Price | \$130.00 | \$700.00 | \$712.00 |
| Expected lifetime (years) | 3 | 12 | 12 |
| Project rate of return | 3% | 3% | 3% |
| Pole installation costs | 0 | 0 | 0 |
| Relamping/retrofit labor costs | \$60.00 | \$60.00 | \$60.00 |
| Initial cost per life cycle | \$190.00 | \$760.00 | \$772.00 |
| Annual electricity consumption | \$48.80 | \$40.26 | \$44.48 |
| **Annualized cost** | **\$115.97** | **\$116.61** | **\$122.04** |

**Table 3.** Economic analysis of 250 W equivalent luminaires [15].


| Luminaire | 400 W HPS | LED H |
|---|---:|---:|
| Price | \$160.00 | \$800.00 |
| Expected lifetime (years) | 3 | 12 |
| Project rate of return | 3% | 3% |
| Pole installation costs | 0 | 0 |
| Relamping/retrofit labor costs | \$60.00 | \$60.00 |
| Initial cost per life cycle | \$220.00 | \$860.00 |
| Annual electricity consumption | \$78.08 | \$66.72 |
| **Annualized cost** | **\$155.86** | **\$153.12** |

**Table 4.** Economic analysis of 400 W equivalent luminaires [15].

## **3.3. Sensitivity analysis**

**Figures 2** and **3** demonstrate the sensitivity of one LED luminaire's and one HPS luminaire's annualized cost to changes in four variables: luminaire price, expected luminaire lifetime, relamping/retrofit labor cost, and annual electricity consumption. Each variable varies between 75 and 125% of its original value, in 12.5% intervals. The sensitivity analysis determined the variables with the greatest impact on the annualized cost of LED luminaires. In addition, an incremental economic analysis was performed. The results of the incremental analysis are displayed in **Table 5**. This analysis used the same values as the sensitivity analysis but calculated the change in annual worth per 1% change in each variable. Due to the nonlinearity of the expected lifetime variable, the incremental analysis results for this variable were averaged.
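
The sweep itself can be written in a few lines. The sketch below is a hedged illustration, not the study's code: it reuses the annualized-cost helper from the life cycle analysis sketch, and the LED base values are placeholders rather than figures from the report.

```python
# One-at-a-time sensitivity sweep: each input varies from 75% to 125% of its base value.
def annualized_cost(initial_cost, lifetime_years, annual_electricity, rate=0.03):
    crf = rate * (1 + rate) ** lifetime_years / ((1 + rate) ** lifetime_years - 1)
    return initial_cost * crf + annual_electricity

base = {"price": 700.00, "lifetime": 12.0, "labor": 60.00, "electricity": 45.00}  # placeholders

def total_cost(v):
    return annualized_cost(v["price"] + v["labor"], v["lifetime"], v["electricity"])

for name in base:
    for factor in (0.75, 0.875, 1.0, 1.125, 1.25):
        varied = dict(base, **{name: base[name] * factor})
        print(f"{name:11s} at {factor:5.3f} x base -> {total_cost(varied):7.2f} per year")
```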

#### **3.4. Sensitivity analysis results**

The results of the sensitivity analyses in **Figures 2** and **3** contrast the differences between HPS and LED luminaires as costs change. LED luminaires are significantly less sensitive to changes in retrofitting costs, which consist mostly of labor costs. However, LED luminaires are significantly more sensitive to changes in the expected lifetime of the luminaire. Changes in the price of the luminaires linearly impact the annualized cost of the respective luminaire. Changes in each luminaire's expected lifetime result in an inverse exponential change in the annualized cost of the luminaire. Thus, the greater the deviation of the actual lifetime from the expected lifetime, the exponentially greater the impact the lifetime of the luminaire has on its annualized cost. Therefore, it is imperative that estimates of an LED luminaire's expected lifetime be accurate.

**Figure 2.** 150 W HPS sensitivity analysis.

**Figure 3.** LED A sensitivity analysis.

The results of the economic sensitivity analysis show the change in annualized cost per 1% change in a variable value. For example, if the price of LED A decreased by 10%, the annualized cost decreases by \$7.00. Due to the nonlinearity of the expected lifetime variable, the incremental sensitivity analysis was linearly approximated in order to compare results across all variables. The results of the incremental analysis provide a starting point for effective estimation of annualized costs to account for changes in variable values.

**Incremental economic sensitivity analysis**

| Luminaire | Price | Expected lifetime (years) | Relamping/retrofit labor costs | Annual electricity consumption |
| --- | --- | --- | --- | --- |
| 150 W HPS | \$0.35 | (\$0.56) | \$0.21 | \$0.29 |
| LED A | \$0.70 | (\$0.66) | \$0.06 | \$0.27 |
| LED B | \$0.70 | (\$0.66) | \$0.06 | \$0.29 |
| LED C | \$0.74 | (\$0.69) | \$0.06 | \$0.29 |
| LED D | \$0.70 | (\$0.66) | \$0.06 | \$0.26 |
| LED E | \$0.59 | (\$0.57) | \$0.06 | \$0.32 |
| 250 W HPS | \$0.46 | (\$0.67) | \$0.21 | \$0.49 |
| LED F | \$0.70 | (\$0.66) | \$0.06 | \$0.40 |
| LED G | \$0.72 | (\$0.67) | \$0.06 | \$0.44 |
| 400 W HPS | \$0.57 | (\$0.77) | \$0.21 | \$0.78 |
| LED H | \$0.80 | (\$0.75) | \$0.06 | \$0.67 |

**Table 5.** Incremental sensitivity analysis of HPS and LED luminaires.
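The per-1% figures in **Table 5** can be approximated by perturbing one input at a time and recomputing the annualized cost. The sketch below does this for LED H using the **Table 4** inputs; it is an assumed reconstruction of the incremental procedure rather than the authors' exact method (in particular, the chapter averages the nonlinear lifetime effect over the full 75–125% range, whereas this sketch uses a single 1% step).

```python
def annualized_cost(price, labor, lifetime, rate, electricity):
    # Annualized life cycle cost using the capital recovery (A/P) factor.
    crf = rate * (1 + rate) ** lifetime / ((1 + rate) ** lifetime - 1)
    return (price + labor) * crf + electricity


base = {"price": 800.00, "labor": 60.00, "lifetime": 12, "rate": 0.03,
        "electricity": 66.72}                     # LED H inputs from Table 4
base_cost = annualized_cost(**base)

# Change in annualized cost per 1% increase in each variable (compare Table 5).
for var in ("price", "labor", "lifetime", "electricity"):
    perturbed = dict(base)
    perturbed[var] *= 1.01
    delta = annualized_cost(**perturbed) - base_cost
    print(f"{var:12s} {delta:+.2f} per 1% increase")
# price ~ +0.80, labor ~ +0.06, electricity ~ +0.67;
# lifetime ~ -0.71 here (nonlinear; Table 5 reports an averaged value of -0.75)
```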

#### **3.5. Energy consumption and environmental impact analysis**

Energy consumption data were obtained for one of the studied luminaires (LED A) at two separate intersections, both located in St. Louis, MO. The data were normalized to account for the number of days in each month, the hours of operation in each month, and the number of luminaires operated at each intersection, and were then separated by month and analyzed. **Figure 4** depicts the energy consumption in watts per luminaire per month.
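The normalization step amounts to converting a metered monthly total into an average per-luminaire power. The following sketch shows one way to express it; the metered total, nightly operating hours and luminaire count are hypothetical placeholders, not the project's data.

```python
import calendar


def watts_per_luminaire(monthly_kwh, year, month, hours_per_night, n_luminaires):
    """Average power draw per luminaire, normalized for the number of days in the
    month, the nightly hours of operation and the number of luminaires metered."""
    days = calendar.monthrange(year, month)[1]
    operating_hours = days * hours_per_night
    return monthly_kwh * 1000 / (operating_hours * n_luminaires)


# Hypothetical metered total for one intersection in January
print(round(watts_per_luminaire(monthly_kwh=95.0, year=2012, month=1,
                                hours_per_night=11, n_luminaires=4), 1))  # watts per luminaire
```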

**Figure 4** shows the increase in electricity consumption between October and December, which endures through the month of February. The increase in consumption over this period averages 32% and is independent of the duration for which the lights operate. The approved product list process section suggests studying this effect further on more luminaires by assessing each luminaire during both the summer and winter seasons.

The sharp decrease in consumption in March at the intersection of Route 30 and Main Drive is due to a traffic crash that removed the pole for a period of time. With no replacement LED in stock, one had to be ordered.

Energy consumption was also measured to determine the energy savings of LED luminaires. Our analysis shows an actual energy savings of 11% for 150 W equivalent luminaires. Equivalent LED power consumption data could not be obtained for 250 W or 400 W HPS luminaires.

**Figure 4.** Electricity consumption per luminaire by month.

For a 150 W HPS lamp, with a system rating of 183 W, the equivalent energy savings is 80.5 kWh per year. According to an EPA study from 2000, the average electrical generation portfolio releases 1.341 lbs (0.608 kg) of CO<sub>2</sub> into the atmosphere per kWh of electricity consumed [16]. Therefore, replacing one 150 W HPS lamp with the LED A luminaire avoids the release of approximately 108 lbs of CO<sub>2</sub> into the atmosphere.
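The avoided-emissions figure follows from simple arithmetic on the measured savings and the EPA emission factor [16]. In the sketch below, the roughly 4,000 annual operating hours implied by the 80.5 kWh figure is an inference, not a value stated in the chapter.

```python
system_rating_w = 183       # 150 W HPS lamp, system rating in watts
savings_fraction = 0.11     # measured energy savings for the LED A retrofit
annual_hours = 4000         # assumed dusk-to-dawn operation (~11 h/night); inferred, not stated

saved_kwh = system_rating_w * savings_fraction * annual_hours / 1000
co2_lbs_per_kwh = 1.341     # EPA (2000) average generation portfolio [16]
avoided_lbs = saved_kwh * co2_lbs_per_kwh

print(f"Energy saved: {saved_kwh:.1f} kWh/year")    # ~80.5 kWh
print(f"CO2 avoided:  {avoided_lbs:.0f} lbs/year")  # ~108 lbs
```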

## **4. Conclusions and future work**


Performance and cost are major issues when considering a change in technologies such as the transition to LED roadway luminaires. Performance was a major issue in the early development of LED roadway luminaires. Most manufacturers invested in product development to ensure that LED roadway luminaires performed at similar or higher levels than HPS roadway luminaires. These initial investments focused on luminaires with 30-foot (9.1 m) mounting heights and have more recently moved toward mounting heights of 40 feet (12.2 m) or higher.

Performance of the LED roadway luminaire, when compared to the currently preferred HPS roadway luminaire, has improved over the past few years. Impacted parties (manufacturers, public agencies, utilities, etc.) have joined together with the intent of producing an LED roadway luminaire that can be used as an equivalent replacement. Manufacturers have invested in new generations of LED roadway luminaires that continue to close the gap between HPS and LED performance. Local agencies and utilities continue to evaluate and report findings on these new generations. These performance improvements have led some agencies, such as the City of Los Angeles, to make major investments in the transition to LED roadway luminaires.

Results from this research reveal that LED luminaires are less sensitive to changes in retrofitting costs (consisting mostly of labor costs). However, LED luminaires are more sensitive to changes in the expected lifetime of the luminaire. Changes in the price of the luminaires impact the annualized cost linearly, while changes in each luminaire's expected lifetime result in an inverse exponential change in the annualized cost. Based on these findings, it is essential that the lifetime estimates used in life cycle costing of LED luminaires be as accurate as possible. Moreover, the economic sensitivity analysis reveals that incremental analysis provides an effective mechanism for estimating annualized cost.

Energy and environmental analysis shows promising results as well. For a 150 W HPS lamp with a system rating of 183 W, the equivalent energy savings is 80.5 kWh per year. According to an EPA study from 2000, the average electrical generation portfolio releases 1.341 lbs (0.608 kg) of CO<sub>2</sub> into the atmosphere per kWh of electricity consumed. Therefore, replacing one 150 W HPS lamp with the LED A luminaire avoids the release of approximately 108 lbs of CO<sub>2</sub> into the atmosphere.

Based on our analysis, LED luminaires are a promising technology for the replacement of high-pressure sodium lamps. As the technology matures, more robust analysis will confirm the efficacy of the approach.

## **Acknowledgements**

This project was partially funded through the Missouri Department of Transportation (TRyy1101) and the data used in this case was originally published in the corresponding final report. The authors would like to thank Tom Ryan, Dr. A. Curt Elmore, and Dr. Ruwen Qin for their input and guidance throughout the research project. We also want to thank Julie Stotlemeyer and Jen Harper, MoDOT, for their valuable assistance.

## **Author details**

Sean Schmidt and Suzanna Long\*

\*Address all correspondence to: longsuz@mst.edu

Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO, USA

## **References**


[1] *LED Street Lighting Efficiency Program*. Los Angeles: City of Los Angeles – Bureau of Street Lighting, 2011. http://bsl.lacity.org. Accessed December 2011.

[2] Pacific Northwest National Laboratory. *Demonstration Assessment of Light-Emitting Diode (LED) Roadway Lighting*. Washington D.C.: U.S. Department of Energy, 2009.

[3] Radetsky LC. *Specifier Reports: Streetlights for Collector Roads*. Troy, NY: Rensselaer Polytechnic Institute, 2010.

**Applications of Data Analysis in Medicine**


## **Validation of Instrument Measuring Continuous Variable in Medicine**

Rafdzah Zaki

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66151

#### **Abstract**

In medicine, accurate measurement of clinical values is vital, either at the stage of health screening, diagnosing cases or making a prognosis. Numerous instruments and machines have been invented for the purpose of measuring various clinical variables such as blood pressure, glucose level, body temperature and oxygen level. When a new method of measurement or instrument is invented, the quality of the instrument has to be assessed. This chapter will focus on the application of statistical methods used to analyse continuous data in a method comparison study or validation study in medicine. The concepts of validity and analysis in method comparison studies will be discussed. This chapter also reviews the theoretical aspects of several common methods and approaches that have been used to measure agreement and reliability, including the Bland-Altman limits of agreement (LoA) and the intra-class correlation coefficient (ICC). Issues related to method comparison studies will be highlighted, including the evaluation of agreement and reliability in a single study, the application of multiple statistical methods and the use of inappropriate methods in testing agreement and reliability. Finally, the importance of education in method comparison studies among medical professionals will be emphasized.

**Keywords:** agreement, reliability, medical instrument, continuous variable

## **1. Introduction**

In medicine, accurate measurement of clinical values is vital, either at the stage of health screening, diagnosing cases or making a prognosis. For example, accurate measurement of blood pressure, heart rate and oxygen level is crucial for monitoring patients under general anaesthesia in surgery. Inaccurate measurement of these variables will result in inappropriate management of the patient, thus putting the patient's life at risk.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Most of the important variables measured in medicine are numerical or continuous in nature, such as blood pressure, glucose level, oxygen level, weight, height, body temperature, creatinine level, albumin level and many other clinical values. Numerous instruments and machines have been invented for the purpose of measuring these variables. Some measurements are obtained using invasive techniques and expensive procedures. Consequently, new instruments and tests are constantly being developed, with the aim of providing cheaper, non-invasive, more convenient and safer methods. Whether a test's outcome can support trustworthy judgements or decisions depends particularly on the measurement quality of the test [1].

When a new method of measurement or instrument is invented, the quality of the instrument has to be assessed. We want to know by how much the value of measurements obtained using the new method differs from the old method, or from the gold standard. Information provided by any clinical instrument cannot be trusted and licitly used in any judgement and decision-making process if its measurement quality has not been evaluated. This is where a method comparison study, or validation study, comes into medicine. Clinimetric properties indicating that the test is reliable and valid should be considered fundamental for determining the measurement quality of any test [2]. In general, clinimetrics refers to the development of methodological and statistical methods applicable in clinical medicine in order to assign numbers or scores to observable clinical events [3, 4].

#### **1.1. Validity**

An instrument is considered to be valid if it measures what it is intended to measure [3]. The term 'validity' actually has a wide range of classification and definition. In clinical research, current and accepted validity concepts include *criterion* validity, *construct* validity and *content* validity, the first two being the most relevant for performance-based tests [5].

Criterion validity is used to examine the extent to which a measurement instrument provides the same results as the gold standard [5]. This type of validity is the most powerful in terms of its usefulness, and is divided into two types: *concurrent* validity and *predictive* validity. Of these, *concurrent validity* is the most used method. This is when we are trying to compare a new measurement tool with the criterion measure, both of which are given at the same time [5, 6]. The new tool is usually simpler, cheaper or less invasive compared to the standard or currently used tools. In contrast, in *predictive validity*, the criterion will not be available until sometime in the future. When no gold standard is available, the common alternative is to use an accepted and well-grounded reference test to relate to the evaluated test [7, 8]. Generally, this form of validity is used in developing instruments that allow us to get earlier answers or to give earlier predictions than current instruments can provide [5].

Construct validity refers to the degree to which a test measures a hypothetical, non-observable construct, and this validity can be established by relating the test to the outcomes of other instruments [1, 5]. It is used when we are dealing with more abstract variables or factors that cannot be measured directly, for example, levels of anxiety and pain [5]. We cannot see or directly measure anxiety, but we can observe other factors related to anxiety (according to theory) such as sweaty palms and tachycardia. The proposed underlying factors are referred to as *hypothetical constructs*

or simply known as *constructs* [5]. So, *construct validity* is the next best option in the absence of an acceptable gold standard. The measurement of instrument under study will be compared with other instruments that claim to measure the same construct [5].

Content validity is a closely related concept, consisting of a judgement whether the instrument samples all the relevant or important content or domains [5, 9]. Content validity can be claimed when a test logically and obviously measures what it purposes to measure [5, 6]. The relationship between the phenomenon being measured and the test score(s) is determined by a panel of experts or researchers [6].

#### **1.2. Reproducibility**


Another approach to assessing the quality of a measuring instrument is to assess its *reproducibility*. This is when we are interested in whether the new instrument is able to produce values similar to those produced by the old or standard instrument. In the literature, the term reproducibility is often used interchangeably with reliability, repeatability, consistency, agreement and stability [9]. Recently, de Vet et al. [3, 10] advocated that reproducibility is the proper term to use in clinical research, making the distinction between two aspects that are important for clinical interpretation: reliability and agreement.

#### *1.2.1. Agreement*

Agreement assesses how close the results of repeated measurements are to the 'true value' or the criterion value [10]. So, agreement actually concerns accuracy or validity; more specifically, concurrent validity. An instrument with good agreement will be able to produce accurate repeated measurements in the same person [10]. Thus, agreement parameters are important in instruments that are used for evaluative purposes. In evaluative measurement instruments, the variability between individuals in a population is not important, in comparison to the variability within an individual [10]. This is because, in some clinical settings, we want to detect differences or changes within the same individual, and not how much difference is the individual's value compared to another person's, or with the population. For example, in antenatal clinics we are interested in the weight gain of a mother throughout her pregnancy, and not how much her weight differs from the others'.

Agreement parameters estimate the *measurement error* in repeated measurements. When the measurement error is large, small changes cannot be distinguished from the measurement error [10]. The smaller the measurement error, the smaller the changes that can be detected beyond the measurement error, and the more appropriate the instrument is for evaluative purposes. Thus, for an instrument to be used to evaluate changes over time, such as changes in blood pressure after receiving antihypertensive therapy, it is important for us to ensure the agreement or the accuracy of the instrument.

#### *1.2.2. Reliability*

Reliability measures the extent to which the test results can be replicated [11]. For example, if we measure body weight using a scale five times, ideally all five measurements should be the same. Reliability is concerned with precision. It also represents the extent to which individuals can be distinguished from each other, despite the variability of repeated measurements in one person or subject (i.e. measurement error) [10]. In contrast with agreement, reliability measures the variability between people or subjects. This measurement tells us how well the measured value in one person can be distinguished from another [10]. Thus, reliability parameters are important when measurement instruments are used for discriminative purposes; for example, to decide whether a certain value is normal or abnormal, and when the measurement from the instrument is involved in important decisions, such as whether treatment is required or not.

In clinical practice, the cut-off for normal and abnormal values is usually well established by clinical guidelines, which are produced based on extensive reviews of available evidence. Reliable instruments should be able to provide values that will allow doctors or clinicians to distinguish whether their patients are in the normal or abnormal group. For instance, if we take the blood pressure of one patient five times, all the values should be almost the same, and the values should give us an idea whether the patient's blood pressure is normal or not.

An acceptable range of reliability will vary depending on the circumstances [5]. For example, if repeated measurements of a weighing scale are found to vary around the 'true' weight by 0.5 kg, the reliability of this weighing scale would be acceptable if the measurements are only to be done on an adult population, but not reliable when used to weigh new-born babies in the hospital. This is because differences of 0.5 kg in weight in an adult represent only a very small percentage of an adult body weight, and will not affect him or her clinically. In contrast, a difference of 0.5 kg represents a large proportion of body weight for a new-born baby.

#### **1.3. Agreement versus reliability**

Although the terms 'agreement' and 'reliability' carry different meanings, they are sometimes used interchangeably in the medical literature. To illustrate the concepts of agreement and reliability in simpler language, imagine that we have three target boards (see **Figure 1**) that show the results of five repeated measurements of the body weight of the same person, using three different scales (A, B and C). The centre of each board indicates the true value. **Figure 1(A)** shows that after taking five measurements using scale A, the results are scattered all over the target board. This suggests that the measurements are not near each other (poor reliability), and are not near their intended target or true value (poor agreement).

**Figure 1(B)** shows that all the five measurements from scale B appear in more or less the same location on the target board, but not in the centre of the target board. This suggests that five different measurements were almost the same (good reliability), but they did not hit the intended target (poor agreement). **Figure 1(C)** shows that all the five measurements from scale C are close to each other (good reliability), and hit the centre of the target board (good agreement).

In most clinical situations, we use the same instrument to evaluate changes over time and also to differentiate values from the normal or abnormal cut-off point (which is usually derived from population-based studies). One of the examples of this situation is in the screening of hypertension cases, and the assessment of the reduction of blood pressure post-treatment, in a clinic or health centre. Both blood pressure measurements are performed using the same blood pressure machine, or sphygmomanometer. So, agreement and reliability parameters are equally important in determining the quality of instruments. In fact, it is difficult to be certain about the agreement of an instrument if the instrument is not reliable. Similarly, a precise instrument or instrument with good reliability will not necessarily measure the 'true' value. Therefore, when comparing two instruments, or methods of measurement, we should consider assessing the *repeatability* of the instrument, which covers both agreement (accuracy) and reliability (precision).

**Figure 1.** Results of measurements of body weight using three different scales A, B and C.

## **2. Issues related to method comparison studies**


#### **2.1. Evaluation of agreement and reliability in a single study**

Agreement and reliability are both important in assessing the quality of instruments. An instrument with high agreement will not be useful if it is unreliable. Ideally, these parameters should be assessed together. However, recent systematic reviews showed that this is not commonly done in practice, especially in agreement studies [12, 13]. Most (71%) of the reliability studies also measured agreement [13], but only 30% of the agreement studies reviewed also assessed reliability [12]. Researchers tend to focus on one aspect of quality when validating instruments, although agreement and reliability studies may be conducted separately for the same instrument. Nonetheless, it is important to ensure the reliability of the instrument first, before testing for agreement, because it is impossible to assess the agreement of an unreliable instrument.

#### **2.2. Inappropriate application of statistical method**

Thousands of validation studies have been conducted in the past, and various statistical tests have been used to test for agreement and reliability [14–16]. Some of the methods that were used were inappropriate. The correlation coefficient (*r*), the coefficient of determination (*r*<sup>2</sup>), the regression coefficient and means comparison have all been shown to be inappropriate for the analysis in a method comparison study. This has been discussed by Altman and Bland since the 1980s [15], and also by Daly and Bourke [17]. The reasons why these methods are inappropriate for the analysis in a method comparison study will be discussed in Sections 3 and 4.

One example of the inappropriate application of statistical methods in a method comparison study is a study exploring the suitability of existing formulas for estimating the body surface area (BSA) of new-borns [18]. The authors compared different methods of estimating body surface area in new-borns and used the correlation coefficient to determine the agreement of those methods [18]. In one of their results, the authors described the method of estimating BSA using the BSA-Meban as most similar to the BSA-Mean, because it had a mathematically perfect correlation with *r* = 1.00 (*p* = 0.000) [18]. However, this conclusion was obviously inappropriate because the correlation coefficient only measures a linear relationship, and does not show that the two methods give similar results.

Another example of the inappropriate application of the Pearson correlation coefficient was demonstrated in one study conducted in Greece [19]. The authors aimed to assess the validity of a new motorised isometric dynamometer for measuring strength characteristics of elbow flexor muscles. They set the criteria of the Pearson correlation coefficient's (*r*) values >0.97 to demonstrate that high agreement occurred between measures, and with *r* = 0.986, they concluded that the new dynamometer was accurate [19].

The use of inappropriate methods for the assessment of agreement and reliability will, undoubtedly, result in an inappropriate interpretation of the results and conclusions on the quality of an instrument. Consequently, this might result in the application of invalid equipment in medical practice, and will jeopardise the quality of care given to patients. The proportion of studies with inappropriate statistical methods might reflect the proportion of medical instruments that have been validated using inappropriate methods in current clinical practice.

As found in recent systematic reviews, 19% of reliability studies [13] and 10% of agreement studies [12] used inappropriate methods, which means that there is a distinct possibility that some medical instruments or equipment used currently were validated using inappropriate methods, with consequently erroneous conclusions being drawn from these methods. This equipment, therefore, may not be as precise or accurate as believed, which could, potentially, affect the management of patients, the quality of care given to patients and, worse, it could cost lives.

Altman and Bland [15] proposed a method for agreement analysis in their original 1983 article. Later, they drew the attention of medical professionals to this area in an article in The Lancet [20]. That article [20] has been very highly cited [21]. The popularity of the Bland-Altman method is thought to be owing to its simplicity, practicality and ability to detect bias, when compared to other methods [16].

The issue of which method is best is still debatable, and almost all methods have been criticised, especially for the agreement study. Even the Bland-Altman method has been criticised. Hopkins [22] demonstrated that the Bland-Altman plot indicates, incorrectly, that there is systematic bias in the relationship between two measures. A recent study also showed that there is overestimation of bias in the Bland-Altman analysis [23].

#### **2.3. Application of multiple methods**

According to recent systematic reviews, most reliability studies (86%) relied on a single statistical method to assess reliability [13], in contrast with agreement studies, where most studies (65%) used a combination of statistical methods [12]. The application of multiple or combined methods, particularly in the assessment of agreement, suggests that there is no consensus among researchers on which statistical method is best for measuring agreement. One example of the application of multiple methods is a study testing the accuracy of peak flow meters [24]. In this study, the authors applied three statistical methods (Pearson's correlation coefficient, comparison of means (significance test) and the Bland-Altman method) to assess the agreement of peak flow meters [24].

A strong reason for using multiple methods in assessing agreement and reliability is that each statistical method has its strengths and weaknesses. The use of multiple methods in method comparison studies has the advantage of compensating for the limitations of any single method [14, 25], as long as the methods chosen are appropriate for the purpose. However, another possible reason for using multiple methods is the researcher's limited understanding of the statistical methods for agreement and reliability. This is probably the reason for the application of multiple inappropriate statistical methods in a single study; for example, the use of both the correlation coefficient and a significance test of the difference between means to test for agreement and reliability. Both of these methods have been clearly shown to be inappropriate statistical methods for assessing agreement and reliability [15, 17].

## **3. Most commonly used methods to assess agreement**

In 2012, Zaki et al. [12] reviewed the statistical methods used to measure the agreement of equipment measuring continuous variables in the medical literature. The most common method used to assess agreement was the Bland-Altman limits of agreement (LoA), followed by the correlation coefficient (*r*), comparing means, comparing slope and intercept, and the intra-class correlation coefficient. However, some of these methods are inappropriate for assessing agreement.

#### **3.1. Bland-Altman limits of agreement**


Bland-Altman limits of agreement were found to be the most commonly used method to assess agreement in the medical literature. In 1983, Bland and Altman introduced the limits of agreement (LoA) to quantify agreement [15]. They proposed a scatter plot of the differences between two measurements against the average of the two measurements, which later became a standard graphical presentation of agreement (see **Figure 2**). Bland and Altman [20] stated that it is very unlikely for two different methods or instruments to be exactly in agreement, or to give identical results for all individuals. However, what is important is how close the values obtained by the new method (predicted values) are to the gold standard method (actual values). This is because a very small difference between the predicted and the actual value will not affect decisions on patient management [20]. So they started with an estimation of the difference between measurements by the two methods or instruments [20]. To construct the limits of agreement, we first need to calculate the mean and standard deviation of these differences. The formula for the limits of agreement (LoA) is given as [20]:

$$\text{LoA} = \text{mean difference} \pm 1.96 \times \text{standard deviation of differences} \tag{1}$$

**Figure 2.** The Bland-Altman plot.

So, 95% of differences should lie within these limits. To illustrate this, we can use the data from **Table 1** (adapted from Table 12.5, interpretation and uses of medical statistics) [17], which compared the values from the glucometer and laboratory. If we apply the data from **Table 1**, the first step of the analysis is to calculate the difference and mean. The mean difference for the data is −0.28 mmol/l, and the standard deviation of difference is 0.27 mmol/l. This makes the LoA = −0.81 to 0.26 mmol/l.

Limits of agreement give us the range of how much one method is likely to differ from another. So it is all about the differences. If we are testing a new method B against the old method A, and the difference is calculated as A − B, then a positive value of the limits of agreement means A > B, or the new method B underestimates the old method A; a negative value means A < B, or the new method B overestimates the old method A. So, the result of the Bland-Altman analysis of the glucometer and laboratory values (**Table 1**) can be shown as follows:

$$\begin{aligned} \text{Differences} &= \text{Glucometer} - \text{Laboratory} \\ \text{Mean difference} &= -0.28 \text{ mmol/l} \\ \text{Limits of agreement} &= -0.81 \text{ to } 0.26 \text{ mmol/l} \end{aligned} \tag{2}$$

This means that, on average, the glucometer measures 0.28 mmol/l less than the laboratory. Also, 95% of the time the glucometer reading will be somewhere between 0.81 mmol/l below and 0.26 mmol/l above the laboratory values.


| Lab value (mmol/l) | Glucometer (mmol/l) |
| --- | --- |
| 10.20 | 10.20 |
| 8.20 | 8.00 |
| 8.70 | 8.05 |
| 9.60 | 9.70 |
| 9.60 | 9.05 |
| 8.20 | 8.15 |
| 9.40 | 8.80 |
| 7.00 | 6.55 |
| 6.60 | 6.55 |
| 10.80 | 10.50 |

**Table 1.** Hypothetical data of blood glucose level from a glucometer and laboratory.
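To make Eq. (1) concrete, the short sketch below computes the mean difference, the standard deviation of the differences and the 95% limits of agreement from the **Table 1** values; it is an independent check of the figures quoted above, not the analysis originally used.

```python
import statistics

lab        = [10.20, 8.20, 8.70, 9.60, 9.60, 8.20, 9.40, 7.00, 6.60, 10.80]
glucometer = [10.20, 8.00, 8.05, 9.70, 9.05, 8.15, 8.80, 6.55, 6.55, 10.50]

# Differences are taken as glucometer minus laboratory (the reference method).
diffs = [g - lab_val for g, lab_val in zip(glucometer, lab)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
lower = mean_diff - 1.96 * sd_diff
upper = mean_diff + 1.96 * sd_diff

print(f"Mean difference: {mean_diff:.3f} mmol/l")                 # ~ -0.275, reported as -0.28
print(f"SD of differences: {sd_diff:.3f} mmol/l")                 # ~ 0.274, reported as 0.27
print(f"Limits of agreement: {lower:.2f} to {upper:.2f} mmol/l")  # ~ -0.81 to 0.26
```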

#### **3.2. Correlation coefficient**


One of the favourite approaches in measuring agreement is to calculate the correlation coefficient (*r*) [11, 15, 26]. This method is the next most popular method after the Bland and Altman method [12]. The first approach in this analysis is to make a scatter plot, and then to calculate product-moment correlation coefficient [11]. To calculate the product moment correlation coefficient (*r*), variables for each pair of measurements are labelled as *X* and *Y*. The formula for the correlation coefficient *r* is given as:

$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^{2} - \frac{(\sum X)^{2}}{N}\right)\left(\sum Y^{2} - \frac{(\sum Y)^{2}}{N}\right)}} \tag{3}$$

If we use an example from data presented in **Table 1** (to compare blood sugar levels from glucometer and laboratory values), the Pearson correlation coefficient (*r*) is 0.9798 with a 95% confidence interval of 0.9139–0.9954, and *p*-value <0.0001 (analysis using SPSS 17.0 software). The null hypothesis here is that the measurements of blood glucose level by the two methods (glucometer and laboratory) are not related linearly. With a very small *p*-value, we can reject this null hypothesis and propose an alternative hypothesis: there is a linear relationship between the measurements of glucose level by the two methods (glucometer and laboratory). Some people will interpret this as being that there is an agreement between the two instruments. This is another mistake conducted by many researchers [15].

Correlation is a measure of association, and only measures the strength of linear relationship [11]. Strong correlation does not mean strong agreement. To demonstrate the inappropriate use of correlation, let's double the value of glucometer from **Table 1** so that it is obvious that there is no agreement between the glucometer and the laboratory value (see **Table 2**). Despite this, the correlation analysis of data from **Table 2** will give exactly the same Pearson correlation coefficient (*r*) of 0.9798, with a similar 95% confidence interval (CI) of 0.9139–0.9954. Of course, the two instruments (glucometer and laboratory measurement) do not agree, but the correlation coefficient value is still very high, suggesting a strong correlation or association.

Agreement is assessing a different aspect of relationship between two measurements as compared to the correlation coefficient. The correlation coefficient reflects the noises and the direction of a linear relationship [27, 28]. Perfect correlation occurs if all the points lie along any straight line (see **Figure 3**), and so data with poor agreement can produce a high or strong association [20, 29]. Furthermore, data covering an extensive (wide) range of values will appear to be more highly correlated than if it covers a narrow range [20]. Therefore, it is clear that correlation is not an appropriate method for testing agreement.


| Lab value | Glucometer | Glucometer ×2 |
| --- | --- | --- |
| 10.20 | 10.20 | 20.40 |
| 8.20 | 8.00 | 16.00 |
| 8.70 | 8.05 | 16.10 |
| 9.60 | 9.70 | 19.40 |
| 9.60 | 9.05 | 18.10 |
| 8.20 | 8.15 | 16.30 |
| 9.40 | 8.80 | 17.60 |
| 7.00 | 6.55 | 13.10 |
| 6.60 | 6.55 | 13.10 |
| 10.80 | 10.50 | 21.00 |

**Table 2.** Hypothetical data of blood glucose value.
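Equation (3) can be applied directly to the **Table 1** and **Table 2** data to verify this point: doubling the glucometer readings destroys the agreement but leaves *r* unchanged. A brief sketch follows (the 0.9798 quoted above came from SPSS; this is only an independent check):

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient, as in Eq. (3)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (sxy - sx * sy / n) / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))


lab        = [10.20, 8.20, 8.70, 9.60, 9.60, 8.20, 9.40, 7.00, 6.60, 10.80]
glucometer = [10.20, 8.00, 8.05, 9.70, 9.05, 8.15, 8.80, 6.55, 6.55, 10.50]
doubled    = [2 * g for g in glucometer]   # the "Glucometer x2" column of Table 2

print(round(pearson_r(lab, glucometer), 4))  # ~0.9798
print(round(pearson_r(lab, doubled), 4))     # identical ~0.9798, despite obvious disagreement
```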

**Figure 3.** Correlation coefficient values, the noises and direction of a linear relationship [29].

Some people use the *coefficient of determination* (*r*<sup>2</sup>) as a measure of agreement. One example of the application of this method is a recent study [30] evaluating the accuracy of point-of-care glucose analysers in the Saudi Arabian market. The authors compared the blood glucose readings from five different types of glucose analysers with the results from laboratory analysis. Their aim was to test the accuracy of the devices. In one of their results, the authors described the Nova StatStrip device as showing excellent performance that almost agreed and correlated perfectly with the lab results, because *r*<sup>2</sup> = 0.99 [30]. However, the use of the *coefficient of determination* (*r*<sup>2</sup>) is inappropriate because *r*<sup>2</sup> is obtained from the correlation coefficient *r*, which is the wrong method for measuring agreement. The *coefficient of determination* (*r*<sup>2</sup>) states the proportion of variance in the dependent variable that is explained by the regression equation or model [17]. The more closely the points are dispersed around the regression line in the scatter plot, the higher the proportion of variation explained by the regression line, and thus the greater the value of *r*<sup>2</sup> [11]. So it applies a similar concept to the correlation coefficient.

#### **3.3. Comparing means**


The third most popular method used is comparing means of readings from two instruments [12]. In this method, the means of readings from two instruments are compared. The test of significance is then carried out to test the null hypothesis that there is no difference between the means of readings from the two instruments.

In assessing agreement, the same measurement of similar subjects will be taken using different instruments. Therefore, a paired *t*-test is usually used to test the hypothesis. Here, we want to know whether the difference observed is the true difference or has only occurred by chance when there was really no difference in the population. If the difference is truly occurring, and the null hypothesis is not true, then the alternative hypothesis must be true. So, in this case, the alternative hypothesis is that there is a significant difference between the mean of reading from the two instruments.

People have interpreted non-significance results to mean that there is not enough evidence to show that the two means differ (i.e. no differences), thus there is an agreement between the two groups, and vice versa. An example of this inappropriate approach is in a study conducted in Sweden on the assessment of left ventricular volumes, using simplified 3-D echocardiography and computed tomography [31]. However, the paired t-test with non-significant results does not indicate agreement. The reason for this is that the value of mean is affected by the value of each piece of data, especially when there is an outlier. Distribution of differences between the instruments can lead to a difference in means being non-significant. It is possible that poor agreement between the two instruments can be hidden in the distribution of differences, and thus the two methods can appear to agree [17]. To illustrate this example, we have a hypothetical dataset comparing the measurements from standard instrument A, with the new instruments B and C (**Table 3**).

**Table 3.** Hypothetical dataset for instruments A, B and C.

From the dataset (**Table 3**), it is obvious that the two new instruments (B and C) do not agree with the standard instrument A. The mean and standard deviation for the three groups are all the same: the mean is equal to 3.0 and the standard deviation is equal to 1.58. If we compare the readings from instruments A and B using a paired *t*-test, the results will be:

$$\begin{aligned} \text{Mean difference (confidence interval)} &= 0 \ (-1.24 \text{ to } 1.24) \\ \text{Standard deviation of differences} &= 1.0 \\ p\text{-value} &= 1.0 \end{aligned} \tag{4}$$

So, from this analysis, we can conclude that there is no difference between the mean readings of instruments A and B. If we say that non-significant results indicate agreement, this suggests that there is agreement between instruments A and B. However, we know that this is not true. Similarly, the result will not be significant when we compare the mean readings of instruments A and C, where the results will be:

$$\begin{aligned}
\text{Mean difference (confidence interval)} &= 0\ (-3.93 \text{ to } 3.93)\\
\text{Standard deviation of differences} &= 3.16\\
p\text{-value} &= 1.0
\end{aligned} \tag{5}$$

Again, this does not suggest that there is an agreement between instruments A and C. The inappropriate application of the test of significance, as a test for agreement, has also been discussed earlier in the article by Altman and Bland [15]. What matters in agreement is that each reading from the standard instrument should be repeated by the second instrument. We are not interested in the mean of readings by each instrument, but are interested in each individual reading. Therefore, comparing means using a significance test is not an appropriate method for assessing agreement.
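To make this concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available) that runs paired *t*-tests on the hypothetical Table 3 data; both comparisons return *p* = 1.0 even though instrument C disagrees with A for every patient.

```python
import numpy as np
from scipy import stats

# Hypothetical Table 3 data: standard instrument A vs. new instruments B and C
A = np.array([1, 2, 3, 4, 5])
B = np.array([1, 3, 2, 5, 4])
C = np.array([5, 4, 3, 2, 1])

for name, new in (("B", B), ("C", C)):
    diff = A - new
    t, p = stats.ttest_rel(A, new)          # paired t-test
    print(f"A vs {name}: mean diff = {diff.mean():.2f}, "
          f"SD of diffs = {diff.std(ddof=1):.2f}, p = {p:.2f}")

# Both comparisons give a mean difference of 0 and p = 1.0, yet C clearly
# disagrees with A on every patient: a non-significant paired t-test
# says nothing about agreement.
```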

#### **3.4. Intra-class correlation coefficient**

The intra-class correlation coefficient (ICC) is another popular method used to test for agreement [12]. The ICC was devised initially to assess the relationship between variables within classes, or reliability. However, it was then used to assess agreement, to avoid the problem of a linear relationship being mistaken for agreement in the product moment correlation coefficient (*r*) [26, 32]. Different assignments of measurements to *X* and *Y* in the calculation of the correlation coefficient (*r*) would produce different values of *r*. To overcome some of the limitations of the correlation coefficient (*r*), the ICC averages the correlations among all the possible orderings of the pairs [33]. The ICC also extends to more than two observations, in contrast with the correlation coefficient (*r*) [11]. In general, the ICC is a ratio of two variances:

$$\text{ICC} = \frac{\text{Variance owing to rated subjects}}{\text{Variance owing to subjects} + \text{Error}} \tag{6}$$

The value of the ICC can theoretically vary from 0 to 1, where 0 indicates no reliability or disagreement in the agreement study. The ICC of one indicates perfect reliability, or perfect agreement. There are different types of ICC that have been described by Shrout and Fleiss [33]. McGraw and Wong [35] expanded the Shrout and Fleiss system to include two more general forms of ICC. Weir [34] summarised different types of ICC, based on models introduced by Shrout and Fleiss [33] and McGraw and Wong [35] (see **Table 4**).


*MSB*, Between-subjects mean square; *MSE*, error mean square; *MSS* , subjects mean square; *MST* , trials mean square; *MSW*, within-subjects mean square.

**Table 4.** Different types of ICC.


Shrout and Fleiss [33] suggested three main models: Model 1 is a one-way fixed model; Model 2 is a two-way random model; and Model 3 is a two-way fixed model. The model is represented in the format of ICC (a, b). The value of '*a*' can be 1, 2 or 3 (this depends on the three main models). For value '*b*', when *b* = 1, this suggests single measures ICC and *b* = *k* suggests averaged measures ICC [34].

In the ICC model suggested by McGraw and Wong [35], the designation '*C'* refers to consistency and '*A'* refers to absolute agreement. The '*A'* model considers both fixed and systematic error, whereas the '*C*' model only considers fixed error [34, 35].

Although a total of 10 ICC models were summarised by Weir [34], there are similarities in some of the ICC formula for different types of ICC. This is because the difference between the random model and the fixed model is not in the calculation but in the interpretation of the ICC [35]. According to Shrout and Fleiss [33], there is only one ICC that measures the extent of absolute agreement, and that is ICC (2, 1), which is based on the two-way random-effects ANOVA (analysis of variances) [14, 33]. This model is similar to ICC (*A*, 1), as suggested by McGraw and Wong [35].

The ICC (*C*, 1) for consistency simply compares the consistency between trials. For example, the hypothetical data in **Table 5** produce ICC (*C*, 1) = 1.0, which is interpreted as perfect agreement. However, the absolute agreement ICC, or ICC (*A*, 1), considers both the consistency between trials and the agreement between ratings. So, the same pairs of data from **Table 5** produce ICC (*A*, 1) = 0.67, which suggests some degree of disagreement (or moderate agreement).


| Patient | First reading | Second reading |
|---|---|---|
| 1 | 2 | 4 |
| 2 | 4 | 6 |
| 3 | 6 | 8 |

**Table 5.** Hypothetical dataset of repeated measurements from instrument A.

However, the use of ICC in assessing agreement has been criticised by Bland and Altman [32]. In testing the agreement of instruments, the new method will usually be compared to the standard instrument [32]. The aim of testing is to ensure that the new method will produce the same measurements as the standard instrument (i.e. good agreement). This can also mean that the new method is designed to provide similar predictions of measurement as the standard instrument. So, there is clear ordering of the two variables, where the measurements from the standard instrument are usually denoted as *X* and measurements of the new method are denoted as *Y*.

The ICC also ignores the ordering and treats both methods as a random sample from a population of methods [32]. In an agreement study, there are two specific methods that will be compared, not two instruments chosen at random from some population. Another assumption in the ICC model, which is quite unjustified in methods comparison study, is that the measurement error of both methods has to be the same [32]. The main purpose in testing agreement is to identify the measurement error of the new instrument in comparison to the standard instrument. Another issue with ICC is that it is influenced by the range of data. If the variance between subjects is high, the reliability will certainly appear to be high [14].

#### **3.5. Comparing slopes and** *y***-intercepts**

Often, in testing for agreement, the slope is tested against one. The argument is that if the two methods or instruments are equivalent (i.e. if both instruments measure the same variable on the same subject, they will give the same reading), then the slope of the straight line will be one [15]. A straight-line equation shows the relationship between two variables and can be expressed as *y* = *α* + *βx*, where '*y*' is the predicted or expected value for any given value of '*x*', '*α*' is the intercept of the straight line with the *y*-axis and '*β*' is the slope. The values of both '*α*' and '*β*' are constants. The slope '*β*' is also called the *regression coefficient*, and measures the amount of change in the '*y*' variable for a unit change in '*x*' [11]. So, if instrument A measures '*y*' and instrument B measures '*x*', and if *y* = *x*, the slope of the straight line is equal to one. It is true that the straight line *y* = *x* will always have a slope of 1. However, the reverse is not always true, because a line with a slope of 1 could be *y* = *x*, or could be *y* = *α* + *x*. Therefore, testing whether the slope is equal to 1 is also an inappropriate method of testing agreement. After testing the slope against 1, some people proceed to test the *y*-intercept. Theoretically, if the slope is 1 and the *y*-intercept is 0, then *y* will be equal to *x* (*y* = *x*). However, testing both slope and intercept to assess agreement is not as popular as other methods.
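As an illustration, the following Python sketch (assuming NumPy and SciPy; the data and variable names are hypothetical) fits an ordinary least-squares line to paired readings and tests the slope against 1 and the intercept against 0.

```python
import numpy as np
from scipy import stats

# Hypothetical paired readings: x from the standard instrument, y from the new one.
# Here y is roughly x + 2, so the slope is close to 1 even though the instruments disagree.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = x + 2.0 + np.array([-0.1, 0.1, 0.0, -0.1, 0.1])   # small measurement noise

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()

residuals = y - (intercept + slope * x)
s2 = np.sum(residuals ** 2) / (n - 2)              # residual variance
se_slope = np.sqrt(s2 / sxx)
se_intercept = np.sqrt(s2 * (1 / n + x.mean() ** 2 / sxx))

t_slope = (slope - 1.0) / se_slope                 # H0: slope = 1
t_intercept = intercept / se_intercept             # H0: intercept = 0
p_slope = 2 * stats.t.sf(abs(t_slope), df=n - 2)
p_intercept = 2 * stats.t.sf(abs(t_intercept), df=n - 2)

print(f"slope = {slope:.3f} (p vs 1: {p_slope:.3f}), "
      f"intercept = {intercept:.3f} (p vs 0: {p_intercept:.3f})")
# The slope test alone is non-significant here, yet the systematic offset of
# about 2 units shows the instruments do not agree.
```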

Bland and Altman [36] also suggested that the old measurement (*y*) can be regressed on the new measurement (*x*), and then one can calculate the standard error of a prediction of the old value from the new. This can be used to estimate predicted value from old measurements for any observed value of new measurement, with a confidence interval, which is also known as a *prediction interval* [36]. However, the problem is that the prediction interval is not constant; it is smaller in the middle, and wider towards the extremes [36].

## **4. Most commonly used methods to assess reliability**

Various methods have also been used to estimate reliability. A recent review of reliability studies in medicine found that the most popular methods include the intra-class correlation coefficient, comparing means, the Bland-Altman limits of agreement, and the correlation coefficient (*r*).

## **4.1. Intra-class correlation coefficient**


The intra-class correlation coefficient (ICC) is the most popular method used to assess the reliability of medical instruments [13]. The ICC was originally proposed by Fisher [37, 38]. He was a statistician from England, and Fisher's exact test was one of his well-known contributions to statistics [37, 39]. The earliest ICCs were modifications of the Pearson correlation coefficient [34]. However, the modern version of the ICC is calculated using variance estimates obtained from the analysis of variance (ANOVA), by partitioning the total variance into between-subject and within-subject variance [14].

The general formula for ICC is given as [34]:

$$\text{ICC} = \frac{\text{Subject variability}\ (\delta_S^2)}{\text{Subject variability}\ (\delta_S^2) + \text{Measurement error}\ (\delta_E^2)} \tag{7}$$

Values obtained from ANOVA table:

$$\begin{aligned}
\text{Measurement error, } \delta_E^2 &= \text{Mean square of error, } MS_E\\
\text{Subject variability, } \delta_S^2 &= \frac{\text{Mean square of subjects, } MS_S - \text{Mean square of error, } MS_E}{\text{Number of observations}}
\end{aligned} \tag{8}$$

The ICC does not assume any ordering of the repeated measures and can be applied to more than two repeated measurements [5]. Since the ICC is a ratio of variances derived from ANOVA, it is unitless. The closer the ratio is to 1.0, the higher the reliability [34]. Suppose, for example, that we measure the carbon monoxide level of 10 patients three times using the same instrument. The hypothetical data are shown in **Table 6**. From the data, an ANOVA table can then be developed as in **Table 7**.


| Patient | First reading | Second reading | Third reading | Mean |
|---|---|---|---|---|
| 1 | 6 | 7 | 8 | 7 |
| 2 | 4 | 5 | 6 | 5 |
| 3 | 2 | 2 | 2 | 2 |
| 4 | 3 | 4 | 5 | 4 |
| 5 | 5 | 4 | 6 | 5 |
| 6 | 8 | 9 | 10 | 9 |
| 7 | 5 | 7 | 9 | 7 |
| 8 | 6 | 7 | 8 | 7 |
| 9 | 4 | 6 | 8 | 6 |
| 10 | 7 | 9 | 8 | 8 |

**Table 6.** Hypothetical data of repeated measurements of carbon monoxide level.

| Source of variation | Sum of squares | Degree of freedom | Mean square |
|---|---|---|---|
| Patients | 114 | 9 | 12.67 |
| Raters/instrument | 20 | 2 | 10 |
| Error | 10 | 18 | 0.56 |
| Total | 144 | 29 | |

**Table 7.** Analysis of variance summary table.

From **Table 7**, the value of ICC can be calculated:

$$\begin{aligned}
\text{Measurement error, } \delta_E^2 &= MS_E = 0.56\\
\text{Subject variability, } \delta_S^2 &= \frac{MS_S - MS_E}{\text{Number of observations}} = \frac{12.67 - 0.56}{3} = 4.04\\
\text{ICC} &= \frac{4.04}{4.04 + 0.56} = 0.88
\end{aligned} \tag{9}$$

The interpretation is that 88% of the variance in the measurements results from the 'true' variance among patients. However, note that this is according to the 'classical' definition of reliability. There are different forms of ICC depending on various assumptions or criteria. Chinn [40] recommended that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful [40], whereas Rosner [41] suggested the interpretation of ICC as shown in **Table 8**.


**Table 8.** Interpretation of ICC.
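As a sketch of the calculation in Eq. (9), the following Python code (assuming NumPy; it reproduces only the simple subjects-versus-error form used above, not every ICC variant in Table 4) computes the mean squares and the ICC for the Table 6 data.

```python
import numpy as np

# Table 6: rows = 10 patients, columns = 3 repeated carbon monoxide readings
readings = np.array([
    [6, 7, 8], [4, 5, 6], [2, 2, 2], [3, 4, 5], [5, 4, 6],
    [8, 9, 10], [5, 7, 9], [6, 7, 8], [4, 6, 8], [7, 9, 8],
], dtype=float)

n_subjects, k = readings.shape
grand_mean = readings.mean()

# Two-way ANOVA decomposition (subjects, raters/trials, error), as in Table 7
ss_subjects = k * np.sum((readings.mean(axis=1) - grand_mean) ** 2)
ss_raters = n_subjects * np.sum((readings.mean(axis=0) - grand_mean) ** 2)
ss_total = np.sum((readings - grand_mean) ** 2)
ss_error = ss_total - ss_subjects - ss_raters

ms_subjects = ss_subjects / (n_subjects - 1)                  # 12.67
ms_error = ss_error / ((n_subjects - 1) * (k - 1))            # 0.56

subject_variability = (ms_subjects - ms_error) / k            # 4.04
icc = subject_variability / (subject_variability + ms_error)  # 0.88
print(f"MS_S = {ms_subjects:.2f}, MS_E = {ms_error:.2f}, ICC = {icc:.2f}")
```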


## **4.2. Comparing means/mean difference**

The second most popular method used to assess reliability is to compare the means of two sets of measurements (either using a *t*-test or looking at the mean difference) [13]. Since reliability involves repeated measurement of the same subject, a paired *t*-test is usually applied. However, the paired *t*-test only gives information about differences between the means of two sets of data, and not about individual differences [14]. As explained in Section 3.3 on assessing agreement, comparing means is also not a suitable method for assessing reliability.

## **4.3. Bland-Altman method**

The Bland-Altman limits of agreement (LoA) have also been used as a method to assess reliability. Bland and Altman [20] suggested that the LoA are suitable for the analysis of repeatability of a single measurement method. However, the use of the LoA to evaluate reliability has been criticised, as it only estimates reliability when there are two observations for each subject [20]. This breaches the concept of reliability, which allows repeated (more than two) observations per subject [11]. Although Bland and Altman [42] suggested methods to deal with multiple measurements in calculating the LoA, this approach is more suitable for the analysis of agreement rather than reliability. They proposed calculating the mean of the replicated measurements by each instrument for each subject [42]; these pairs of means could then be used to compare the two instruments using the limits of agreement [42]. The use of the LoA in the analysis of reliability has also been criticised by Hopkins [43], who gave reasons why the LoA is not the best method for reliability analysis. According to Hopkins [43], the values of the LoA can be biased by up to 21%, depending on the degrees of freedom of the reliability study (i.e. the number of participants and trials).
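For reference, a minimal Python sketch of the Bland-Altman limits of agreement (assuming NumPy, and reusing the hypothetical glucometer data from Table 2) is shown below; it computes the mean difference and the usual mean ± 1.96 SD limits for a single pair of measurements per subject.

```python
import numpy as np

# Hypothetical Table 2 data: laboratory value vs. glucometer reading
lab = np.array([10.20, 8.20, 8.70, 9.60, 9.60, 8.20, 9.40, 7.00, 6.60, 10.80])
glucometer = np.array([10.20, 8.00, 8.05, 9.70, 9.05, 8.15, 8.80, 6.55, 6.55, 10.50])

diff = glucometer - lab
bias = diff.mean()                      # mean difference (systematic bias)
sd = diff.std(ddof=1)                   # SD of the differences
loa_lower, loa_upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"bias = {bias:.2f}, 95% limits of agreement: "
      f"{loa_lower:.2f} to {loa_upper:.2f}")
# In a Bland-Altman plot, diff would be plotted against (lab + glucometer) / 2,
# with horizontal lines at the bias and at the two limits.
```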

## **4.4. Correlation coefficient**

As discussed in Section 3.2, the correlation coefficient provides information about the association and the strength of linear relationship. Correlation will not detect any systematic or fixed errors, and it is possible to have two sets of scores that are highly correlated, but not repeatable [14]. Therefore, it is recommended that the correlation coefficient should not be used in isolation for measuring reliability [14, 44]. Furthermore, the correlation coefficient also breaches the concept of reliability, as it only estimates reliability when there are only two observations for each subject [11].

## **5. Summary**

Agreement signifies the accuracy of certain instruments, whereas reliability indicates precision. It is imperative that all medical instruments are accurate and precise. Otherwise, a failure may lead to critical medical errors. Therefore, there is a necessity for the proper evaluation of all medical instruments, and it is important to be sure that the appropriate statistical method has been used. Preferably, agreement and reliability should be assessed together in a validation study.

Simplicity, practicality (or interpretability) and ability of a certain method to detect systematic bias are among the important factors when choosing a method to evaluate agreement. Nonetheless, the ability of the method in detecting bias is still the priority. The use of multiple methods has the advantage of compensating for the limitations of any single method. However, the application of multiple inappropriate statistical methods should be avoided.

Finally, inappropriate analysis in method comparison studies is a cause for concern in the medical field and cannot be ignored. It is important for medical researchers and clinicians from all specialties to be aware of this issue, because inappropriate statistical analyses will lead to inappropriate conclusions, jeopardising the quality of the evidence, which may in turn influence the quality of care given to patients. Educating medical researchers on methods for validation studies, together with clear recommendations and guidelines on how to perform the analysis, will improve their knowledge in this area and help reduce the problem of inappropriate statistical analysis. It is also important to involve statisticians, who understand the various statistical methods in depth, in medical education programmes. Consulting statisticians, or inviting them to be part of the medical research team, could also help to reduce mistakes in statistical analysis.

## **Acknowledgements**

I would like to express my utmost gratitude to Prof. Dr. Awang Bulgiba and Prof. Dr. Noor Azina Ismail from the University of Malaya for their support, guidance and assistance when I was conducting research in this area, and for their motivation and support in my career development.

## **Author details**

Rafdzah Zaki

Address all correspondence to: rafdzah@hotmail.com

Department of Social & Preventive Medicine, Faculty of Medicine, Julius Centre University of Malaya, Kuala Lumpur, Malaysia

## **References**



[17] Daly, L.E, Bourke, G.J. Interpretation and use of medical statistics. 5th ed. Oxford: Blackwell Science Ltd.; 2000.

[18] Ahn, Y, Garruto, R.M. Estimations of body surface area in newborns. Acta Paediatrica (Oslo, Norway: 1992). 2008;**97**(3):366–370.

[19] Milias, G.A, Antonopoulou, S, Anthanasopoulos, S. Development, reliability and validity of a new motorized isometric dynamometer for measuring strength characteristics of elbow flexor muscles. Journal of Medical Engineering & Technology. 2008;**32**(1):66–72.

[20] Bland, J.M, Altman, D.G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;**1**(8476):307–310.

[21] Bland, J.M, Altman, D.G. Agreed statistics: measurement method comparison. Anesthesiology. 2012;**116**(1):182–185.

[22] Hopkins, W.G. Bias in Bland-Altman but not regression validity analyses. Sportscience. 2004;**8**:42–46.

[23] Zaki, R, Bulgiba, A, Ismail, N.A. Testing the agreement of medical instruments: overestimation of bias in the Bland-Altman analysis. Preventive Medicine. 2013;**57**(Suppl):S80–S82.

[24] Nazir, Z, et al. Revisiting the accuracy of peak flow meters: a double-blind study using formal methods of agreement. Respiratory Medicine. 2005;**99**:592–595.

[25] Luiz, R.R, Szklo, M. More than one statistical strategy to assess agreement of quantitative measurements may usefully be reported. Journal of Clinical Epidemiology. 2005;**58**(4):215–216.

[26] Lee, J, Koh, D, Ong, C.N. Statistical evaluation of agreement between two methods for measuring a quantitative variable. Computers in Biology and Medicine. 1989;**19**(1):61–70.

[27] Lin, L.I. Total deviation index for measuring individual agreement with applications in laboratory performance and bioequivalence. Statistics in Medicine. 2000;**19**(2):255–270.

[28] Bland, J.M. An Introduction to medical statistics. 2nd ed. Oxford: Oxford University Press; 1995.

[29] Wikipedia, The Free Encyclopedia. Correlation [Internet]. Available from: http://en.wikipedia.org/wiki/Correlation. [Accessed: 30 January 2009]

[30] Hanbazaza, S.M, Mansoor, I. Accuracy evaluation of point-of-care glucose analyzers in the Saudi market. Saudi Medical Journal. 2012;**33**(1):91–92.

[31] Mårtensson, M, Winter, R, Cederlund, K, Ripsweden, J, Mir-Akbari, H, Nowak, J, Brodin, L. Assessment of left ventricular volumes using simplified 3-D echocardiography and computed tomography – a phantom and clinical study. Cardiovascular Ultrasound. 2008;**6**(26). DOI: 10.1186/1476-7120-6-26.

[32] Bland, J.M, Altman, D.G. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Computers in Biology and Medicine. 1990;**20**(5):337–340.


## **On Decoding Brain Electrocorticography Data for Volitional Movement Intention Prediction: Theory and On-Chip Implementation**

Mradul Agrawal, Sandeep Vidyashankar and Ke Huang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66149

#### **Abstract**

Brain-computer interface (BCI) has recently received an unprecedented level of consideration and appreciation in medical applications, such as augmentation and repair of human cognitive or sensorimotor activities. Brain signals such as the electroencephalogram (EEG) or electrocorticography (ECoG) can be used to generate stimuli or control devices through decoding, translating, and actuating; this communication between the brain and a computer is known as BCI. Moreover, signals from sensors can be transmitted to a person's brain, enabling them to see, hear, or feel from sensory inputs. This two-way communication is referred to as bidirectional brain-computer interface (BBCI). In this work, we propose a field-programmable gate array (FPGA)-based on-chip implementation of two important data processing blocks in BCI systems, namely, feature extraction and decoding. Experimental results showed that our proposed architecture can achieve high prediction accuracy for decoding volitional movement intentions from ECoG data.

**Keywords:** ECoG data decoding, brain-computer interface, volitional movement intention prediction, FPGA implementation

## **1. Introduction**

Brain-computer interface has developed immensely in recent times. It has reached a point where a subject can use data collected from their brain to actually control external devices. This process involves feature extraction, decoding, signal processing, and actuating [1].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the last few years, BCIs have received a lot of recognition in various fields apart from the medical industry; they have gained considerable popularity in entertainment, for example in the gaming industry. Electrical signals from the brain can be recorded either noninvasively, as electroencephalography (EEG), or invasively, as electrocorticography (ECoG) signals. These electrical signals can be collected over time and decoded to learn more about brain signals and brain activity. EEG signals are recorded by placing electrodes along the scalp. They can be used to diagnose multiple conditions such as coma, epilepsy, sleep disorders, etc. ECoG signals are recorded on the cerebral cortex; in this case, the electrodes are implanted in the brain in order to record brain activity. Since ECoG signals are recorded invasively using implanted electrodes, this type of recording has the advantage of higher spatial resolution and a higher sampling rate than its counterpart. However, since the electrodes are implanted in the brain, this setup requires a surgeon to operate on the subject and place the electrodes inside the skull. This unique feature of ECoG data has made it more suitable for BCI applications that mainly focus on the restoration of sensorimotor functions. A suitable example of this scenario is shown in [2], where a subject with severe motor disabilities is able to control a prosthesis using ECoG data recorded from the subject's brain activity. The ECoG data obtained from the electrodes have to be processed first in order to interpret the information contained in them and to decode volitional movements. Techniques such as time-frequency analysis, including power-spectrum analysis and fast Fourier transforms, have been proposed in Ref. [2]. Neuroscientists favour such off-line techniques for decoding ECoG cortical data because they allow brain activity to be studied in depth and mapped to the volitional movements that were intended. In Ref. [3], dynamic mode decomposition is proposed, which is also an off-line decoding technique. However, such off-line techniques cannot be used in applications focused on the restoration of sensorimotor functions, as these have to work on the fly and in real time. Restoring voluntary movements by sending spinal stimulation data to paraplegics, or by enabling prosthetic control, requires real-time signal decoding. Such real-time signal decoding circuits and data processing blocks provide the right platform for the abovementioned applications. Any such system would have (i) an analog front-end circuit, which is used to amplify the raw recorded signals and filter out noise; (ii) an analog-to-digital converter (ADC), which, as the name suggests, converts the incoming raw analog signals to digital format; (iii) a data processing block, which is the most vital of them all, as it decodes the digitized brain signals into an interpretable format, for example, a movement intention; and (iv) stimulator back-end circuits, which perform the actions or movements predicted by the data processing block. These actions could include enabling prosthetic actuators or delivering spinal stimulations for voluntary movements.

## **2. On-chip computing in BCI applications**


Decoding raw ECoG signals to restore sensorimotor function has been a great motivation. The most advanced techniques, comprising time-frequency analysis with power-spectrum analysis, fast Fourier transform [2], and dynamic mode decomposition [3], are used to decode unprocessed ECoG signals. External computational resources that process the data received either from the electrodes or from sensor feedback can generate the stimulus signal or trigger the prosthetic control efficiently. This type of off-line analysis and decoding of ECoG data is, however, very slow and not suitable for decoding in real time. Moreover, because these off-line computational resources are very complex to implement in terms of area and power consumption, they are impractical for portable applications. Hence, it is essential to design an on-chip decoding system which is handy, power efficient, and also fast enough for real-time use. Many different on-chip techniques have been presented for decoding ECoG-recorded signals. An interface between an implanted chip and recording and stimulating electrodes was proposed in [4], which is portable and operates independently. In Refs. [5–8], various on-chip signal recording and processing models can be seen; the action potential of a basic computing unit (a neuron), which produces the electrical stimuli, is detected by means of a time-amplitude discriminator, and the focus is mainly on recording and processing (amplifying, filtering, etc.) cortically recorded signals to trigger the action stimuli. Another on-chip implementation is based on a look-up table (LUT) that generates the corresponding stimulus action (such as eye blinking) by classifying the extracted, amplified, and filtered brain signals [7]. Moreover, in Ref. [9], a discrete cosine transform and a linear classifier are implemented on hardware to decode ECoG movement intentions. As explained in Ref. [4], a large number of neurons on the cortical surface control hand movements, and these hand movements usually occur in a high-dimensional space; hence, capturing a typical motor behavior range is still a challenge. Since classification based on look-up tables and linear classifiers is narrow, developing a better and more versatile on-chip classification method that can take care of more complex tasks is essential.

**Figure 1.** Overview of the BCI system in which the proposed framework is implemented.

The design and implementation of the data processing block are the main focus of this work. The proposed framework is highlighted in **Figure 1**, which shows an outline of a typical BCI system. The signals are the ECoG signals, recorded from electrodes placed invasively on the cerebral cortex. There are several parameters that have to be taken care of when designing such circuits: mainly, the area occupied on the chip should be small, the power consumption low, and the design should be resistant to temperature and voltage variations, within the limitations of the system on which it is developed. Thus, it is not feasible to build such hardware-hungry off-line decoding schemes for real-time BCI applications. Our work involves a low-power and area-efficient hardware realization of principal component analysis (PCA) and a multilayer perceptron (MLP), implemented on an FPGA to show the usefulness of an on-chip data decoding model for BCI applications. Using openly accessible ECoG recordings, the experimental results from the FPGA show an accuracy of 80% for predicting single-finger movement.

## **3. Feature extraction: principal component analysis**

Feature extraction is a process whose aim is to reduce the dimensionality and complexity of the dataset to a smaller number of dimensions carrying the largest amount of information possible. For BCI applications, it is an important requirement that the computational complexity of the system is very low. It must also be robust against noise and should depend only on historical data samples. Brain signals depend on the different thinking activities that occur in the brain. BCI is considered to be a pattern recognition system that differentiates between different patterns and classifies them into different classes based on features. The features extracted in BCI reflect not only the similarities to a certain class but also the differences from the rest of the classes. The features are measured from properties of the signal that contain the information needed to distinguish between different classes.

Feature extraction or dimensionality reduction techniques such as PCA or independent component analysis can be applied to reduce the dimensions of the original brain signal data, helping to remove irrelevant and redundant information. Such techniques also reduce the overall computational cost.

## **3.1. Principal component analysis (PCA)**

PCA is an effective and powerful tool for analyzing data and finding patterns in it. It is used for data compression, and it is a form of unsupervised learning. Dimensionality reduction methods can significantly simplify and improve process monitoring procedures by projecting the data from a higher dimensional space to a lower dimensional space that accurately characterizes the state of the process. PCA is a dimensionality reduction technique which produces a lower dimensional representation of the given data in such a way that the correlation between the process variables is conserved, while covering the maximum possible variance in the data. The projection of higher dimensional data to a lower dimensional space, as explained before, happens in a least-squares sense; small inconsistencies in the data are ignored, and only large inconsistencies are considered.
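As a small illustration of this idea, the following Python sketch (assuming NumPy; the data are random and purely illustrative) computes principal components from the covariance matrix via an eigendecomposition and projects the data onto the first few components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data matrix: k samples x n features (e.g. ECoG channels)
X = rng.normal(size=(500, 8))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]      # introduce correlation between features

Xc = X - X.mean(axis=0)                      # centre each feature
cov = np.cov(Xc, rowvar=False)               # n x n covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]            # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 3                                        # keep the first p principal components
scores = Xc @ eigvecs[:, :p]                 # lower-dimensional representation

explained = eigvals[:p] / eigvals.sum()
print("variance explained by first", p, "components:", np.round(explained, 3))
```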

#### **3.2. Characteristics of principal components**


The second principal component has two characteristics:

	- **a.** This component covers most of the variance that was unaccounted for in the first principal component, which means that the second component will be correlated with most of the variables that did not display strong correlation with the first component.
	- **b.** The second characteristic is that it is completely uncorrelated with the first component. The correlation between the two will be zero when it is matched.

Each subsequent principal component has analogous characteristics:

	- **a.** Each component calculated will account for a maximum of the variance of the variables that was not covered by the preceding components.
	- **b.** Each component will be uncorrelated with all the preceding components calculated.

From all the above characteristics, it is clear that each new component accounts for a progressively smaller amount of variance, which explains why only the first few components are generally considered for data analysis and interpretation. When the analysis is complete and all the principal components have been obtained, each component will display varying amounts of correlation with the input variables, but the components are all completely uncorrelated with each other.

## **4. Feature decoding: artificial neural networks**

The ECoG-recorded data need to be decoded to be able to detect the intended movement. Once the dimensionality of data is reduced by PCA, we can further decode the data to trigger external devices. Artificial neural network (ANN) is a very good choice for decoding such signals.

## **4.1. Artificial neural networks (ANNs)**

Artificial neural networks are designed to model the data processing abilities of a biological nervous system, and they are a major paradigm for data mining applications. The human brain is estimated to have around 10 billion neurons, each connected to an average of 10,000 other neurons. The basic cell of an artificial neural network is a mathematical model of a neuron, represented in **Figure 2**. There are three basic components in an artificial neuron:

**1.** The connecting links possessing weights to the inputs (analogous to synapses in a biological neuron).

**2.** The weighted input values are summed in an adder with a bias, *w*0: ∑ = *x*1 \* *w*1 + *x*2 \* *w*2 + *x*3 \* *w*3 + *w*0.

**3.** An activation function that maps the output of a neuron.


**Figure 2.** Mathematical model of a basic cell of artificial neural networks.

#### *4.1.1. Multilayer perceptron*

The feedforward networks with more than one layer are called multilayer perceptron (MLP). This is a very popular multilayer feedforward architecture. The neurons in each layer of MLP (minimum two layers, one hidden layer, and one output layer) are connected to the neurons of the next layer. The input layer accepts input values and forwards them to the successive layers. The last layer is called the output layer. Layers between input and output layers are called hidden layers. In this work, we adopt the sigmoid (or the logistic) function for implementing activation function. **Figure 3** shows an example of MLP architecture. There is no universal approach to systematically obtain the optimal number of neurons and number of layers. Cross validation is a common practice to obtain the optimal MLP structure, although some other practical constraints such as area and power overhead should also be taken into account when on-chip implementation is considered.

**Figure 3.** Example of MLP architecture.
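To make the structure concrete, here is a minimal Python sketch (assuming NumPy; the layer sizes and weights are illustrative, not those used on the FPGA) of a forward pass through an MLP with one hidden layer and sigmoid activations.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function used in this work."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input -> hidden layer -> output layer."""
    hidden = sigmoid(x @ w_hidden + b_hidden)   # weighted sum plus bias, then activation
    return sigmoid(hidden @ w_out + b_out)

rng = np.random.default_rng(1)
n_inputs, n_hidden, n_outputs = 3, 4, 1         # illustrative layer sizes

w_hidden = rng.normal(size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=(n_hidden, n_outputs))
b_out = np.zeros(n_outputs)

x = np.array([0.2, -0.5, 0.8])                  # e.g. p extracted PCA features
print("network output:", mlp_forward(x, w_hidden, b_hidden, w_out, b_out))
```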


Training of an MLP consists of tuning the weight values associated with each neuron in an iterative manner, such that the outputs of the MLP gradually change toward the desired outputs, also known as target values. The most commonly used training algorithm is backpropagation. Backpropagation computes the partial derivative of the cost function with respect to any weight or bias in the network, which tells us how quickly the cost changes when we change the weights and biases. The algorithm requires a desired output (target) for each input to train the neural network. This type of training is known as supervised learning. When no target values are specified during training, the procedure is referred to as unsupervised learning.

## **4.2. Training of artificial neural networks**

Training artificial neural networks using backpropagation is an iterative process that uses the chain rule to compute the gradient for each layer. Each iteration has two distinct passes, a forward pass and a backward pass, performed layer by layer.

In the forward pass, the outputs of each layer are calculated, starting from arbitrary weights and the inputs, until the last layer is reached. This output is the network's prediction, which is compared to the targets. Based on this, the weights are updated in the backward pass, which starts by calculating the error for each neuron in the output layer and then updating the weights of the connections between the current and previous layer. This continues until the first hidden layer is reached and one iteration is completed. A new matrix of weights is then generated, and the output is calculated again in the forward pass. The input to the neural network is the data that need to be decoded. In the context of BCI data decoding, ECoG signals are recorded from an array of implantable electrodes, then processed, and volitional movement intentions are predicted.
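The following Python sketch (assuming NumPy; the toy data, learning rate, and network size are illustrative assumptions) shows one way to implement this forward/backward loop for a small sigmoid MLP trained with a squared-error cost.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy supervised problem (XOR): inputs X and target outputs T
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer weights and biases
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer weights and biases
lr = 0.5                                         # learning rate

for epoch in range(5000):
    # Forward pass: compute hidden and output activations
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Backward pass: error signals via the chain rule (squared-error cost)
    dY = (Y - T) * Y * (1 - Y)          # output-layer error
    dH = (dY @ W2.T) * H * (1 - H)      # error propagated back to the hidden layer

    # Weight updates between the current and previous layers
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print("final mean squared error:", float(np.mean((Y - T) ** 2)))
```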

## **5. Proposed framework**

So far, we have presented the theory behind the data processing blocks. Here, a detailed explanation of the real-time on-chip ECoG signal extraction and decoding is given. The discussion includes an overview of the proposed design and an in-depth description of the principal component analysis and multilayer perceptron implementations.

## **5.1. Overview of the proposed framework**

In **Figure 4**, we show an overview of the different blocks in the BCI system; as mentioned before, our work mainly focuses on the design of the data processing block—PCA and MLP. To record the electrical activity of the brain, an array of implantable electrodes is placed invasively on the cerebral cortex, and raw ECoG signals are recorded. The interpretable actions for prosthesis control or stimulation of the spinal cord are produced by various on-chip preprocessing blocks after the ECoG signals are recorded. These include:

**a.** Raw signal amplifiers and noise filters, which combine into *an analog front-end circuit*.

**b.** An analog-to-digital converter (ADC) that converts analog ECoG signals into their digital version.

**c.** Feature extraction and feature decoding blocks—the *data processing block*.

**d.** A stimulator that sends the spinal stimulations or triggers the prosthetic actuators.


**Figure 4.** Overview of the BCI system in which the proposed framework is implemented.

Designing the data processing blocks to decode the volitional movement intentions is the main focus of this work. More accurate ECoG signals are obtained with intracranial depth electrode technology, which also gives good spatial resolution and a high sampling rate. For example, ECoG signals can be recorded using electrodes placed in an array; in Ref. [10], each array consists of 62 electrodes. Hence, decoding such high-dimensional ECoG data in real time is challenging. As shown in **Figure 4**, the two main parts of the data processing block are feature extraction and ECoG signal decoding. The following sections explain in detail the hardware-friendly implementation of these components.

## **5.2. Hardware friendly PCA**


Principal component analysis (PCA) is a classical data processing technique that projects the original data onto a small set of components retaining the maximum amount of variance. It is a popular algorithm used in feature extraction. In PCA, the most challenging part is the calculation of the eigenvectors. These eigenvectors can be calculated from the covariance matrix, but doing so is computationally demanding [11]. Furthermore, we need a hardware-implementable algorithm, as our ultimate goal is to implement it on the FPGA platform. To achieve this, we use a hardware friendly PCA algorithm, which is not only very efficient in terms of implementation but also extracts the features that are most significant for classification. These features are extracted from the recorded ECoG data [11, 12].

**Figure 5.** Functional blocks of the hardware friendly version of the PCA.

**Figure 5** summarizes the functional blocks of the hardware friendly PCA used in this system. The inputs to this algorithm are a covariance matrix and random vectors used to generate the eigenvectors; these inputs are stored in a look-up table (LUT). The covariance matrix Σ*cov* is calculated from the input data, which is the recorded ECoG data. This input data is of size *k* × *n*, where *n* is the number of features and *k* is the number of samples collected for each feature. The algorithm requires two parameters to be declared initially: the total number of eigenvectors required, *p* (*p* ≤ *n*), and the number of iterations used to calculate each eigenvector, *r*. Once the inputs are stored and the initial values declared, the random vector is repeatedly multiplied with the covariance matrix until the iteration count is reached and the first principal component *PC* is obtained. This technique is called the eigenvector distilling process [11]. Each of the remaining eigenvectors likewise requires *r* iterations, as declared at the beginning. To compute all *PCs* other than the first, an additional step beyond the eigenvector distilling process is required: the orthogonal process. We use Gram-Schmidt orthogonalization for this step. It removes all the previously computed *p* − 1 *PCs* from the current eigenvector and its intermediary values.

All four basic math operations were used to compute these eigenvectors in the original algorithm, the fast PCA algorithm [12]: addition, multiplication, and norm operators, which include division and square root operations. In Ref. [11], a flipped structure is proposed to achieve minimum power and smaller area requirements. The normalization *ϕp* = *ϕp*/∥*ϕp*∥, which is a norm operation, is eliminated, and the orthogonalization step is written as

$$\phi_p = \phi_p - \frac{\phi_p^T \phi_j}{\| \phi_j \|^2} \, \phi_j$$

This equation is then multiplied by ∥*ϕj*∥2, so the orthogonal process turns into

$$\phi_p = \| \phi_j \|^2 \phi_p - \left( \phi_p^T \phi_j \right) \phi_j$$

which can be implemented using only adders and multipliers; these are much more efficient in terms of hardware implementation than division and square root operations. However, this implementation drastically increases the dynamic range of all the values. In Ref. [11], an adaptive level-shifting scheme was proposed to keep the dynamic range within a limit; since we implement the algorithm using fixed-point mathematical operators, these operators automatically bound the dynamic range, and we have therefore eliminated the adaptive level-shifting scheme in our implementation. The algorithm described so far is shown as Algorithm 1 in **Figure 6**. It is designed to pick *p* eigenvectors, which gives a *k* × *p* feature matrix *M*′.
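A minimal NumPy sketch of the distilling loop with the division-free (flipped) orthogonalization described above is given below; the function name, the software-only rescaling step and the final normalization are illustrative additions and do not correspond to the fixed-point hardware of Refs. [11, 12].

```python
import numpy as np

def distill_eigenvectors(cov, p, r, seed=1):
    """Estimate the p leading eigenvectors of `cov`, using r multiply iterations
    per component and a division-free (flipped) Gram-Schmidt orthogonalization."""
    rng = np.random.default_rng(seed)
    pcs = []
    for _ in range(p):
        phi = rng.normal(size=cov.shape[0])        # random starting vector
        for _ in range(r):
            phi = cov @ phi                        # eigenvector "distilling" step
            for q in pcs:
                # phi <- ||q||^2 * phi - (phi . q) * q : adders/multipliers only
                phi = (q @ q) * phi - (phi @ q) * q
            phi = phi / np.max(np.abs(phi))        # software-only rescale; the FPGA
                                                   # relies on fixed-point range instead
        pcs.append(phi / np.linalg.norm(phi))      # normalized for reporting
    return np.stack(pcs, axis=1)                   # n x p matrix of principal directions
```

Projecting the *k* × *n* data matrix onto these *p* directions then yields the *k* × *p* feature matrix *M*′ used by the classifier.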


**Figure 6.** Eigenvector distilling algorithm.

## **5.3. MLP design**


For the classification of data, specifically volitional movement intentions, the design of the MLP is the next step once the dimensionality has been reduced using the on-chip PCA algorithm described above. The MLP contains multiple layers of nodes, with each layer fully connected to the next. The basic computing unit is the artificial neuron, which is designed using fixed-point multiplication and addition operators. The inputs to a neuron are first multiplied by their weights and summed, and this weighted sum is passed to a nonlinear differentiable activation function, here the log-sigmoid. An LUT is used to implement the transfer function. The number of neurons in each layer and the total number of layers are reconfigurable. The MLP is trained off-line, and the learned weights are stored in random access memory (RAM) embedded on the FPGA board.

## *5.3.1. Fixed-point multiplier*

A parameterized fixed-point signed multiplier is designed for the multiplication operation taking place inside a neuron. The parameters that can be altered based on the design requirements and available bit widths are the bit widths of the two operands (WI1 + WF1; WI2 + WF2) and that of the output (WIO + WFO), where WIx is the integer-part bit width and WFx the fractional-part bit width. In a normal multiplier, the integer and fractional bit widths of the output are obtained by adding the operands' integer and fractional bit widths: WIO = WI1 + WI2 and WFO = WF1 + WF2. As a result, the bit width roughly doubles after every operation. To make the design hardware efficient, truncation and rounding are applied to reduce the bit width as needed. In this experiment, the integer and fractional bit widths are kept equal for the two operands. The integer part is truncated by removing all the extra bits (WI1 + WI2 − WIO) while keeping the sign bit; in the fractional part, only the required bits (WFO) are kept and all extra less-significant bits are truncated. An overflow flag, which indicates an incorrect result, is set to 1 if the truncated bits do not match the sign bit.
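The truncation rule can be sketched as follows, assuming two's-complement operands held in plain Python integers and saturation on overflow; the function name and example operands are illustrative only.

```python
def fixed_mul(a, b, WF1, WF2, WIO, WFO):
    """Multiply two signed fixed-point numbers stored as scaled integers
    (WFx fractional bits each), then truncate the product to WIO.WFO bits."""
    full = a * b                                  # exact product with WF1+WF2 fractional bits
    res = full >> (WF1 + WF2 - WFO)               # drop extra less-significant fractional bits
    limit = 1 << (WIO + WFO - 1)                  # representable range of the output format
    overflow = not (-limit <= res < limit)        # discarded integer bits disagree with sign
    if overflow:                                  # saturate instead of wrapping
        res = limit - 1 if res > 0 else -limit
    return res, overflow

# Example in Q3.5 format: 0.75 -> 24, -1.5 -> -48; result -36 represents -1.125.
print(fixed_mul(24, -48, 5, 5, 3, 5))             # (-36, False)
```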

## *5.3.2. Fixed-point adder*

Similarly to the multiplier, in a normal addition the integer bit width of the result is one plus the larger of the two operand integer bit widths (if WI1 > WI2, WIO = WI1 + 1, else WIO = WI2 + 1), and the fractional bit width of the result equals the larger of the two operand fractional bit widths (if WF1 > WF2, WFO = WF1, else WFO = WF2). As in the multiplier, we perform truncation and rounding and set the overflow flag.
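A corresponding sketch of the adder, under the same assumptions as the multiplier above (scaled-integer operands, saturation on overflow):

```python
def fixed_add(a, b, WF1, WF2, WIO, WFO):
    """Add two signed fixed-point numbers after aligning their fractional points,
    then truncate the sum back to WIO.WFO bits."""
    wf = max(WF1, WF2)
    full = (a << (wf - WF1)) + (b << (wf - WF2))   # align and add at full precision
    res = full >> (wf - WFO)                       # drop extra fractional bits
    limit = 1 << (WIO + WFO - 1)
    overflow = not (-limit <= res < limit)         # truncated bits differ from the sign bit
    if overflow:
        res = limit - 1 if res > 0 else -limit
    return res, overflow
```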

## *5.3.3. Activation function*

As mentioned before, a nonlinear differentiable transfer function is used at the final stage of a neuron. The activation function takes the weighted sum of all inputs and generates a nonlinear output. The most commonly used activation functions for multilayer perceptrons in pattern recognition are the logistic functions (log-sigmoid and tan-sigmoid). The log-sigmoid generates outputs in the range 0 to 1 as the neuron's net input goes from negative infinity to positive infinity, while the tan-sigmoid ranges between −1 and +1. The log-sigmoid is a special case of the logistic function. The equation and the curve of the log-sigmoid function are shown in **Figure 7**:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

**Figure 7.** Equation and curve of log-sigmoid function.

The value of *x* (the input to the activation function) is restricted to the range −5 to +5 in this work, at 5-bit fractional precision. This gives 32 (2<sup>5</sup>) values between every two consecutive integers, for a total of 320 values. These fixed-point binary values of *f*(*x*) are stored in an LUT whose address can be represented by 9 bits.

#### *5.3.4. Implementing LUT on hardware*

The LUT in our design is used more efficiently than a generic LUT implementation. One option was to use two separate LUTs, one for the input *x* and one for the output *f*(*x*), with 320 values each; instead, only one LUT for *f*(*x*) is used. By doing this, the area utilization is halved for every neuron and the speed is doubled. The input to the sigmoid function is a 12-bit binary number (six integer bits and six fractional bits) coming from the output of an adder. The 5-bit precision is chosen so that the LUT address can be derived from the fractional part, ranging from 00000 to 11111. The output should be 0 or 1 if the input is less than −5 or more than +5, respectively; hence, 0 is stored at the first address of the LUT to map any value less than or equal to −5. For example, for the input −2.96875 (111101:00001), the output is read from address 65 of the LUT, which is (−3 + 5) \*32 + 1, where (111101), or −3, is the integer part of the input and (00001), or 1, is its fractional part. Equation 2 is used to calculate the address when the input lies between −5 and +5:

$$\text{address}_{9\text{-bit}} = \left(\text{input}_{\text{integer}} + 5\right) \times 32 + \text{input}_{\text{fraction}} \tag{2}$$
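A small sketch of the LUT construction and the addressing of Eq. (2) follows, assuming floating-point inputs for readability rather than the 6.6 fixed-point format used on chip.

```python
import math

STEP = 1 / 32                                          # 5-bit fractional resolution
LUT = [1 / (1 + math.exp(-(-5 + i * STEP))) for i in range(320)]  # f(x) sampled on [-5, 5)

def sigmoid_lut(x):
    """Log-sigmoid via the 320-entry LUT; saturate outside [-5, +5]."""
    if x <= -5:
        return LUT[0]                                  # first address holds f(-5), close to 0
    if x >= 5:
        return 1.0
    integer = math.floor(x)                            # signed integer part
    fraction = int(round((x - integer) * 32))          # 5-bit fractional part
    address = (integer + 5) * 32 + fraction            # Eq. (2)
    return LUT[min(address, 319)]

print(sigmoid_lut(-2.96875))                           # reads address (-3 + 5)*32 + 1 = 65
```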

The number of multipliers and adders (i.e., the number of inputs) required to form a neuron depends on the number of neurons in the previous layer.

#### *5.3.5. Implementing MLP on hardware*


After designing a neuron, the next task is to form the neural network, which has hidden layers and an output layer with a number of neurons interconnected in the defined manner. There is no rule of thumb for deciding the number of layers and the number of neurons in each layer. Adding a single neuron to a hidden layer adds at least one multiplier and one adder for every neuron in the next layer, which increases the hardware utilization.

These parameters can also be selected based on a criterion called the Akaike information criterion (AIC). Because the relationship between network size and generalization is complex owing to strong nonlinearity, this statistical approach can be used to determine the optimum number of hidden units in a neural network. AIC can be expressed by the following equation:

$$\mathrm{AIC} = n \ln\left(\frac{RSS}{n}\right) + 2K \tag{3}$$

where *n* is the number of observations, *RSS* is the residual sum of squared errors, and *K* is the number of parameters (the total number of weights). The lower the value of AIC, the better the neural network architecture. As the number of neurons in the hidden layer increases, AIC improves up to a point; beyond a certain number of hidden neurons, AIC starts to increase and changes very little with further changes of architecture [13].
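A sketch of how such a selection could be scripted is shown below; the RSS values are hypothetical placeholders, since the actual residuals come from the off-line training runs.

```python
import math

def aic(n_obs, rss, k_params):
    """Akaike information criterion for a fitted network: n*ln(RSS/n) + 2K."""
    return n_obs * math.log(rss / n_obs) + 2 * k_params

def weights_for(hidden, n_in=3, n_out=3):
    """Total weight count for a one-hidden-layer MLP (biases ignored, as in the text)."""
    return n_in * hidden + hidden * n_out

# Hypothetical RSS per hidden-layer size; pick the architecture with the lowest AIC.
rss_by_hidden = {3: 4.1, 4: 2.2, 5: 0.9, 6: 0.9}
best = min(rss_by_hidden, key=lambda h: aic(240, rss_by_hidden[h], weights_for(h)))
print(best)
```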

As shown in **Figure 8**, our architecture has one hidden layer and an output layer with five and three neurons (3 bit output), respectively. The use of delay elements (boxes) is explained later in this chapter.

**Figure 8.** Architecture of the implemented MLP.

The neural network is pipelined to fully utilize the hardware and achieve a better operating frequency. The boxes seen in **Figure 8** are the delay elements (registers) that store temporary values of the previous outputs. Two pipeline stages are implemented, increasing throughput at the cost of latency, and the operating frequency reaches up to 83 MHz.

## **6. Experimental setup and results**

To demonstrate the on-chip implementation of the proposed work, we use openly accessible ECoG data collected for studies related to sensorimotor restoration [10]. We will show that, using our approach, voluntary movements can be decoded efficiently in real time from high-dimensional ECoG data. This serves as a strong foundation for a completely automated BCI system.

## **6.1. Experimental setup**

In Ref. [10], an off-chip analog front-end amplifier/filter and an ADC were used to amplify and digitize the ECoG signals collected using electrode grids. Each grid contained 62 platinum electrodes arranged in an 8 × 8 layout. Therefore, 62 channels of ECoG data were collected at once, and each measurement was referenced to a scalp reference and ground.

**Figure 9.** FPGA board used for the experimental study.

A computer display monitor is placed alongside the subject, and the finger to be moved is displayed on this monitor. The subject is asked to move that particular finger, and the ECoG signals are collected during this movement. There is a 2-second gap between movements, during which the screen displays nothing; each movement was also recorded for a 2-second period. Along with the ECoG signals, the position of the finger was recorded. The data collection is explained in detail in [10]. The dataset we used is a 400,000 × 62 matrix, where 400,000 is the number of observations collected for the study and each observation spans 62 channels. Here, our main focus is on movement versus non-movement of a finger: if a finger is moved, it is classified as 1 and 0 otherwise. The predictor model outputs one class for each of the five finger movements, and a sixth class when all the fingers are at rest. The output of our model is therefore a 400,000 × 1 matrix. We used a Xilinx ARTIX-7 FPGA kit to demonstrate the proposed model. As explained earlier, we use the RAM embedded in the FPGA board to store our input values. An RS-232 serial port is used to read back the values from the FPGA after all the computations are finished. We achieved an 83.33 MHz operating frequency, the maximum possible in our design, with +0.5 ns of worst negative slack; this can be further optimized to reach an 86 MHz operating frequency. The Xilinx Artix-7 FPGA kit used for this work is shown in **Figure 9**.

#### **6.2. Feature extraction based on on-chip PCA**


The original dataset is divided into a training set *S*tr and a validation set *S*va. *S*tr is a 240 × 62 matrix, where 240 is the number of samples across the 62 channels; 40 samples are chosen at random for each of the six classes explained above. The validation matrix therefore becomes (400,000 − 240) × 62 for the same classes. The finger positions of the six different classes are shown in **Figure 10**, plotted as a function of the 240-sample recording time. To reduce the size of the input data matrix, the hardware friendly PCA is used to extract features from the input matrix. The 240 samples are projected onto the first and second principal components obtained from the PCA; **Figure 11(a)** shows this projection as a scatter plot.
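A sketch of this split follows, assuming a label vector that assigns each of the 400,000 samples to one of the six classes; the function and variable names are illustrative.

```python
import numpy as np

def split_train_validation(data, labels, per_class=40, n_classes=6, seed=0):
    """Pick `per_class` random samples of each class as the training set
    (240 x 62 here); all remaining samples form the validation set."""
    rng = np.random.default_rng(seed)
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), per_class, replace=False)
        for c in range(n_classes)
    ])
    mask = np.zeros(len(labels), dtype=bool)
    mask[train_idx] = True
    return data[mask], labels[mask], data[~mask], labels[~mask]
```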

**Figure 10.** Merged recorded finger positions of the six classes as a function of recording time.

**Figure 11.** Projection of the 240 training samples onto the first two principal components using (a) the proposed on-chip PCA and (b) the MATLAB PCA function based on the singular value decomposition (SVD) algorithm.

Each color denotes a different class, and it is clearly evident from this figure that the training samples are well separated in this space. The hardware friendly PCA algorithm used in this work is compared with the singular value decomposition (SVD)-based MATLAB PCA function in **Figure 11(b)**. The samples computed by the proposed PCA algorithm closely match those obtained from the traditional SVD-based MATLAB function, demonstrating the accuracy and effectiveness of the proposed on-chip PCA algorithm. The number of principal components required to represent the reduced dataset is determined by the cumulative amount of variance covered by the principal components. In our case, the first three principal components account for 80% of the total variance in the dataset. Considering more than three principal components would not increase the captured variance significantly but would increase the computational cost of the algorithm many times over in terms of hardware utilization.

Furthermore, **Figure 12** displays the mean squared error (MSE) of the first three principal components over different iteration counts, comparing the principal components obtained from the MATLAB function (used as the reference) with those from the algorithm used in this work. We obtained MSE values of less than 0.2 for the three principal components with a maximum of 12 iterations, showing the efficiency of the proposed framework.
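The comparison in Figure 12 can be reproduced in software roughly as follows, using NumPy's SVD in place of the MATLAB function as the reference; the simple power-iteration loop here stands in for the on-chip distilling hardware.

```python
import numpy as np

def pc_mse(cov, p=3, r=12, seed=1):
    """MSE between iteratively distilled PCs and SVD-based reference PCs (per component)."""
    ref = np.linalg.svd(cov)[0][:, :p]               # reference eigenvectors (columns)
    rng = np.random.default_rng(seed)
    found, errs = [], []
    for i in range(p):
        phi = rng.normal(size=cov.shape[0])
        for _ in range(r):
            phi = cov @ phi
            for q in found:                          # remove previously found components
                phi -= (phi @ q) * q
            phi /= np.linalg.norm(phi)
        found.append(phi)
        sign = 1.0 if phi @ ref[:, i] >= 0 else -1.0  # eigenvectors are sign-ambiguous
        errs.append(float(np.mean((sign * phi - ref[:, i]) ** 2)))
    return errs
```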

**Figure 12.** Mean squared error of the first three principal components as a function of iterations, taking the principal component values computed using the MATLAB SVD algorithm as the baseline.

## **6.3. Classification based on on-chip MLP**


In this work, we tested neural network structures with different numbers of neurons in the hidden layer and calculated the classification error for each. For the output layer, three neurons are chosen, corresponding to the three output bits. For the hidden layer, 2–10 neurons work well for most applications. Testing 3–8 neurons, three neurons give an AIC of −929 and four neurons −1082; the value decreases to −1286 for five and six neurons. However, the total number of weights increases to 30 ((3\*5) + (5\*3)) for five neurons and 36 ((3\*6) + (6\*3)) for six neurons. Further increases in the number of neurons show little improvement in AIC while increasing the hardware utilization, since the number of weights grows by 6 for each additional neuron, and with it the number of computing elements (adders and multipliers) in the next stage.


**Table 1.** Summary of area utilization of the proposed architecture.

The neural network is trained by giving the first three principal components (out of 62) as inputs, with 40 samples per class, i.e., a data matrix of size 240 × 3. The implemented design, with one hidden layer of five neurons and an output layer of three neurons, has two-stage pipelining. The clock period is 12 ns, which also sets the throughput; the two stages increase the latency to 24 ns. One hundred percent accuracy is reached on the training set (240 samples). The remaining validation set of 399,760 (400,000 − 240) samples is then given to the trained neural network, yielding a classification accuracy of 82.4%, which is reasonable considering all the noise and perturbations present during recording.


**Table 2.** Summary of difference performances for five different bit-width values.

The power consumption of this architecture is 152 mW. The area utilization is summarized in **Table 1**. As seen in the table, the proposed architecture uses less than 25% of the available resources; this is a good starting point for a future application-specific integrated circuit (ASIC) design, which could reduce power consumption even further.

We also tried various MLP architectures with different bit widths, which yield different accuracy, power, speed, and area. **Table 2** summarizes the performance for five different bit-width values. The bit-widths column lists the bit widths used to represent the data, including the covariance matrix, the intermediate results of the algorithm, and the final principal components; they are given in "integer length and fractional length" form.

#### **6.4. Discussions**

The bit width we choose has a direct impact on the accuracy of the algorithm and on its power consumption. The bit widths of the input covariance matrix, of the intermediate results, and of the algorithm's output need to be considered first in order to determine the accuracy of the algorithm for our applications. Second, the number of principal components needs to be carefully selected to improve computational efficiency. Third, the number of iterations required to compute each of the principal components should also be chosen optimally.

For a given operating frequency, the silicon area utilization and the power consumption mainly depend on the first parameter, whereas the processing capability of the algorithm is influenced by the second and third parameters. Processing capability is mainly determined by the number of channels that can be trained using the PCA algorithm in a given amount of time. Power consumption for different bit widths can be kept constant by reducing the frequency, i.e., the speed of execution. When higher bit widths are chosen for the sake of accuracy, the area required for the covariance matrix memory, register files, and processing units increases drastically in order to store and process more bits. It can also be observed that power consumption increases with higher frequencies.

## **7. Conclusion**

This chapter presented an on-chip computation framework that decodes ECoG brain signals in a BCI system, serving as a pathway toward a real-time BCI system. The two main blocks of our proposed decoding model are a hardware friendly PCA model and an artificial neural network (ANN). Openly accessible ECoG recordings and the experimental results from the FPGA show an accuracy of over 80% for predicting single-finger movement.

## **Acknowledgements**

This project was supported by Award Number EEC-1028725 from the National Science Foundation. The authors take full responsibility for the content, which does not necessarily represent the official views of the National Science Foundation.

## **Author details**


Mradul Agrawal, Sandeep Vidyashankar and Ke Huang\*

\*Address all correspondence to: khuang@mail.sdsu.edu

Department of Electrical and Computer Engineering, Center for Sensorimotor Neural Engineering (CSNE), San Diego State University, San Diego, CA, USA

## **References**


cosine transform. IEEE Engineering in Medicine and Biology Society (EMBC). Chicago, IL, USA. 2014; 1626–1629.

[10] G. Schalk, J. Kubanek, K. Miller, N. Anderson, E. Leuthardt, J. Ojemann, D. Limbrick, D. Moran, L. Gerhardt, and J. Wolpaw. Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering. 2007; 4:264–275.

[11] T. Chen, W. Liu, and L. Chen. VLSI architecture of leading eigenvector generation for on-chip principal component analysis spike sorting system. 30th Annual International IEEE EMBS Conference. Vancouver, BC, Canada. 2008; 3192–3195.

[12] A. Sharma, and K. Paliwal. Fast principal component analysis using fixed-point algorithm. Pattern Recognition Letters. 2007; 28:1151–1155.

[13] G. Panchal, A. Ganatra, Y.P. Kosta, and D. Panchal. Searching most efficient neural network architecture using Akaike's information criterion (AIC). International Journal of Computer Applications. 2010; 1(5): 41–44.



## **The Usage of Statistical Learning Methods on Wearable Devices and a Case Study: Activity Recognition on Smartwatches**


Serkan Balli and Ensar Arif Sağbas

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66213

## **Abstract**


The aim of this study is to explore the usage of statistical learning methods on wearable devices and to carry out an experimental study on the recognition of human activities using smartwatch sensor data. To achieve this objective, mobile applications that run on a smartwatch and a smartphone were developed to collect training data and detect human activity momentarily; 500 pattern data were obtained at 4-second intervals for each activity (walking, typing, stationary, running, standing, writing on board, brushing teeth, cleaning and writing). The created dataset was tested with five different statistical learning methods (Naive Bayes, k nearest neighbour (kNN), logistic regression, Bayesian network and multilayer perceptron) and their performances were compared.

**Keywords:** statistical learning, activity recognition, wearable devices, smartwatch, Bayesian networks

## **1. Introduction**

The usage of wearable technology is increasing rapidly, and its effects on user healthcare are enormous. Today's smart devices have more built-in sensors than before. Wearable sensors are small devices carried by people while they perform daily activities. Sensors such as an accelerometer, microphone, GPS and barometer record the physical condition of the person, such as location change, moving direction and moving speed. The latest smartphones and smartwatches have many wearable sensors built in [1, 2]. Because they are equipped with various on-board sensors, smartphones and wrist-worn devices such as smartwatches have been used extensively for activity recognition in recent studies [3]. With the popularity of smartwatches, wrist-worn sensor devices will become an increasingly


important tool in personal health monitoring [4]. Statistical learning methods are generally used in activity recognition studies. Statistical learning refers to a set of tools for modelling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning [5].

The aim of this chapter is to investigate the usage of statistical learning methods on wearable devices and to carry out a case study on the recognition of human activities from smartwatch accelerometer data using statistical learning methods. This chapter is organized as follows: related works are described in detail in Section 2. Then, an overview of statistical learning methods is given in Section 3. Next, human activity recognition with smartwatches is explained in Section 4. Finally, Section 5 concludes the chapter.

## **2. Related works**

Examining the literature, various studies can be found that apply statistical learning methods to wearable devices. Wang et al. [6] considered a user typing on a laptop keyboard while wearing a smartwatch. The accelerometer and gyroscope data, obtained from a Samsung Galaxy Live, were used as training data and processed through a sequence of steps, including key-press detection, hand-motion tracking, character point cloud computation and Bayesian modelling and inference. Shoaib et al. [3] carried out recognition of different living activities using a smartphone and a smartwatch simultaneously and evaluated their effectiveness in recognizing human activities. They used J48, kNN and SVM (support vector machines) to recognize 13 various activities. da Silva and Galeazzo [7] presented the development of a system based on computational intelligence techniques and an accelerometer to perform, in a comfortable and non-intrusive manner, the recognition of basic movements of a person's routine. Three different computational intelligence techniques were evaluated in order to search for the best recognition performance for the movements executed by the watch user. Chernbumroong et al. [8] studied the classification of five human activities using only accelerometer data and two learning algorithms: artificial neural networks and the C4.5 decision tree. Scholl and van Laerhoven [9] presented a feasibility study with smokers wearing an accelerometer device on their wrist over the course of a week to detect their smoking habits, based on detecting typical gestures carried out while smoking a cigarette. The Gaussian method was used as a classifier. Dong et al. [10] described a new method that uses a watch-like configuration of sensors to continuously track wrist motion throughout the day and automatically detect periods of eating. Accelerometer and gyroscope sensor data were used in this study. Ramos-Garcia and Hoover [11] developed a Hidden Markov model (HMM) and compared its recognition performance against a non-sequential classifier (kNN), using a set of four actions (rest, utensiling, bite and drink). Trost et al. [12] compared the activity recognition rates of an activity classifier trained on acceleration signals collected at the wrist and hip. Features were extracted from 10-second windows and input into a regularized logistic regression model. Guiry et al. [1] investigated the role that smart devices, including smartphones and smartwatches, can play in identifying activities of daily living. The activities examined include walking, running, cycling, standing, sitting, elevator ascents, elevator descents, stair ascents and stair descents. Data from this study were used to train and test five well-known statistical machine learning algorithms: C4.5, CART, naïve Bayes, multilayer perceptrons and finally support vector machines. Mortazavi et al. [4] introduced a framework for platform creation (e.g. an accelerometer-only system versus accelerometer and gyroscope) and machine learning of some activities, which can be especially useful in the emerging market of smartwatches. Random forests, decision trees, Naive Bayes and SVM methods were compared. Khan et al. [13] implemented a smartphone-based HAR scheme in accordance with these requirements. Time domain features were extracted from only three smartphone sensors, and a nonlinear discriminatory approach was employed to recognize 15 activities with high accuracy. Evaluations were performed in both offline and online settings. Dadashi et al. [14] carried out automatic detection of important breaststroke swimming events using a Hidden Markov model (HMM) and wearable sensors. Parkka et al. [15] used accelerometers and gyroscopes attached to the ankle, wrist and hip to estimate the intensity of physical activity. Data from common everyday tasks and exercise were collected from 11 subjects. Shen et al. [16] tracked the 3D posture of the entire arm (both wrist and elbow) using the motion and magnetic sensors of smartwatches. Bieber and Peter [17] studied behaviour analysis using 3D sensor data and learning techniques and obtained sufficient results. Bao and Intille [18] developed and evaluated an algorithm to detect physical activities from data acquired using five small biaxial accelerometers worn simultaneously on different parts of the body. Kim et al. [19] developed an application using sensor signals from a smartphone and a smartwatch. A summary of the literature is given in **Table 1**.




**Table 1.** Summary of the studies (columns: Ref. No., Author, Year, Detection, Device, Sensors, Methods).

## **3. Overview of statistical learning**

Statistical learning comprises a large number of unsupervised and supervised tools for drawing inferences from data. In general terms, supervised statistical learning employs a statistical model to estimate or predict an output using relevant inputs in various areas such as public policy, medicine, astrophysics and business. In unsupervised statistical learning, the relationships and structure of the data can be learned without supervising the output [5]. In this chapter, supervised statistical learning methods (Naive Bayes, logistic regression, Bayesian network, k nearest neighbour (kNN) and multilayer perceptron) are used for activity recognition.

The Naive Bayes method learns and represents probabilistic information from data in a clear and easily understood way; it is a supervised learning method, in which the classes are known in the training phase and class prediction is carried out in the test phase [20]. The multilayer perceptron is a feedforward artificial neural network structure, because the output of the input layer and of all intermediate layers is passed only to the next higher layer. Here, 'layer' means a layer of perceptrons; the number of hidden layers and the number of perceptrons in each hidden layer are not limited [21]. In kNN, the whole calibration dataset is used as the classification model; in other words, kNN does not build a separate model from the calibration data because of its non-parametric construction. A test set is placed in the same multidimensional hyperspace as the calibration set for classification, and the k nearest neighbours from the new test object to the calibration objects are computed; 'nearest' means the smallest distance under a chosen norm [22]. Logistic regression is used to describe and test hypotheses about associations between a class variable and other related predictor variables by estimating probabilities using a logistic function; it can be binomial, ordinal or multinomial [23]. Bayesian networks are one type of probabilistic graphical model. In a Bayesian network, knowledge about an uncertain domain is represented as a graphical structure: variables are represented as nodes in the graph, while probabilistic dependencies among the variables are represented as edges. The values of the edges can be calculated using known computational and statistical methods [24]. The model structure of the Bayesian network used for the case study is shown in **Figure 1**; the variables are the standard deviations and averages of the *x*-, *y*- and *z*-axes of the accelerometer sensor.

**Figure 1.** The model structure of the Bayesian network.
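For readers who want to reproduce a comparison of this kind outside WEKA, the sketch below uses scikit-learn stand-ins for four of the five methods (a Bayesian network has no scikit-learn equivalent and would need a dedicated library such as pgmpy); the feature matrix X and label vector y are assumed to hold the six accelerometer features and activity labels described in Section 4.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def compare_methods(X, y):
    """Random 50/50 train/test split and accuracy for four of the five methods."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    models = {
        "Naive Bayes": GaussianNB(),
        "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Multilayer perceptron": MLPClassifier(max_iter=1000, random_state=0),
    }
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```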



## **4. Case study: activity recognition on smartwatches using statistical learning methods**

In this study, activity recognition is performed using accelerometer sensor data. The accelerometer measures the acceleration force in m/s² applied to the device along the three physical axes (*x*, *y* and *z*) shown in **Figure 2**, including the force of gravity [25, 26].

**Figure 2.** Smartwatch accelerometer axes.

**Figure 3** shows the amplitude change of the accelerometer *x*-axis for nine different daily activities (typing, writing, writing on board, walking, running, cleaning, standing, brushing teeth and stationary).

The smartwatch accelerometer signals are utilized for activity detection using statistical learning methods. **Figure 4** shows the flowchart of activity recognition, which includes the steps of collecting data, feature selection, classification and development of the smartwatch application. Information about these steps is given in the following sub-sections.

#### **4.1. Collecting data and feature selection**

**Hardware:** A Motorola Moto 360 [27] smartwatch (**Figure 5**) is used. This device has a quad-core 1.2 GHz processor, 512 MB RAM and built-in accelerometer, pedometer, ambient light and optical heart rate monitor sensors. In this chapter, only the accelerometer sensor is used to detect human activities.

**Figure 3.** Amplitude change of accelerometer x‐axis.


While coding the smartwatch application, SENSOR\_DELAY\_UI is set as the sensor delay, which gives a sampling rate of 50 Hz. The device is capable of tracking daily life activities for about a full day on its 400 mAh battery, and it runs the Android Wear operating system.

**Figure 4.** Flowchart of activity recognition.

**Figure 5.** Smartwatch that used in this case study.

**Software:** For collecting the dataset, two Android-based applications were developed using the Java programming language. One of these applications runs on the smartwatch (**Figure 6b**) and the other on the smartphone (**Figure 6a**) connected to the smartwatch, because the collected sensor data are sent to the smartphone and kept in its internal storage.

**Figure 6.** (a) Smartphone dataset application, (b) smartwatch dataset application.

**Figure 7.** Structure of dataset application.


The smartwatch application has a single push button, which starts and stops the collection of sensor data. **Figure 7** shows the structure used for storing sensor data on the smartphone. The collected sensor data are transferred to the connected smartphone and stored in its internal memory in CSV format under the desired label name. To start the data collection process, the user enters the name of the activity being performed in the mobile phone application and presses the 'Begin' button in the smartwatch application. During data collection, the smartwatch must be worn on the wrist.

For training the statistical methods, raw sensor data were collected for nine different human activities, viz. running, walking, typing, writing, standing, writing on board, stationary, cleaning and teeth brushing, comprising 900,000 lines (100,000 samples for each activity). The data are then split into parts of 200 lines (4-second intervals) to form patterns; thus, each activity has 500 patterns. Features are extracted from the raw accelerometer data: the standard deviations and average values of the *x*-, *y*- and *z*-axes of the accelerometer data, as given in **Figure 1**.
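A sketch of this windowing and feature extraction follows, assuming the raw data are available as an array with one row per sample and columns x, y, z; names are illustrative.

```python
import numpy as np

def extract_features(samples, window=200):
    """Split raw accelerometer rows (columns x, y, z) into `window`-sample patterns
    and compute the six features used here: mean and standard deviation of each axis."""
    n_windows = len(samples) // window
    feats = []
    for w in range(n_windows):
        seg = samples[w * window:(w + 1) * window]     # one 4-second pattern at 50 Hz
        feats.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.array(feats)                             # shape: n_windows x 6
```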

#### **4.2. Classification with statistical learning methods**

**Experimental study:** The extracted features are tested with five different statistical learning methods (Naive Bayes, kNN, logistic regression, Bayesian network and multilayer perceptron) using the WEKA Toolkit [28], which implements these methods. Half of the data is used for training and the remainder for testing; the training and test data are split randomly. **Table 2** displays the comparison of evaluation metrics such as accuracy rate, *F*-measure, ROC area and root mean squared error (RMSE) of the statistical learning methods. The *F*-measure is a measure of a test's accuracy; its formulation is given in Eq. (1), where FN, FP, TP and TN represent the number of false negatives, false positives, true positives and true negatives, respectively [29].

$$F\text{-measure} = \frac{2 \times \frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}{\frac{TP}{TP + FP} + \frac{TP}{TP + FN}} \tag{1}$$

The RMSE of a model prediction with respect to the estimated variable *X*model is defined as the square root of the mean squared error given in Eq. (2):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(X_{\mathrm{obs},i} - X_{\mathrm{model},i}\right)^{2}} \tag{2}$$

where *X*obs,*i* is the observed value and *X*model,*i* is the modelled value at time/place *i* [30].

The ROC (receiver operating characteristic) area, also known as the area under the curve (AUC), is calculated as in Eq. (3):


$$\mathrm{AUC} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbf{1}_{p_{i} > p_{j}} \tag{3}$$

| Methods | Accuracy rate (%) | *F*‐measure | ROC area | RMSE |
|---|---|---|---|---|
| Naive Bayes | 81.33 | 0.819 | 0.974 | 0.1644 |
| Bayesian network | 91.55 | 0.916 | 0.993 | 0.1242 |
| kNN (k = 3) | 89.68 | 0.896 | 0.971 | 0.135 |
| Logistic regression | 85.55 | 0.854 | 0.977 | 0.1507 |
| Multilayer perceptron | 74.57 | 0.734 | 0.957 | 0.1937 |

**Table 2.** The accuracy rates, *F*‐measure, ROC area and root mean squared error values of the statistical methods.

**Table 3.** Confusion matrix of Bayesian network.


Here, *i* runs over all *m* data points with true label 1 and *j* runs over all *n* data points with true label 0; *p*i and *p*j denote the probability scores assigned by the classifier to data points *i* and *j*, respectively. **1** is the indicator function, which equals 1 when the condition *p*i > *p*j is satisfied and 0 otherwise [31].
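Eqs. (1)–(3) can be computed directly from the classifier outputs. The snippet below is a minimal sketch of the three metrics in Python (the function names are illustrative; ties between scores in the AUC pair comparison are ignored, exactly as in Eq. (3)):

```python
import numpy as np

def f_measure(tp, fp, fn):
    """Eq. (1): harmonic mean of precision TP/(TP+FP) and recall TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(x_obs, x_model):
    """Eq. (2): square root of the mean squared difference between
    observed and modelled values."""
    x_obs, x_model = np.asarray(x_obs), np.asarray(x_model)
    return np.sqrt(np.mean((x_obs - x_model) ** 2))

def auc(pos_scores, neg_scores):
    """Eq. (3): fraction of (positive, negative) pairs in which the
    positive example receives the higher probability score."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.mean(pos > neg)
```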

According to **Table 2**, the Bayesian network method has the best accuracy rate (91.55%) and the minimum RMSE value. The best values of both the ROC area and the *F*‐measure also belong to the Bayesian network method. The confusion matrix for the Bayesian network is given in **Table 3**, and the ROC curves of the five methods (Bayesian network, kNN, naïve Bayes, logistic regression and multilayer perceptron) are given in **Figures 8**–**12**.

According to **Table 3**, the recognition accuracy for cleaning is about 75%. This activity does not have simple characteristics and is easily confused with other activities. For example, 19 of 235 teeth‐brushing patterns are misclassified as cleaning, and 39 of 256 cleaning patterns are misclassified as teeth brushing. In addition, writing‐on‐board patterns are confused with teeth brushing, and cleaning patterns are confused with running and walking, because the cleaning activity involves walking.

**Development of the mobile application:** According to the results shown in **Table 2**, the Bayesian network method is used in the Android Wear‐based classification application for recognizing human activities (**Figure 13**).

The developed smartwatch application collects sensor data and converts it into a pattern at 4‐second intervals. It then classifies the pattern using the trained Bayesian network model and the WEKA API and shows the detected activity on the smartwatch screen (**Figure 14b**). At this step, the smartwatch application does not need the smartphone. It is also possible to report the detected activities on the smartphone screen via the application developed for Android smartphones (**Figure 14a**).

Steps of the algorithm and sample Java code used in the activity detection application are given in **Figure 15**.
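The actual on-watch implementation uses Java and the WEKA API, as noted above. Purely as an illustration of the same control flow (not the authors' code), the recognition loop can be sketched in Python as follows, where `collect_window` is a hypothetical function returning the latest 4-second accelerometer buffer and `model` is any classifier with a scikit-learn-style `predict` method trained offline on the six features described earlier:

```python
import time
import numpy as np

ACTIVITIES = ["running", "walking", "typing", "writing", "standing",
              "writing on board", "stationary", "cleaning", "teeth brushing"]

def recognition_loop(collect_window, model, period_s=4):
    """Every 4 seconds, turn the latest 200-sample accelerometer window
    into one six-value feature pattern, classify it and report the activity."""
    while True:
        seg = np.asarray(collect_window())                      # shape (200, 3)
        pattern = np.concatenate([seg.mean(axis=0), seg.std(axis=0)])
        label_index = int(model.predict(pattern.reshape(1, -1))[0])
        print("Detected activity:", ACTIVITIES[label_index])
        time.sleep(period_s)
```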

**Figure 8.** ROC curve for classification by Bayesian network.

**Figure 9.** ROC curve for classification by kNN.

**Figure 10.** ROC curve for classification by naïve Bayes.


**Figure 11.** ROC curve for classification by logistic regression.

**Figure 12.** ROC curve for classification by multilayer perceptron.

**Figure 13.** Detected human activities.

**Figure 14.** (a) Detected activities reporting application for smartphone, (b) activity recognition application for smartwatch.

**Figure 15.** Steps of algorithm and sample Java codes.


## **5. Conclusion**

In this chapter, human activity recognition on smartwatches using statistical methods is studied. The Bayesian network method is found to be the best method for the dataset used in the study. Through this work, it is possible to understand how to classify human activities by using statistical learning methods and sensor data. Only accelerometer data are used, for nine different activities. Future studies on human activity recognition could be improved by using the other sensors that smartwatches have (heart rate monitor, ambient light, GPS and gyroscope), by detecting more activities through an increased number of classes (handshake, smoking, drinking, etc.) or by separating more complex variants of activities (e.g. walking with hands in pockets, walking hand in hand).

Nowadays, smartwatches and wrist‐worn sensors are used in daily activity monitoring and healthy lifestyle applications. These devices can also help warn the user in daily life and encourage healthy, sportive habits. For example, a smartwatch can send a reminder warning the user about staying stationary for a long time. Such devices and applications can give people information such as how much they walk, how long they sleep and how many calories they burn. In addition, this kind of work also contributes to virtual reality applications.

## **Acknowledgment**

This study is supported by Muğla Sıtkı Koçman University Scientific Research Projects under grant number 016‐061.

## **Author details**

Serkan Balli\* and Ensar Arif Sağbas

\*Address all correspondence to: serkan@mu.edu.tr

Department of Information Systems Engineering, Faculty of Technology, Muğla Sıtkı Koçman University, Muğla, Turkey

## **References**


[1] John J. Guiry, Pepijn van de Ven and John Nelson. Multi‐Sensor Fusion for Enhanced Contextual Awareness of Everyday Activities with Ubiquitous Devices. Sensors. 2014;**14**(3):5687–5701. DOI: 10.3390/s140305687

[2] Ensar Arif Sağbaş and Serkan Ballı. Transportation Mode Detection by Using Smartphone Sensors and Machine Learning. Pamukkale University Journal of Engineering Sciences. 2016;**22**(5):376–383. DOI: 10.5505/pajes.2015.63308

[3] Muhammad Shoaib, Stephan Bosch, Hans Scholten, Paul J. M. Havinga and Ozlem Durmaz Incel. Towards Detection of Bad Habits by Fusing Smartphone and Smartwatch Sensors. In: Pervasive Computing and Communication Workshops; 23–27 March; St. Louis, MO. IEEE; 2015. pp. 591–596. DOI: 10.1109/PERCOMW.2015.7134104

[15] Juha Parkka, Mikka Ermes, Kari Antila, Mark Van Gils, Ari Manttari and Heikki Nieminen. Estimating Intensity of Physical Activity: A Comparison of Wearable Accelerometer and Gyro Sensors and 3 Sensor Locations. In: Engineering in Medicine and Biology Society; 22–26 August; Lyon. IEEE; 2007. pp. 1511–1514. DOI: 10.1109/IEMBS.2007.4352588

[16] Sheng Shen, He Wang and Romit Roy Choudhury. I Am a Smartwatch and I Can Track My User's Arm. In: Mobile Systems, Applications and Services; 25–30 June; Singapore. ACM; 2016. DOI: 10.1145/2906388.2906407

[17] Gerald Bieber and Christian Peter. Using Physical Activity for User Behavior Analysis. In: Pervasive Technologies Related to Assistive Environments; 15–19 July; Athens, Greece. New York: ACM; 2008. DOI: 10.1145/1389586.1389692

[18] Ling Bao and Stephen S. Intille. Activity Recognition from User‐Annotated Acceleration Data. In: Pervasive Computing; 18–23 April; Linz. Berlin: Springer‐Verlag; 2004. pp. 1–17.

[19] Ki‐Hoon Kim, Mi‐Young Jeon, Ju‐Young Lee, Ji‐Hoon Jeong and Gu‐Min Jeong. A Study on the App Development Using Sensor Signals from Smartphone and Smartwatch. Advanced Science and Technology Letters. 2014;**62**:66–69. DOI: 10.14257/astl.2014.62.17

[20] George H. John and Pat Langley. Estimating Continuous Distributions in Bayesian Classifiers. In: Uncertainty in Artificial Intelligence; 18–20 August; Quebec. San Francisco: Morgan Kaufmann; 1995. pp. 338–345.

[21] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. New Jersey: John Wiley & Sons, Inc.; 2004. 350 p.

[22] Bjørn Kåre Alsberg, Royston Goodacre, Jem J. Rowland and Douglas Kell. Classification of Pyrolysis Mass Spectra by Fuzzy Multivariate Rule Induction–Comparison with Regression, K‐Nearest Neighbour, Neural and Decision‐Tree Methods. Analytica Chimica Acta. 1997;**348**(1):389–407. DOI: 10.1016/S0003‐2670(97)00064‐0

[23] Chao‐Ying Joanne Peng, Kuk Lida Lee and Gary M. Ingersoll. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research. 2002;**96**(1):3–14. DOI: 10.1080/00220670209598786

[24] Irad Ben‐Gal. Bayesian Networks. In: Fabrizio Ruggeri, Ron Kenett and Frederick Faltin, editors. Encyclopedia of Statistics in Quality and Reliability. Chichester, UK: Wiley; 2007.

[25] Rahul Ravindran, Riya Suchdev, Yash Tanna and Sridhar Swamy. Context Aware and Pattern Oriented Machine Learning Framework (CAPOMF) for Android. In: Advances in Engineering and Technology Research; 1–2 August; Unnao. IEEE; 2014. pp. 1–7. DOI: 10.1109/ICAETR.2014.7012912

[26] Android. Sensors Overview [Internet]. Available from: https://developer.android.com/guide/topics/sensors/sensors\_overview.html [Accessed: 15.06.2016]

[27] Motorola. Moto 360 [Internet]. Available from: http://www.motorola.com/us/products/moto-360 [Accessed: 15.06.2016]

[28] Stephen R. Garner. WEKA: The Waikato Environment for Knowledge Analysis. In: Computer Science Research Students Conference; April; New Zealand; 1995. pp. 57–64.


## **A Statistic Method for Anatomical and Evolutionary Analysis**

Roqueline Ametila e Gloria Martins de Freitas Aversi-Ferreira, Hisao Nishijo and Tales Alexandre Aversi-Ferreira

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66445

#### **Abstract**

Rules, formulas, and statistical tests have been widely used in studies that analyze continuous variables with the normal (Gaussian) distribution or defined parameters. Nevertheless, in some studies, such as those in gross anatomy, only statistics with discrete or nominal variables are available. In fact, the existence or absence of an anatomical structure, its features and internal aspects, innervation, arterial and venous supply, etc., can be analyzed as discrete and/or nominal variables. However, there have been no adequate methods for transforming data with qualitative/nominal variables in gross anatomy into quantitative variables. To resolve this issue, we have proposed a new method, the statistical method for comparative anatomy (SMCA), which allows descriptions based on numerical analyses, together with a formula for comparing groups of anatomical structures among different species from an evolutionary perspective. The important features of this method are as follows: (1) it allows the analysis of numerical data converted from discrete or nominal variables in morphological areas, and (2) it quantitatively compares identical structures within the same species and across different species. The SMCA fills the lack of a specific method for statistical work in comparative anatomy and morphology in general, and for evolutionary correlations.

**Keywords:** nonparametric statistics, nominal variables, anatomy, morphology, evolution

## **1. Introduction**

Statistical analysis is widely used in almost all scientific fields to support discussion of and conclusions from data [1]. In most cases, the variables analyzed are

continuous ones that can be analyzed by calculating an average and standard deviation, i.e., the data can be fitted approximately to the Gaussian (normal) probability distribution. A Gaussian distribution means that, for the studied variables, most of the population is concentrated around the average, i.e., the data are grouped symmetrically around the average [2]; when plotted in the Cartesian plane, these data display a geometric shape similar to an inverted bell, called the Gaussian curve, showing a central tendency. In a perfect Gaussian distribution, the average is located at the center of the curve and the frequency of the data for a studied variable decreases toward the lateral extremes. Statistics that assume this type of distribution around the average are called parametric statistics.

It is important that the distributional characteristics of the data be analyzed before any statistical calculation. However, variables that do not follow a Gaussian distribution are sometimes submitted to statistical calculations under the assumption of a normal probability distribution. In these cases, applying mathematical tools under the assumption of normality would induce errors; more specifically, acceptance or rejection of statistical hypotheses could be incorrect [1]. These kinds of errors are becoming common, mainly because of indiscriminate and incorrect use of statistical software. Statistical programs are very important because they allow fast, concise, and reliable analyses of variables and include tests of normality. However, data that do not display a normal distribution are sometimes analyzed using these programs, and the results usually indicate no statistical significance.

To understand the importance of correct statistics, imagine a man whose feet are placed in a refrigerator and whose head is placed in a stove, so that the average temperature around him is 24°C; he would nevertheless consider the temperatures very uncomfortable [3]. In this case, the average hardly supports a correct interpretation of the data. Indeed, these extreme values could yield an acceptable average, yet such values generally do not fit a Gaussian distribution and therefore cannot be analyzed by parametric statistics [1–3].

Data that cannot be submitted to parametric designs should be analyzed using nonparametric statistics. Indeed, other types of averages can be calculated based on nonnormal distributions, such as the binomial or chi-square (*χ*²) distributions. However, these usually do not allow analyses of central tendency and require randomness among the data. Furthermore, nonparametric statistics are less precise than parametric ones [4].

In gross anatomy, there are many cases in which numerical data are not available for analysis. Anatomical studies must analyze the absence or presence of a structure or organ and the characteristics associated with it; for example, the presence or absence of specific nerves and vessels in muscles, and the distribution of these structures when they are present. These are qualitative variables, not numerical ones, and numbers alone cannot provide this type of information.

It is well known that anatomical texts include vast descriptions of structures, relationships between structures, axes, and positions of the body. These observations indicate that specific statistical methods are required, especially for comparative anatomical studies. However, previous statistical methods do not allow accurate analyses of anatomical data and could only assist in discussing them. Some anatomical studies have tried to analyze qualitative variables more objectively using nonparametric statistics such as the chi-square (*χ*²) test [5]. The basis of the chi-square statistic is association among the data; thus, it is an important tool for multivariate analysis of discrete variables that are considered independent. However, the statistical hypothesis in this test does not agree with Darwin's theory of evolution, *inter alia*, because of the assumption of the existence of a common ancestor [6]. Indeed, the assumption of a common ancestor implies similarities of structures across species, i.e., they cannot be random in organisms that evolved from the same ancestor, since the ancestral animal provided basic structures and could pass derivative features to descendants (for a detailed review, see Ref. [7]). Therefore, applying statistical methods such as the chi-square test, which are based on the premise of randomness, could induce a hypothetical error.

It is reasonable that when measures of central tendency cannot be used, nonparametric methods must be chosen [2], mainly because of small sample sizes [3]. Nonparametric methods are also used when it is difficult to set up quantitative variables. Indeed, the percentage of a given structure, based on its frequency in the samples, is one of the key measures in nonparametric statistics used in gross anatomy. In gross anatomy, the highest percentage of occurrence of a given structure is called **normal**, while the lowest percentage is called **variation** [8]. Percentages of **normal** and **variation** can be analyzed using nonparametric methods.

Until a few years ago, gross anatomy had no specific statistical method for analyzing noncontinuous variables describing anatomical structures. Here, we present a new statistical method based on nonparametric statistics that is more consistent with anatomical descriptions. We also compare this new method with the cladistics used for evolutionary analyses, to indicate the usefulness of the new method in this discipline.

## **2. Concepts of the statistical methods for gross anatomy**

In this section, we show that the new statistical method is based on the anatomical concept of normality, that an appropriate weight is assigned to each variable (the parameter for a specific feature) of a structure according to the importance of that variable, and that conclusions can be drawn from values integrated across multiple variables. Results obtained with this method have been reported in our previous papers, in which it was designed to compare muscles not only within the same species but also across different species in comparative gross anatomy [1, 7, 9–11].

## **2.1. Anatomical concept of normality and variation**


The initial step in the statistical method for comparative anatomy (**SMCA**) is to calculate the frequency based on the concepts of **normality** and **variation** in anatomy. "A normal structure" means one that is observed in more than 50% of cases within the same species; therefore, a variation is observed in less than 50% of cases [8]. A summary of the steps for calculating the SMCA is shown in **Table 1**.

Muscles are good examples of animal structures to which the SMCA can be applied, because muscles require several variables to describe their characteristics: shape, innervation, vascularization, origin, insertion, and number. Different individuals in the same species, or individuals of different species, may display different numbers of muscles, as in the contrahentes muscles in primates.

The formula relating the total number of studied structures to the numbers of normal structures and of variations is shown below:

$$N = N_{j} = \sum_{i=1}^{q} \left( r_{v(ijk)} + n_{v(ijk)} \right) \tag{1}$$

where *N* is the total number of analyzed structures, *n***v** is the number of structures with variation, and *r***v** is the number of normal structures (*N* − *n***v**). The subscript *i* indicates the species (for instance, humans, chimpanzees, etc.), the subscript *j* indicates the structure, and the subscript *k* indicates the parameter (variable) of the structure. Thus, normal structures and variations together must account for the total (100%) of these structures.

In the case of muscles, the parameters should include at least the following four: (1) innervation, (2) origin, (3) insertion, and (4) vascularization. For instance, in the case of the biceps in a specific species, *j* = 1 and *i* = 1. The data analysis in this step should be performed in terms of the following four parameters: (1) innervation (*rv(***111)**, *nv(***111)**), (2) origin (*rv(***112)**, *nv(***112)**), (3) insertion (*rv(***113)**, *nv(***113)**), and (4) vascularization (*rv(***114)**, *nv(***114)**). Furthermore, (5) the number of muscles (*rv(***115)**, *nv(***115)**) and (6) the shape (*rv(***116)**, *nv(***116)**) could be added for more detailed analyses. In addition, further detailed parameters (subscript *h*) could be added (see below in detail).

The next step is calculation of the relative frequency (**RF =** *P***ijk**) of normal structures in each parameter against the total number of structures based on frequencies of normality and variation, i.e., (1) innervation (*rv(***111)**, *nv(***111)**), (2) origin (*rv(***112)**, *nv(***112)**), (3) insertion (*rv(***113)**, *nv(***113)**), and (4) vascularization (*rv(***114)**, *nv(***114)**). According to these frequencies, each RF for (1) innervation (*P***ij1**), (2) origin (*P***ij2**), (3) insertion (*P***ij3**), and (4) vascularization (*P***ij4**) can be calculated, as follows:

$$\mathrm{RF} = P_{ijk} = \frac{r_{v(ijk)}}{N} \tag{2}$$

When the structure is a paired organ, *N* (the number of individuals in a sample) must be multiplied by 2. It is also possible to calculate *P***ijk** separately for the structure on each side of the body. Although any value can be used for *N*, a smaller *N* will result in lower statistical power; it is obviously preferable to analyze large numbers of specimens, and analyses with very small numbers of specimens are not appropriate for scientific conclusions.
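As a minimal numerical sketch of Eq. (2) (in Python; the function name and arguments are illustrative, not part of the original method), the relative frequency of the normal form of one parameter is simply the count of normal observations divided by the total number of structures, with the sample doubled for paired organs:

```python
def relative_frequency(r_normal, n_individuals, paired=False):
    """Eq. (2): RF = P_ijk = r_v(ijk) / N.  N is the total number of
    analyzed structures; for paired organs it is the number of
    individuals multiplied by 2."""
    n_structures = 2 * n_individuals if paired else n_individuals
    return r_normal / n_structures

# Example (made-up numbers): 27 of 30 dissected muscles show the
# normal innervation -> relative_frequency(27, 30) == 0.9
```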

It is noted that qualitative features are transformed into quantitative data once the initial data are expressed as percentages. Thus, the method allows numerical description of anatomical structures, which increases the precision of the description of their characteristics. Another useful aspect of this method is that the value of **RF** can be obtained from the previous literature as the prevalence (percentage) of the structure in a given species.



This is especially important when the study includes comparative anatomy. For example, the palmaris longus can be absent in humans [8, 10, 12] and its prevalence is around 90% [13]; therefore, **RF** might be 90% among all individuals. However, in analyses of the innervation, vascularization, origin, or insertion of the palmaris longus, only that 90% receives attention and the data from the remaining 10% are sometimes discarded. Such cases are common in comparative studies, where usually only data from specific species are studied.

A normal structure in each parameter means 0.5 < *P***ijk** ≤ 1 in practical terms, because a quantity less than 50% does not match the definition of a normal structure in anatomy. However, mathematically, *P***ijk** can vary as 0 ≤ *P***ijk** ≤ 1. For a given species, *P***ijk**, according to the concept of normality, must be greater than 0.5. However, when different species are compared, the frequencies of the **normal** structures could differ. For example, when the dorsoepitrochlearis muscle is compared among primates and *Homo*, it is rarely observed in modern humans [14] and its approximate *P***ijk** is 0.05 (this value was derived from the literature reporting the percentage of individuals presenting this muscle), while in nonhuman primates the dorsoepitrochlearis muscle is a normal feature and its *P***ijk** is 1.00. Furthermore, some muscles have more than one origin or insertion, as in the triceps brachii with three heads; in rare cases, this muscle has four heads of origin in *Homo* [8, 13]. Therefore, there are only two kinds of origin in this muscle: type 1, with three heads, as the normal feature, and type 2, with four heads, as the variation.

For accurate and detailed analyses, *P***ijk** must be calculated for multiple additional parameters of the muscles or other structures. For example, for muscles, the parameters should include, at least, (1) the number or kinds of nerves, or branches of the same nerve (*P***ij1**), (2) the origin(s) of the muscles (*P***ij2**), (3) the insertion of the muscles (*P***ij3**), and (4) the vascularization of the muscles by arteries or branches of one artery (*P***ij4**). These parameters should be chosen according to the goal of the analysis; some parameters could be removed while others could be added. Furthermore, (5) the quantity of muscles (*P***ij5**) and (6) their shape (*P***ij6**) could be included in more detailed studies. It is noted that a small number of parameters results in less characterization of the studied structure.

By introducing this parameter (*P***ijk**) into the method (SMCA), anatomical characteristics (represented by *P***ijk**) can be compared among different samples within the same species or across different species, which is a useful characteristic of the SMCA for studies of comparative anatomy. For example, variations in the number of muscles can be compared within the same species or across different primate species [12].

#### **2.2. Definition of pondered average of frequency (PAF)**

In the next step of the SMCA, in which multiple features (*P***ijk**) of a given structure are compared among different species, a unique variable (PAF), an integrated value over the multiple parameters (*P***ijk**), is computed. For this purpose, pondered values [the weighted coefficients (*w***k**)], by which each *P***ijk** is multiplied, are specified. The coefficients must be specified according to the anatomical importance of a given parameter in the assessment of anatomical similarity. For example, since a small value of *P***ijk** reflects large variation, that characteristic is not important in assessing structural similarity; therefore, parameters whose *P***ijk** is small must be associated with small weighted coefficients. On the other hand, parameters whose *P***ijk** is large (i.e., with few variations) must be associated with larger weighted coefficients (see below).


After designation of pondered values as weighted coefficients, the pondered average of frequencies (**PAF =** *P***w(ij)**) is computed according to the following formula:

$$\mathrm{PAF} = P_{w(ij)} = \frac{\sum_{k} w_{k} \, P_{ijk}}{\sum_{k} w_{k}}, \quad \text{for any species } (i = 1, 2, \ldots, s) \text{ and any muscle } (j = 1, 2, \ldots, m) \tag{3}$$

where *P***ijk** is the relative frequency and *w***k** is the weighted coefficient linked to a specific parameter. For instance, for muscle 1 of species 1, *P***111** is the relative frequency of innervation and its weighted coefficient *w***1** is 3; *P***112** is the relative frequency of the muscle origin and *w***2** is 2; *P***113** is the relative frequency of the muscle insertion and *w***3** is 2; and *P***114** is the relative frequency of vascularization and *w***4** is 1 [9].

The idea of the weighted coefficients (*w***k**) is based on the frequencies of variation in the studied structures. Structures with less variation receive a larger weight, and more variable structures a smaller weight. For example, if vessels received a large weight in the comparison, this could compromise the final results, suggesting a large anatomical difference among specimens or species in spite of little difference in the muscles themselves. Therefore, in the case of muscles, we gave the weighted coefficient 3 to **innervation** (*k* = 1, *w***1** *= 3*). During the embryonic development of animals, a given nerve terminates on a given muscle [14]; thus, variations in the innervation of muscles are few, and variation in innervation is very sensitive to differences among individuals of the same species and also to differences among species. Among the four parameters for muscles noted above, i.e., innervation, origin, insertion, and vascularization, innervation shows the least variation, **origin** and **insertion** usually show similar variation, and vascularization shows the most variation. Thus, origin and insertion should receive the same weighted coefficient 2 (*w***2** *= 2* for origin, *w***3** *= 2* for insertion). Finally, the parameter with the greatest variation, **vascularization** (*k* = 4), received the weighted coefficient 1 (*w***4** *=* 1). Indeed, vascularization can differ between the same muscles on the two sides of the body within the same individual [12].

Zero cannot be accepted as a weighted coefficient (*w***k**); therefore, *w***k** must be greater than zero, i.e., *w***k** > 0. To make the calculation easier and to keep the parameters clear, the best choice is to use only integer values, i.e., *w***k** ≥ 1. Accordingly, it is very important to keep in mind that the choice of the weighted coefficients should depend on the different degrees of variation of the structures: the highest weighted coefficient for the parameter with the lowest variation, and the same weighted coefficient for parameters with identical degrees of variation. The designated numbers should also be integers or discrete values, since it does not make sense to look for values that represent exact differences among descriptive or nominal variables.
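A minimal sketch of Eq. (3) and of the weighting scheme suggested above (Python; the function name is illustrative and the example frequencies are made up for demonstration only):

```python
def paf(p_ijk, weights):
    """PAF (Eq. (3)): weighted average of the relative frequencies P_ijk
    of one structure, P_w(ij) = sum_k(w_k * P_ijk) / sum_k(w_k)."""
    assert len(p_ijk) == len(weights) and all(w >= 1 for w in weights)
    return sum(w * p for w, p in zip(weights, p_ijk)) / sum(weights)

# Illustrative values for one muscle: innervation, origin, insertion and
# vascularization with the weights 3, 2, 2, 1 discussed in the text.
# paf([1.00, 0.90, 1.00, 0.75], [3, 2, 2, 1])  ->  0.94375
```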

## **2.3. Definition of comparative anatomy index (CAI) for comparison among different species**

In normal structures, *P***w(ij)** must be greater than 0.5 and less than or equal to 1, i.e., 0.5 < *P***w(ij)** ≤ 1. In fact, *P***w(ij)** will be 1 if every *P***ijk** has the maximal value 1, and it will approach 0.5 if every *P***ijk** is at its minimum (just above 0.5). Mathematically, *P***w(ij)** can vary within the range 0 ≤ *P***w(ij)** ≤ 1, since *P***w(ij)** could be zero or less than 0.5 in analyses using different species in which the structure might not be normal.

It is noted that *P***w(ij)** [which is a function of the relative frequency of each specific feature (*P***ijk**) and its weight (*w***k**)] can be used to assess the mathematical similarity of a population of anatomical data; equal values indicate high similarity, and a large difference in the values between two species indicates dissimilarity or lower similarity. In order to compare structures among different species, *P***w(ij)** has to be calculated for each species.

Before calculating *P***w(ij)**, each *P***ijk** must be computed according to the data of species used as a reference, i.e., the **control species**. For example, the coracobrachialis, a muscle of the arm, could have one or two cranial heads in different primates [15]. For the coracobrachialis, *P***ijk** could be different depending on the number of cranial heads in the **control species**. Thus, *P***ijk** must be consistently calculated in reference to **control species**, since different species could have different normal structures (see below in detail).

For example, there are two types of origin in the coracobrachialis (*j* = 1): type 1 has one origin and type 2 has two origins (*k* = 2). *P***ijk** can take different values according to the number of heads in the reference species (**control species**) (*i* = 1). In a noncontrol species to be studied (*i* = 2) in which type 1 (one origin) is normal, *P***212** will be 1 in reference to a control species with one head, and *P***212** of type 1 will be 0.5 in reference to a control species in which type 2 (two heads) is normal. In the case of a muscle that has from one to three heads of origin across species, the head counts should be divided by the maximum number (i.e., 3), in steps of 1/3, because *P***ijk** should not be greater than 1. Accordingly, when a **control species** with three heads of cranial origin of a muscle is chosen as the reference, *P***212** is 1.000 (3 heads, *i* = 2), *P***312** is 0.667 (two heads, *i* = 3), and *P***412** is 0.333 (1 head, *i* = 4).

Therefore, *P***ijk** must first be obtained for the **control species**. When the **control species** (*i* = 1) has two anatomically normal cranial heads (*k* = 2) for the coracobrachialis (*j* = 1), and if all individuals of this species have two cranial heads (100%), *P***112** should be 1; if 90% of the individuals have two cranial heads, *P***112** should be 9/10. In the case of a noncontrol species (*i* = 2) in which the anatomical normal is one cranial head for the coracobrachialis, if all individuals have one head, *P***212** should be 0.5, and if 90% of the individuals have one head, *P***212** should be 0.45. These values (*P***ijk**) can be obtained from the data in previous studies and applied to the CAI analysis (see below).
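One way to read the worked example above (a hedged interpretation for illustration, not a formula stated explicitly in the text) is that the frequency observed in a noncontrol species is scaled by the ratio between its normal head count and that of the control species; a small sketch:

```python
def p_ijk_vs_control(heads_species, heads_control, prevalence=1.0):
    """Relative frequency of a head-count type in a noncontrol species,
    expressed in reference to the control species (illustrative reading
    of the coracobrachialis example in the text)."""
    return (heads_species / heads_control) * prevalence

# Reproduces the numbers quoted in the text:
# p_ijk_vs_control(1, 2, 1.0)  -> 0.5    (one head vs a two-headed control)
# p_ijk_vs_control(1, 2, 0.9)  -> 0.45   (90% prevalence)
# p_ijk_vs_control(2, 3, 1.0)  -> 0.667  (two heads vs a three-headed control)
```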

Although any species can be defined as the control species, the species studied first or the species with the most abundant known data should be chosen. To compare any single structure (e.g., a muscle) between two different species (*i* ≠ *i*'), the data of any noncontrol species can be compared one by one with those of the control species using the comparative anatomy index (CAI), defined by the following formula:

$$\mathrm{CAI}_{ii'} = \left| P_{w(ij)} - P_{w(i'j)} \right|, \quad \text{where } i \neq i' \tag{4}$$

The **CAIii'** formula represents the absolute difference between the weighted averages (*P***w(ij)**) of a single structure in the **control** species *i* and another, noncontrol species *i*'. In this comparison, the noncontrol species *i*' should always be compared with the same **control species** *i*. For example, the formula to compare a structure (*j* = 1) between the *i* and *i***'** species is as follows:

$$\mathrm{CAI}_{ii'} = \left| P_{w(i1)} - P_{w(i'1)} \right| \tag{5}$$

It is noted that **CAIii'** ranges from 0 to 1, i.e., 0 ≤ **CAIii'** ≤ 1, because the maximum value of *P***w(ij)** is 1 and the minimum is 0. Note that this equation permits comparison of only one structure between the two species.

#### **2.4. Definition of group comparative anatomy index (GCAI) for comparison of a group of structures among different species**

However, the **SMCA** analysis of the forearm muscles [9] showed the need to compare many muscles of the same functional group between different species, for instance the deep flexor muscles of the forearm among the studied primate species. This need is reasonable because these muscles work together in certain functions, such as closing the hand, so comparing them as a group is more appropriate in terms of physiology, phylogeny, taxonomy, and evolution as well. Thus, the **GCAI** was proposed to compare a group of muscles among species [1, 9, 11], one species at a time, based on the average of the *P***w(ij)** values, as follows:

$$P_{w(i)} = \frac{\sum_{j=1}^{m} P_{w(ij)}}{m_{j}} \tag{6}$$

where *i* indexes the species (*i* = 1, 2, …, s), *j* indexes the studied structures (*j* = 1, 2, …, m), and *m***j** is the number of structures studied in a sample. Usually, *m***j** equals *m* (*m***j** = *m*), because the same number of structures is usually studied in each species.

The **GCAI**, which represents the difference in *P***w(i)** based on multiple structures between the **control** (*i*) and a noncontrol (*i'*) species, is defined by the following formula:

$$\mathrm{GCAI}_{ii'} = \left| P_{w(i)} - P_{w(i')} \right| \tag{7}$$

or


$$\mathrm{GCAI}_{ii'} = \left| \frac{\sum_{j=1}^{m} P_{w(ij)}}{m_{j}} - \frac{\sum_{j'=1}^{m} P_{w(i'j')}}{m_{j}} \right| \tag{8}$$

Based on the above, in the SMCA, values close to 0.000 suggest **high similarity** of the structures between the species, and the value 1.000 indicates completely different structures. Thus, we can rank similarity among species based on the SMCA. For example, we can define CAI or GCAI values of 0 as high similarity among the structures analysed, values from 0 to 0.200 as similar structures, values from 0.200 to 0.650 as somewhat similar, and values from 0.650 to 1.000 as dissimilar. Thus, the **GCAI** is the absolute difference in the mean weighted averages *P***w(ij)** of multiple muscles between two species.
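Finally, the comparison indices and the qualitative ranking above can be sketched as follows (Python; the function names are illustrative):

```python
def cai(paf_control, paf_other):
    """Eq. (4): comparative anatomy index for a single structure."""
    return abs(paf_control - paf_other)

def gcai(pafs_control, pafs_other):
    """Eqs. (6)-(8): absolute difference between the mean P_w(ij) of a
    group of structures in the control and in a noncontrol species."""
    mean = lambda values: sum(values) / len(values)
    return abs(mean(pafs_control) - mean(pafs_other))

def similarity_rank(index):
    """Qualitative interpretation of a CAI or GCAI value used in the text."""
    if index == 0:
        return "high similarity"
    if index <= 0.200:
        return "similar"
    if index <= 0.650:
        return "somewhat similar"
    return "dissimilar"
```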

In **Table 2**, we show that different structures (muscles) of different species can be compared in reference to a control species. Eight specimens of *Sapajus* sp. (16 forearms, with a total of 304 muscles) were analyzed [9], and data derived from the previous literature on chimpanzees, gorillas, baboons, and humans were compared with the muscles of the control species (*Sapajus* sp.). For other examples of the application of the SMCA to muscles, see Aversi-Ferreira et al. [7, 9–11]. In the same way, we also analyzed 4 muscles of the arm in 4 Japanese monkeys, for a total of 16 structures [11], which were compared with data on modern humans and other primates obtained from previous studies, excluding those for *Sapajus* sp.


[**Table 2**, summarized: for the superficial dorsal group of the forearm, the extensor digitorum communis and the extensor digiti quinti proprius are described by origin (lateral epicondyle of the humerus), insertion (dorsal aponeurosis in the second to fifth proximal phalanges; only one insertion tendon to the little finger), innervation (radial nerve) and vascularization (radial artery) in the bearded capuchin (*Sapajus* sp.) (BC) [chosen as control species], and compared with men, chimpanzees and baboons. The per-species entries range from "highly similar to BC, CAI = 0.00" to "somewhat similar to BC, CAI = 0.22" (e.g., lesser variation regarding the distribution of tendons to the fingers; fleshy portion well detached), with GCAI values of 0.00 (highly similar to BC), 0.11 (similar to BC) and 0.22 (somewhat similar to BC).]

Note: The simple description is accompanied by a numerical analysis, i.e., qualitative features were transformed into numerical data in order to compare structures of different species objectively. The CAI or GCAI values chosen for the comparison were: 0 for high similarity among the structures analysed, 0 to 0.200 for similar structures, 0.200 to 0.650 for somewhat similar, and 0.650 to 1.000 for dissimilar.

**Table 2.** Examples of the analysis using SMCA applied to muscles of the forearm.

For humans, variation of these structures is very well documented, as it is for great apes. Other animals, except domestic ones, are scarcely studied, and where data exist the number of specimens or species may be small. Nevertheless, calculation of the SMCA based on previous studies, even without the values of *N*, provides quantitative data for analyzing morphological distances of structures among species or within the same species.

## **2.5. Comparison of the SMCA with other nonparametric statistics**

Another possibility for studying nominal variables is the cladistics method commonly used in evolutionary studies. This method assumes binary characters, and any other coding would be an error because these features are mutually exclusive [15, 16]. The method is useful for obtaining objective and/or precise information on evolution-related structures regarding the absence or presence of a structure across different species. However, this characteristic limits its application to morphological analyses of structures, since it considers just two states: 0 for an absent character and 1 for a present one. Nevertheless, it remains important in evolutionary studies because it provides evolutionary information. Cladistics analyses prioritize primitive and derived features [15, 16], while the morphological analysis studied here (SMCA) prioritizes the predominant characters observed in a given structure.

We previously compared SMCA with other nonparametric methods, including cladistics [7, 10]. In fact, the SMCA accepts more variables per structure than the cladistics method. For a more detailed comparison, see Aversi-Ferreira et al. [7].

## **3. Conclusion**


It is desirable to quantitatively assess any kind of data, even in gross anatomy [7, 10], as this supports more precise discussion and more reliable conclusions [17]. Indeed, according to Lord Kelvin, "When you can measure what you are speaking about, and express it in numbers, you know something about it." Our objective is to provide a statistical test for gross anatomy to numerically compare structures of different subjects within the same species and across different species, which should be useful for analysing the data of comparative anatomy more precisely and objectively.

The SMCA is a new statistical method and requires further verification with more data. We reported SMCA analyses previously [7, 9–11], and the SMCA satisfactorily incorporated many qualitative data numerically. In conclusion, the main features of SMCA are as follows: (1) it allows numerical description of data expressed as discrete or nominal variables in comparative anatomy and other areas of morphology and (2) it provides an at least more precise (numerical) method for comparing samples of structures from the same species and from different species. Thus, the SMCA fills the lack of an appropriate method for statistical work in comparative anatomy, in other areas of morphology, and in other disciplines such as taxonomy, phylogenetics, and evolution.

## **Acknowledgments**

T.A. Aversi-Ferreira is a recipient of Scholarship Research Productivity from National Council of Technology and Development (CNPq/Brazil). This work was supported partly by the Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Scientific Research (B) (16H04652). The authors declare no competing financial interests.

## **Author details**

Roqueline Ametila e Gloria Martins de Freitas Aversi-Ferreira2 , Hisao Nishijo1 and Tales Alexandre Aversi-Ferreira1,3,\*

\*Address all correspondence to: aversiferreira@gmail.com

1 System Emotional Science, Department of Physiology, School of Medicine and Pharmaceutical Sciences, University of Toyama, Toyama, Japan

2 Nursing and Pharmacy School, FAPAL, Tocantins, Brazil

3 Institute of Biomedical Sciences, University of Alfenas, MG, Brazil

## **References**


[1] Aversi-Ferreira TA. A new statistical method for comparative anatomy. Int J Morphol. 2009; **27**: 1051–1058.

[2] Vieira S. Biostatistics: advanced topics. 2nd ed. Rio de Janeiro: Elsevier; 2004. 216p.

[3] Centeno AJ. Statistical course applied to biology. 2nd ed. Goiânia: Editora da Universidade de Goiás; 2001. 235p.

[4] Kitchen CR. Nonparametric versus parametric tests of location in biomedical research. Am J Ophthalmol. 2009; **147**: 571–572. DOI: 10.1016/j.ajo.2008.06.

[5] Barros RAC, Prada ILS, Silva Z, Ribeiro AR, Silva DCO. Lumbar plexus formation of the *Cebus apella* monkey. Brazil J Veter Res Anim Sci. 2003; **40**: 373–381. DOI: 10.1590/S1413-95962003000500009

[6] Darwin CR. Mutual affinities of organic beings: morphology: embryology: rudimentary organs. In: The Origin of Species. 6th ed. London: John Murray; 1859. pp. 411–458.

[7] Aversi-Ferreira RAGMF, Nishijo H, Aversi-Ferreira TA. Reexamination of statistical methods for comparative anatomy: examples of its application and comparisons with other parametric and nonparametric statistics. BioMed Res Int. 2015; Article ID 902534. DOI: 10.1155/2015/902534

[8] Moore KL, Dalley AF, Agur AMR. Clinically oriented anatomy. 7th ed. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2014. 1173p.


**New Approaches for Teaching Statistics and Data Analysis**

#### **Descriptive and Inferential Statistics in Undergraduate Data Science Research Projects**

Malcolm J. D'Souza, Edward A. Brandenburg, Derald E. Wentzien, Riza C. Bautista, Agashi P. Nwogbaga, Rebecca G. Miller and Paul E. Olsen

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/65721

#### **Abstract**

Undergraduate data science research projects form an integral component of the Wesley College science and mathematics curriculum. In this chapter, we provide examples of hypothesis testing in which statistical methods or strategies are coupled with methodologies using interpolating polynomials, probability and the expected value concept in statistics. These are areas where real-world critical thinking and decision analysis applications pique a student's interest.

**Keywords:** Wesley College, STEM, undergraduate research, solvolysis, phenyl chloroformate, benzoyl chloride, benzoyl fluoride, benzoyl cyanide, Grunwald-Winstein equation, transition-state, addition-elimination, multiple regression, time-series, Ebola, polynomial functions, probability, expected value

## **1. Introduction**

Wesley College (Wesley) is a minority-serving, primarily undergraduate liberal-arts institution. Its STEM (science, technology, engineering and mathematics) fields contain a robust federal- and state-sponsored directed research program [1, 2]. In this program, students receive individual mentoring on diverse projects from a full-time STEM faculty member. In addition, undergraduate research is a capstone thesis requirement, and students complete research projects within experiential courses or for an annual Scholars' Day event.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Undergraduate research is indeed a hallmark of Wesley's progressive liberal-arts core curriculum. All incoming freshmen are immersed in research in a specially designed quantitative-reasoning 100-level mathematics core course, a first-year seminar course and a 100-level frontiers-in-science core course [1]. Projects in all level-1 STEM core courses provide an opportunity to develop a base knowledge for interacting with and manipulating data. These courses also introduce students to modern computing techniques and platforms.

At the other end of the Wesley core-curriculum spectrum, the advanced undergraduate STEM research requirements reflect the breadth and rigor necessary to prepare students for (possible) future postgraduate programs. For analyzing data in experiential research projects, descriptive and inferential statistics are major components. In informatics, students are trained in the SAS Institute's statistical analysis system (SAS) software and in the use of geographic information system (GIS) spatial tools through ESRI's ArcGIS platform [2].

To help students with poor mathematical ability and to further enhance their general thinking skills, in our remedial mathematics courses, we provide a foundation in algebraic concepts, problem-solving skills, basic quantitative reasoning and simple simulations. Our institution also provides a plethora of student academic support services that include an early alert system, peer and professionally trained tutoring services and writing center support. In addition, Wesley College non-STEM majors are required to take the project-based 100-level mathematics core course and can then opt to take two project-based 300-level SAS and GIS core courses. Such students who are trained in the concepts and applications of mathematical and statistical methods can then participate in Scholars' Day to augment their mathematical and critical thinking skills.

## **2. Linear free energy relationships to understand molecular pathways**

Single and multiparameter linear free energy relationships (LFERs) help chemists evaluate multiple kinds of transition-state molecular interactions observed in association with compound variability [3]. Chemical kinetics measurements are interpreted by correlating the experimental compound reaction rate (*k*) or equilibrium data with the corresponding thermodynamics. The computationally challenging stoichiometric analysis elucidates metabolic pathways by analyzing the effect of physiochemical, environmental and biological factors on the overall chemical network structure. All of these determinations are important in the design of chemical processes for petrochemical, pharmaceutical and agricultural building blocks.

In this section, through results obtained from our undergraduate directed research program in chemistry, we outline examples with statistical descriptors that use inferential correctness for testing hypotheses about regression coefficients in LFERs that are common to the study of solvent reactions. To understand mechanistic approaches, multiple regression correlation analyses using the one- and two-term Grunwald-Winstein equations (Eqs. (1) and (2)) are proven to be effective instruments that elucidate the transition-state in solvolytic reactions [3]. To avoid multicollinearity, it is stressed that the chosen solvents have widely varying ranges of nucleophilicity (*N*) and solvent-ionizing power (*Y*) values [3, 4]. In Eqs. (1) and (2) (for a particular substrate), *k* is the rate of reaction in a given solvent, *ko* is the 80% ethanol (EtOH) reaction rate, *l* is the sensitivity toward changes in *N, m* is the sensitivity toward changes in *Y* and *c* is a constant (residual) term. In substrates that have the potential for transition-state electron delocalization, Kevill and D'Souza introduced an additional *hI* term to Eqs. (1) and (2) (and as shown in Eqs. (3) and (4)). In Eqs. (3) and (4), *h* represents the sensitivity to changes in the aromatic ring parameter *I* [3].


$$\log\left(k/k\_o\right) = mY + c \tag{1}$$

$$\log\left(k/k\_o\right) = lN + mY + c \tag{2}$$

$$\log\left(k/k\_o\right) = mY + hI + c \tag{3}$$

$$\log\left(k/k\_o\right) = lN + mY + hI + c \tag{4}$$

Eqs. (1) and (3) are useful in substrates where the unimolecular dissociative transition-state (SN1 or E1) formation is rate-determining. Eqs. (2) and (4) are employed for reactions where there is evidence for bimolecular associative (SN2 or E2) mechanisms or addition-elimination (A-E) processes. In substrates undergoing similar mechanisms, the resultant *l*/*m* ratios obtained can be important indicators to compensate for earlier and later transition-states (TS). Furthermore, *l*/*m* ratios between 0.5 and 1.0 are indicative of unimolecular processes (SN1 or E1), values ≥ 2.0 are typical in bimolecular processes (SN2, E2, or A-E mechanisms) and values <<0.5 imply that ionization-fragmentation is occurring [3].

To study the (solvent) nucleophilic attack at a sp2 carbonyl carbon, we completed detailed Grunwald-Winstein (Eqs. (1), (2) and (4)) analyses for phenyl chloroformate (PhOCOCl) at 25.0°C in 49 solvents with widely varying *N* and *Y* values [3, 4]. Using Eq. (1), we obtained an *m* value of −0.07 ± 0.11, *c* = −0.46 ± 0.31, a very poor correlation coefficient (*R* = 0.093) and an extremely low *F*-test value of 0.4. An analysis of Eq. (2) resulted in a very robust correlation, with *R* = 0.980, *F*-test = 568, *l* = 1.66 ± 0.05, *m* = 0.56 ± 0.03 and *c* = 0.15 ± 0.07. Using Eq. (4), we obtained *l* = 1.77 ± 0.08, *m* = 0.61 ± 0.04, *h* = 0.35 ± 0.19 (*P*-value = 0.07), *c* = 0.16 ± 0.06, *R* = 0.982 and the *F*-test value was 400.
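As an illustration of how such regression statistics can be obtained, the sketch below fits Eq. (2) by ordinary multiple regression with statsmodels. The *N*, *Y* and log(*k*/*k*o) arrays are hypothetical placeholders, not the 49-solvent PhOCOCl data set, so the printed values only show the form of the output.

```python
# Sketch: fitting the two-term Grunwald-Winstein equation (Eq. (2)),
# log(k/k0) = l*N + m*Y + c, by ordinary multiple regression.
# The arrays below are hypothetical placeholders, not the 49-solvent PhOCOCl data.
import numpy as np
import statsmodels.api as sm

N = np.array([0.37, 0.16, -0.20, -1.98, 0.00, -0.39])    # solvent nucleophilicity (placeholder)
Y = np.array([-0.90, 0.67, 1.66, 2.84, 0.00, 2.07])      # solvent-ionizing power (placeholder)
log_k = np.array([0.10, 0.75, 0.80, -1.50, 0.00, 0.70])  # log(k/k0) (placeholder)

X = sm.add_constant(np.column_stack([N, Y]))             # columns: c, N, Y
fit = sm.OLS(log_k, X).fit()

c, l, m = fit.params
print(f"l = {l:.2f} ± {fit.bse[1]:.2f}, m = {m:.2f} ± {fit.bse[2]:.2f}, c = {c:.2f}")
print(f"R = {np.sqrt(fit.rsquared):.3f}, F = {fit.fvalue:.0f}, l/m = {l / m:.2f}")
```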

Since the use of Eq. (2) provided superior statistically significant results (*R*, *F*-test and *P*-values) for PhOCOCl, we strongly recommended that, in substrates where nucleophilic attack occurs at an sp2 hybridized carbonyl carbon, the PhOCOCl *l*/*m* ratio of 2.96 be used as a guiding indicator for the presence of an addition-elimination (A-E) process [3, 4]. Furthermore, for *n*-octyl fluoroformate (OctOCOF) and *n*-octyl chloroformate (OctOCOCl), we found that the leaving-group ratio (*k*F/*k*Cl) was close to, or above, unity. Fluorine is a very poor leaving group compared to chlorine; hence, for carbonyl-group-containing molecules, we proposed the existence of a bimolecular tetrahedral transition state (TS) with a rate-determining addition step within an A-E pathway (as opposed to a bimolecular concerted associative SN2 process with a penta-coordinate TS).

For chemoselectivity, the sp2 hybridized benzoyl groups (PhCO─) are found to be efficient and practical protecting agents that are utilized during the synthesis of nucleoside, nucleotide and oligonucleotide analogue derivative compounds. Yields for regio- and stereoselective reactions are shown to depend on the preference of the leaving group and commercially, benzoyl fluoride (PhCOF), benzoyl chloride (PhCOCl) and benzoyl cyanide (PhCOCN) are cheap and readily available.

We experimentally measured the solvolytic rates for PhCOF at 25.0°C [5]. In 37 solvent systems, a two-term Grunwald-Winstein (Eq. (2)) application resulted in an *l* value of 1.58 ± 0.09, an *m* value of 0.82 ± 0.05, a *c* value of −0.09, *R* = 0.953 and the *F*-test value was 186. The *l*/*m* ratio of 1.93 for PhCOF is close to the OctOCOF *l*/*m* ratio of 2.28 (in 28 pure and binary mixtures) indicating similar A-E transition states with rate-determining addition.

On the other hand, for PhCOCl at 25.0°C, we used the available literature data (47 solvents) from various international groups and proved the presence of simultaneous competing dual side-by-side mechanisms [6]. For 32 of the more ionizing solvents, we obtained *l* = 0.47 ± 0.03, *m* = 0.79 ± 0.02, *c* = −0.49 ± 0.17, *R* = 0.990 and *F*-test = 680. The *l*/*m* ratio is 0.59. Hence, we proposed an SN1 process with significant solvation (*l* component) of the developing aryl acylium ion. In 12 of the more nucleophilic solvents, we obtained *l* = 1.27 ± 0.29, *m* = 0.46 ± 0.07, *c* = 0.18 ± 0.23, *R* = 0.917 and *F*-test = 24. The *l*/*m* ratio of 2.76 is close to the 2.96 value obtained for PhOCOCl. This suggests that the A-E pathway is prevalent. In addition, there were three solvents where there was no clear demarcation of the changeover region.

At 25.0°C, in solvents that are common to PhCOCl and PhCOF, we observed *k*PhCOCl > *k*PhCOF. This rate trend is primarily due to more efficient PhCOF ground-state stabilization.

Lee and co-workers followed the kinetics of benzoyl cyanide (PhCOCN) at 1, 5, 10, 15 and 20°C in a variety of pure and mixed solvents and proposed the presence of an associative SN2 (pentacoordinate TS) process [7]. PhCOCN is an ecologically important chemical defensive secretion of polydesmoid millipedes, and cyanide is a synthetically useful, highly active leaving group. Since the leaving group is involved in the rate-determining step of any SN2 process, we became skeptical of the associative SN2 proposal and decided to reinvestigate the PhCOCN analysis. We hypothesized that, since PhCOCl showed mechanistic duality, analogous dual mechanisms should operate during PhCOCN solvolyses.

Using the Lee data within Arrhenius plots (Eq. (5)), we determined the PhCOCN solvolytic rates at 25.0°C (**Table 1**).

$$
\ln\left(k\right) = \frac{-Ea}{RT} + \ln\left(A\right) \tag{5}
$$



1 Calculated using four data points in an Arrhenius plot.


2 Calculated using three data points in an Arrhenius plot.

3 Calculated using three data points in an Arrhenius plot and are w/w compositions.

4 Determined using a second-degree polynomial equation.

5 Determined using a third-degree polynomial equation.

**Table 1.** The 25.0°C calculated rates for PhCOCN, the *N*T, *Y*Cl and *I* values.

We obtained the rates for PhCOCN in 39 pure and mixed aqueous organic solvents of ethanol (EtOH), methanol (MeOH), acetone (Me2CO), dioxane, 2,2,2-trifluoroethanol (TFE) and TFE-EtOH (T-E) mixtures. For all of the Arrhenius plots, the *R*<sup>2</sup> values ranged from 0.9937 to 1.0000, except in 60% Me2CO, where *R*<sup>2</sup> was 0.9861. The Arrhenius plot for 80% EtOH is shown in **Figure 1**. In order to utilize Eqs. (1)–(4) for all 39 solvents, second-degree or third-degree polynomial equations were used to calculate the missing *N*T, *Y*Cl and *I* values. The calculated 25°C PhCOCN reaction rates and the literature-available or interpolated *N*T, *Y*Cl and *I* values are listed in **Table 1**.
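A sketch of the Arrhenius extrapolation step (Eq. (5)) for a single solvent composition is shown below. The temperatures follow the Lee study (1–20°C), but the rate constants are hypothetical placeholders rather than values from Table 1.

```python
# Sketch: Arrhenius extrapolation (Eq. (5)), ln(k) = -Ea/(R*T) + ln(A), fitted
# as a straight line in 1/T and evaluated at 25.0 C (298.15 K).
# Temperatures follow the Lee study (1-20 C); the rate constants are hypothetical.
import numpy as np

R = 8.314                                                # J mol^-1 K^-1
T = np.array([274.15, 278.15, 283.15, 288.15, 293.15])   # 1, 5, 10, 15, 20 C in K
k = np.array([1.2e-4, 2.0e-4, 3.4e-4, 5.5e-4, 8.8e-4])   # s^-1 (hypothetical)

slope, intercept = np.polyfit(1.0 / T, np.log(k), 1)     # ln k = slope/T + intercept
Ea = -slope * R                                          # activation energy, J mol^-1
k_25 = np.exp(intercept + slope / 298.15)                # extrapolated 25.0 C rate

r2 = np.corrcoef(1.0 / T, np.log(k))[0, 1] ** 2
print(f"Ea = {Ea / 1000:.1f} kJ/mol, k(25.0 C) = {k_25:.2e} s^-1, R^2 = {r2:.4f}")
```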

Using Eq. (2) for 32 of the PhCOCN solvents in **Table 1** (20–90% EtOH, 30–90% MeOH, 20–80% Me2CO, 10–30% dioxane, 10T–90E, 20T–80E, 30T–70E, 40T–60E, 50T–50E and 70T–30E), we obtained *R* = 0.988, *F*-test = 595, *l* = 1.54 ± 0.11, *m* = 0.74 ± 0.03 and *c* = 0.13 ± 0.04. Using Eq. (4), we obtained *R* = 0.989, *F*-test = 432, *l* = 1.62 ± 0.11, *m* = 0.78 ± 0.03, *h* = 0.22 ± 0.11 (*P*-value = 0.07) and *c* = 0.13 ± 0.04.

**Figure 1.** Arrhenius plot for 80% EtOH.

The *l*/*m* ratio of 2.08 obtained (for PhCOCN) using Eq. (2) is close to that obtained (1.93) for PhCOF and hence we propose a parallel A-E mechanism.

For the seven highly ionizing aqueous TFE mixtures, using Eq. (1) we obtained, *R* = 0.977, *F*-test = 105, *m* = 0.61 ± 0.06 and *c* = −1.15 ± 0.20. Using Eq. (2) we obtained *R* = 0.999, *F*-test = 763, *l* = 0.25 ± 0.031, *m* = 0.42 ± 0.03 and *c* = −0.13 ± 0.14. Using Eqs. (3) and (4) we obtained *R* = 0.998, *F*-test = 417, *m* = −0.65 ± 0.22 (*P*-value = 0.04), *h* = −2.83 ± 0.491 (*P*-value = 0.01) and *c* = 3.12 ± 0.73 (*P*-value = 0.01) and *R* = 0.989, *F*-test = 572, *l* = 0.17 ± 0.07 (*P*-value = 0.11), *m* = 0.02 ± 0.33 (*P*-value = 0.96), *h* = −1.04 ± 0.86 (*P*-value = 0.31), *c* = 1.10 ± 1.02 (*P*-value = 0.36), respectively.

In the very polar TFE mixtures, in Eq. (2) the *l*/*m* ratio was 0.60, indicating a dissociative SN1 process. The *l* value of 0.25 is consistent with the need of small preferential solvation to stabilize the developing SN1 carbocation and the lower *m* value (0.42) attained can be rationalized in terms of less demand for solvation of the cyanide anion (leaving group).

In all of the common solvents at 25.0°C, *k*PhCOCl > *k*PhCOCN > *k*PhCOF. In addition, PhCOCN was found to be faster than PhCOF by a factor of 18–71 times in the aqueous ethanol, methanol, acetone and dioxane mixtures and 185–1100 times faster in the TFE-EtOH and TFE-H2O mixtures. These observations are very reasonable as the cyanide group is shown to have a greater inductive effect and in addition, the cyanide anion is a weak conjugate base. This rationalization is logical as (*l*/*m*)PhCOCN > (*l*/*m*)PhCOF.

## **3. Estimating missing values from a time series data set**


Complete historical time series are needed to create effective mathematical models. Unfortunately, the systems that track and record data values periodically malfunction, thereby creating missing and/or inaccurate values in the time series. If a reasonable estimate for the missing value can be determined, the data series can then be used for further analysis.

In this section, we present a methodology to generate a reasonable estimate for a missing or inaccurate value when two important conditions exist: (1) a similar data series with complete information is available and (2) a pattern (or trend) is observable.

The extent of the ice at the northern polar ice cap in square kilometers is tracked on a daily basis and this data is made available to researchers by the National Snow & Ice Data Center (NSIDC). A review of the NASA Distributed Active Archive Center (DAAC) data at NSIDC indicates that the extent of the northern polar ice cap follows a cyclical pattern throughout the year. The extent increases until it reaches a maximum for the year in mid-March and decreases until it reaches a minimum for the year in mid-September. Unfortunately, the data set contains missing data for some of the days.

The extent of the northern polar ice cap in the month of January for 2011, 2012 and 2013 is utilized as an example. Complete daily data for January in 2011 and 2012 is available. The 2013 January data has a missing data value for January 25, 2013.

**Figure 2** presents the line graph of the daily ice extent for January of 2011, 2012 and 2013. A complete time series is available for 2011 and 2012, so the first condition is met. The line graphs also indicate that the extent of the polar ice caps is increasing in January, so the second condition is met. An interpolating polynomial will be introduced and used to estimate the missing value for the extent of the polar ice cap on January 25, 2013.

Let *t* = the time period or observation number in a time series.

Let *f*(*t)* = the extent of the sea ice for time period *t*.

The extent of the sea ice can be written as a function of time.

For a polynomial of degree 1, the function will be: *f*(*t*) = *a*<sub>0</sub> + *a*<sub>1</sub>*t*

For a polynomial of degree 3, the function will be: *f*(*t*) = *a*<sub>0</sub> + *a*<sub>1</sub>*t* + *a*<sub>2</sub>*t*<sup>2</sup> + *a*<sub>3</sub>*t*<sup>3</sup>

Polynomials of higher degrees could also be used. The extent of the polar ice for January 25 will be removed from the data series for 2011 and 2012 and an estimate will be prepared using polynomials of degree 1. Another estimate is prepared using polynomials of degree 3. The estimated value will be compared to the actual value for the years 2011 and 2012. The degree of the polynomial that generates the best (closest) estimate for January 25 will be the degree of the polynomial used to generate the estimate for January 25, 2013.

**Figure 2.** The extent of sea ice in January 2011, 2012 and 2013.

A two-equation, two-unknown system of equations is created when using polynomials of degree 1. One known value before and after the missing value for each year is used to set up the system of equations. To simplify the calculations, January 24 is recorded as time period 1, January 25 is recorded as time period 2 and January 26 is recorded as time period 3. The time period and the extent of the sea ice (in km<sup>2</sup>) for each year were recorded in Excel:

| Time period | 2011 | 2012 | 2013 |
|---|---|---|---|
| 1 | 12,878,750 | 13,110,000 | 13,077,813 |
| 2 | 12,916,563 | 13,123,125 | missing |
| 3 | 12,996,875 | 13,204,219 | 13,404,688 |


The system of equations using a first-order polynomial for January 2011 is:

$$\begin{aligned} a\_0 + a\_1(1) &= 12,878,750\\ a\_0 + a\_1(3) &= 12,996,875 \end{aligned} \tag{6}$$

The coefficients can be found by solving the system of equations. Substitution, elimination, or matrices can be used to solve the system of equations. A TI-84 graphing calculator and matrices were used to solve this system.
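The same 2 × 2 system can also be solved in a few lines of Python; the sketch below (using NumPy, an assumption of this illustration rather than a tool named in the text) reproduces the January 25, 2011 estimate derived below.

```python
# Sketch: solving the first-order system (Eq. (6)) for January 2011 with NumPy
# and evaluating the interpolating line at time period 2 (January 25, 2011).
import numpy as np

A = np.array([[1.0, 1.0],        # a0 + a1*(1) = extent at period 1 (January 24)
              [1.0, 3.0]])       # a0 + a1*(3) = extent at period 3 (January 26)
b = np.array([12_878_750.0, 12_996_875.0])

a0, a1 = np.linalg.solve(A, b)
print(a0, a1)                    # 12819687.5  59062.5
print(a0 + a1 * 2)               # 12937812.5  (estimate for January 25, 2011)
```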

The solution to this system of equations is: *a*<sub>0</sub> = 12,819,687.5, *a*<sub>1</sub> = 59,062.5.

The estimate for January 25, 2011 is: 12,819,687.5 + 59,062.5(2) = 12,937,812.5 km<sup>2</sup>. The system of equations using a first-order polynomial for 2012 is:

$$\begin{aligned} a\_0 + a\_1(1) &= 13,110,000\\ a\_0 + a\_1(3) &= 13,204,219 \end{aligned} \tag{7}$$

The solution to this system of equations is: *a*<sub>0</sub> = 13,062,890.5, *a*<sub>1</sub> = 47,109.5.

The estimate for January 25, 2012 is: 13,062,890.5 + 47,109.5(2) = 13,157,109.5 km<sup>2</sup>.

The absolute values of the deviations (actual and estimated values) were calculated in Excel.


A four-equation, four-unknown system of equations is created when using polynomials of degree 3. Two known values before and after the missing value are used to set up the system of equations. To simplify the calculations, January 23 is recorded as time period 1, January 24 is recorded as time period 2, January 25 is recorded as time period 3, January 26 is recorded as time period 4 and January 27 is recorded as time period 5. The time period and extent of the sea ice for each year was recorded in Excel.


The system of equations using a third-order polynomial for 2011 is:

$$\begin{aligned} a\_0 + a\_1(1) + a\_2(1)^2 + a\_3(1)^3 &= 12,848,281\\ a\_0 + a\_1(2) + a\_2(2)^2 + a\_3(2)^3 &= 12,878,750\\ a\_0 + a\_1(4) + a\_2(4)^2 + a\_3(4)^3 &= 12,996,875\\ a\_0 + a\_1(5) + a\_2(5)^2 + a\_3(5)^3 &= 13,090,625 \end{aligned} \tag{8}$$

The solution to this system of equations is: *a*<sub>0</sub> = 12,832,811.67, *a*<sub>1</sub> = 8,985.17, *a*<sub>2</sub> = 5,976.33, *a*<sub>3</sub> = 507.83.

The estimate for January 25, 2011 is: 12,832,811.67 + 8,985.17(3) + 5,976.33(3)<sup>2</sup> + 507.33(3)<sup>3</sup> = 12,927,252.1 km<sup>2</sup>.

The system of equations using a third-order polynomial for 2012 is:

$$\begin{aligned} a\_0 + a\_1(1) + a\_2(1)^2 + a\_3(1)^3 &= 13,199,375\\ a\_0 + a\_1(2) + a\_2(2)^2 + a\_3(2)^3 &= 13,110,000\\ a\_0 + a\_1(4) + a\_2(4)^2 + a\_3(4)^3 &= 13,204,219\\ a\_0 + a\_1(5) + a\_2(5)^2 + a\_3(5)^3 &= 13,227,344 \end{aligned} \tag{9}$$

The solution to this system of equations is: *a*<sub>0</sub> = 13,486,719, *a*<sub>1</sub> = −413,073.33, *a*<sub>2</sub> = 139,101.75, *a*<sub>3</sub> = −13,372.42.

The estimate for January 25, 2012 is: 13,486,719 − 413,073.33(3) + 139,101.75(3)<sup>2</sup> − 13,372.42(3)<sup>3</sup> = 13,138,359.42 km<sup>2</sup>.

The absolute values of the deviations (actual and estimated values) were calculated in Excel.

| Degree | Year | Actual | Estimated | Absolute deviation |
|---|---|---|---|---|
| 3 | 2011 | 12,916,563 | 12,927,252.1 | 10,689.1 |
| 3 | 2012 | 13,123,125 | 13,138,359.4 | 15,234.4 |


The mean of the absolute deviations for polynomials of degree 1 and the mean of the absolute deviations for polynomials of degree 3 were calculated in Excel. The polynomial of degree 3 provided the smallest mean absolute deviation.


Therefore, a third-order polynomial will be used to generate an estimate for the sea ice extent on January 25, 2013.

The system of equations using a third-order polynomial for 2013 is:

$$\begin{aligned} a\_0 + a\_1(1) + a\_2(1)^2 + a\_3(1)^3 &= 13,168,594\\ a\_0 + a\_1(2) + a\_2(2)^2 + a\_3(2)^3 &= 13,077,813\\ a\_0 + a\_1(4) + a\_2(4)^2 + a\_3(4)^3 &= 13,404,688\\ a\_0 + a\_1(5) + a\_2(5)^2 + a\_3(5)^3 &= 13,388,750 \end{aligned} \tag{10}$$

The solution to this system of equations is: *a*<sub>0</sub> = 13,717,916.67, *a*<sub>1</sub> = −850,859.17, *a*<sub>2</sub> = 337,669.33, *a*<sub>3</sub> = −36,132.83.

The estimate for January 25, 2013 is: 13,717,916.67 − 850,859.17(3) + 337,669.33(3)<sup>2</sup> − 36,132.83(3)<sup>3</sup> = 13,228,776.72 km<sup>2</sup>. **Figure 3** shows the extent of the sea ice in January 2013 with the estimate for January 25.
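As a cross-check, the 4 × 4 system in Eq. (10) can be solved numerically; a sketch with NumPy is shown below, and with unrounded coefficients the cubic evaluated at time period 3 gives essentially the same January 25, 2013 estimate.

```python
# Sketch: solving the third-order system (Eq. (10)) with NumPy and evaluating
# the cubic at time period 3 (January 25, 2013).
import numpy as np

periods = np.array([1.0, 2.0, 4.0, 5.0])    # January 23, 24, 26, 27
extent = np.array([13_168_594.0, 13_077_813.0, 13_404_688.0, 13_388_750.0])

A = np.vander(periods, 4, increasing=True)  # columns: 1, t, t^2, t^3
a = np.linalg.solve(A, extent)              # a0, a1, a2, a3

print(a)                                    # ~ [13717916.67, -850859.17, 337669.33, -36132.83]
print(np.polyval(a[::-1], 3.0))             # ~ 13228776.7 km^2
```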

**Figure 3.** The extent of sea ice in January 2013 with the January 25, 2013 estimate.

## **4. Statistical methodologies and applications in the Ebola war**

In 2014, an unprecedented outbreak of Ebola occurred, predominantly in West Africa. According to the Centers for Disease Control and Prevention (CDC), over 28.5 thousand cases were reported, resulting in more than 11,000 deaths [8]. The countries affected by the Ebola outbreak were Senegal, Guinea, Nigeria, Mali, Sierra Leone, Liberia, Spain and the United States of America (USA). Statistics, through dynamic modeling, played a crucial role in clinical data collection and management. The lessons learned and the resultant statistical advances continue to inform and drive current and subsequent pandemic analyses.

For this honors thesis project, we tracked and gathered Ebola data over an extended period of time from the CDC, World Health Organization (WHO) and the news media [8, 9]. We used statistical curve fitting that involved both exponential and polynomial functions as well as model validation using nonlinear regression and *R*<sup>2</sup> statistical analysis.

The first WHO report (initial announcement) of the West Africa Ebola outbreak was made during the week of March 23rd, 2014. Consequently, the data for this project run from that week to October 31, 2014. The 2014 Ebola data was used to create epidemiological models to predict the possible pathway of a 2014 West Africa type of Ebola outbreak. The WHO numbers of Ebola cases and deaths as of October 31st, 2014 were Liberia (6635 cases with 2413 deaths), Sierra Leone (5338 cases with 1510 deaths), Guinea (1667 cases with 1018 deaths), Nigeria (20 cases with eight deaths), the United States (four cases with one death), Mali (one case with one death) and Spain (one case with zero deaths).

Microsoft Excel was used for the modeling of the three examples shown, which were predicated upon the following assumptions: (1) Week 1 is the week of March 23rd, 2014; (2) X is the number of weeks starting from Week 1 and Y is the number of Ebola deaths; (3) there was no vaccine/cure; and (4) the missing data for the 24th week was obtained by interpolation.

## **4.1. Modeling of weekly Guinea Ebola deaths**

The dotted curve in **Figure 4** shows the actual observed deaths, while the solid line shows the number of deaths as determined by the fitted model. As shown in **Figure 4**, the growth of the Guinea deaths is exponential. The best-fit curve for the projected growth is *y* = 72.827e<sup>0.0823*x*</sup>. A comparison of the actual data to the projected data shows that the two are similar but not exact (**Table 2**). The projected number of deaths is approximately 1300 by week 35 (or the week of November 23, 2014).
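A sketch of how such an exponential trend can be fitted, by regressing the logarithm of the weekly deaths on the week number, is shown below; the data arrays are placeholders, not the Guinea series, so the printed coefficients differ from those in the text.

```python
# Sketch: fitting an exponential trend y = a * exp(b * x) by regressing ln(y)
# on the week number x.  The arrays below are placeholders, not the Guinea series.
import numpy as np

weeks = np.arange(1, 11)
deaths = np.array([66, 59, 95, 108, 120, 136, 145, 174, 193, 210])  # placeholder counts

b, ln_a = np.polyfit(weeks, np.log(deaths), 1)
a = np.exp(ln_a)
print(f"y = {a:.3f} * exp({b:.4f} x)")              # cf. y = 72.827 e^(0.0823x) in the text
print(f"week-35 projection: {a * np.exp(b * 35):.0f} deaths")
```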

## **4.2. Modeling of Liberia Ebola deaths (weekly)**

Unlike the Guinea deaths, the Liberian deaths are modeled using a polynomial function (**Figure 5**).

**Figure 4.** Weekly deaths in Guinea.



**Table 2.** Actual and projected Ebola deaths in Guinea.

The best-fit curve is defined by the polynomial equation *y* = 0.0003*x*<sup>5</sup> − 0.0069*x*<sup>4</sup> + 0.0347*x*<sup>3</sup> + 0.5074*x*<sup>2</sup> − 4.1442*x* + 10.487. The model is not exact, but it is close enough to predict that by week 35 there would be over 7000 deaths in Liberia (**Table 3**).

**Figure 5.** Weekly deaths in Liberia.


| Week | Deaths | Model | Week | Deaths | Model | Week | Deaths | Model |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 7 | 13 | 24 | 33 | 25 | 871 | 1001 |
| 2 | 0 | 4 | 14 | 25 | 43 | 26 | 670 | 1267 |
| 3 | 10 | 3 | 15 | 65 | 58 | 27 | 1830 | 1589 |
| 4 | 13 | 3 | 16 | 84 | 79 | 28 | 2069 | 1976 |
| 5 | 6 | 3 | 17 | 105 | 107 | 29 | 2484 | 2436 |
| 6 | 6 | 5 | 18 | 127 | 145 | 30 | 2705 | 2981 |
| 7 | 11 | 7 | 19 | 156 | 197 | 31 | XXX | 3620 |
| 8 | 11 | 9 | 20 | 282 | 264 | 32 | XXX | 4366 |
| 9 | 11 | 12 | 21 | 355 | 352 | 33 | XXX | 5231 |
| 10 | 11 | 15 | 22 | 576 | 464 | 34 | XXX | 6230 |
| 11 | 11 | 20 | 23 | 624 | 606 | 35 | XXX | 7377 |
| 12 | 11 | 25 | 24 | 748 | 783 | 36 | XXX | XXX |

**Table 3.** Actual and projected Ebola deaths in Liberia.

#### **4.3. Modeling of total deaths (World)**

When analyzing the total Ebola deaths (for 35 weeks), the data was best modeled using the polynomial function *y* = 0.033*x*<sup>4</sup> − 1.4617*x*<sup>3</sup> + 23.437*x*<sup>2</sup> − 118.18*x* + 231.59 (**Figure 6**). An exponential function was not suitable because the actual growth was not (initially) fast enough to match exponential growth. As shown in **Table 4**, the projected total deaths according to this model would be greater than 11,000 by week 35.
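The week-35 projection can be checked by evaluating the reported fourth-degree polynomial directly, for example:

```python
# Sketch: evaluating the reported world-wide polynomial at week 35
# (coefficients listed from x^4 down to the constant term).
import numpy as np

coeffs = [0.033, -1.4617, 23.437, -118.18, 231.59]
print(np.polyval(coeffs, 35))   # ~ 11656, i.e. greater than 11,000 deaths
```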

**Figure 6.** Weekly world-wide deaths.



**Table 4.** Actual and projected worldwide deaths.

#### **4.4. Nonlinear regression and** *R***-squared analysis**

A visual inspection of the graphs and tables shows that the model for Liberia, as well as the model for the world-wide total deaths, fits the data more closely than does the Guinea model. Hence, other statistical goodness-of-fit measures are used to confirm these observations. Here, nonlinear polynomial regression (Eq. (11)) and *R*<sup>2</sup> statistical analysis are employed. In Eq. (11), Σ signifies summation, *w* refers to the actual (observed) number of Ebola deaths, *z* is the number of Ebola deaths as calculated with the model and *n* is the total number of weeks.

$$R^2 = \frac{2\left(\sum wz\right) + n\left(\overline{w}\right)^2 - \sum \left(z^2\right) - 2\overline{w}\left(\sum w\right)}{\sum \left(w^2\right) + n\left(\overline{w}\right)^2 - 2\overline{w}\left(\sum w\right)} \tag{11}$$
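A sketch implementing Eq. (11) directly is given below; with *w* the observed and *z* the modelled deaths, the expression reduces to the familiar 1 − SS<sub>res</sub>/SS<sub>tot</sub>, which the second print statement verifies on placeholder data.

```python
# Sketch: R^2 exactly as written in Eq. (11), with w = observed deaths and
# z = modelled deaths; the second print shows it equals 1 - SS_res/SS_tot.
import numpy as np

def r_squared(w, z):
    w, z = np.asarray(w, float), np.asarray(z, float)
    n, w_bar = len(w), w.mean()
    num = 2 * np.sum(w * z) + n * w_bar**2 - np.sum(z**2) - 2 * w_bar * np.sum(w)
    den = np.sum(w**2) + n * w_bar**2 - 2 * w_bar * np.sum(w)
    return num / den

w = np.array([3.0, 7.0, 12.0, 20.0, 33.0, 51.0])   # placeholder observed counts
z = np.array([4.0, 6.0, 13.0, 19.0, 35.0, 49.0])   # placeholder modelled counts

print(r_squared(w, z))
print(1 - np.sum((w - z) ** 2) / np.sum((w - w.mean()) ** 2))
```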

For the Guinea epidemiological Ebola model, the nonlinear regression equation is *y* = 72.827e<sup>0.0823*x*</sup> with *R*<sup>2</sup> = 0.9077, indicating that about 91% of the total variation in *y* (the number of actual Ebola deaths) can be explained by the regression equation. The polynomial epidemiological model for Ebola deaths in Liberia, *y* = 0.0003*x*<sup>5</sup> − 0.0069*x*<sup>4</sup> + 0.0347*x*<sup>3</sup> + 0.5074*x*<sup>2</sup> − 4.1442*x* + 10.487, has *R*<sup>2</sup> = 0.9715, so that about 97% of the total variation in *y* (the number of observed Ebola deaths) can be explained by the regression equation. For the third, worldwide model, the polynomial for the total Ebola deaths for all countries combined is expectedly better. Here, *R*<sup>2</sup> is 0.9823, so that about 98% of the total variation in the number of actual Ebola deaths can be explained by the regression equation, *y* = 0.033*x*<sup>4</sup> − 1.4617*x*<sup>3</sup> + 23.437*x*<sup>2</sup> − 118.18*x* + 231.59.

This shows that recording good, organized and easily retrievable data is paramount in the fight against pandemics. The statistical models developed, in turn, can continue to inform and drive current and subsequent pandemic analyses.

## **5. Probability and expected value in statistics**

At Wesley College, probability and expected value in statistics are introduced in two freshman-level mathematics classes: the quantitative reasoning math-core course and a first-year seminar, *Mathematics in Gambling*.

In general, there are two practical approaches to assigning a probability value to an event:

**a.** The classical approach and

**b.** The relative frequency/empirical approach


The **classical approach** to assigning a probability assumes that all outcomes to a probability experiment are equally likely. In the case of a roulette wheel at a casino, the little rolling ball is equally likely to land in any of the 38 compartments of the roulette wheel. In general, the rule for the probability of an event according to the classical approach is:

$$P\text{ (event }A\text{) } = \frac{number\ of\ ways\ event\ A\ can\ occur}{total\ number\ of\ ways\ anything\ can\ occur} \tag{12}$$

In the case of roulette, since there are 18 red, 18 black and 2 green compartments, the probability that a gambler wins by placing a bet on the color red is 18/38 = 9/19, or approximately 0.474.

Unfortunately, the classical approach to probability is not always applicable. In the insurance industry, actuaries are interested in the likelihood of a policyholder dying. Since the two events of a policyholder living or dying are not equally likely, the classical approach cannot be used.

Instead, the **relative frequency approach** is used, which is:


$$P\ \left(\text{event } B\right) \ = \frac{number\ of\ times\ event\ B\ has\ happened\ in\ the\ past\ n\ trials}{number\ of\ trials,\ n} \tag{13}$$

When setting life insurance rates for policyholders, life insurance companies must consider variables such as age, sex, and smoking status (among others). Suppose recent mortality data indicate that, out of 900,000 65-year-old non-smoking males, 1,800 died last year. Based on these data and the relative frequency approach, the probability that a 65-year-old non-smoking male will die in the next year is:

*P* (65-year-old non-smoking male dies) = 1,800/900,000 ≈ 0.002, or 0.2%.
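
A correspondingly minimal sketch of the relative frequency rule in Eq. (13), using the mortality figures quoted above:

```python
# Relative frequency approach: P(event) = occurrences in the past n trials / n.
def relative_frequency(occurrences, trials):
    return occurrences / trials

print(relative_frequency(1_800, 900_000))   # 0.002, i.e. 0.2%
```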

The field of decision analysis often employs the concept of **expected value**. Take the case of a 65-year-old non-smoking male buying a \$250,000 term life insurance policy. Is the policy worth buying? Based on the concept of expected value, a calculation based on probability is made and interpreted. If the value turns out to be negative, students then have to explain the rationale for purchasing the term life insurance policy anyway.

For a casino installing a roulette wheel or a craps table, will the table game be a money maker for the casino? In the *Mathematics of Gambling* first-year seminar course, students research the rules for the game of roulette and the payoffs for various bets. Based on their findings, they determine the "house edge" for various bets. They also compare various bets in different games of chance to analyze which is a "better bet" and in which game.

Assume a situation has various outcomes/states of nature which occur randomly and are unknown when a decision is to be made. In the case of a person considering a life-insurance policy, the person will either live (L) or die (D) during the next year. Assuming the person has no adverse medical condition, the person's state of nature is unknown when he has to make the decision to buy the term life insurance (the two outcomes will occur in no predictable manner and are considered random). If each monetary outcome (denoted *O<sub>i</sub>*) has a probability (denoted *p<sub>i</sub>*), then the **expected value** can be computed by the formula:

$$\text{Expected Value} = O_1 \cdot p_1 + O_2 \cdot p_2 + O_3 \cdot p_3 + \cdots + O_n \cdot p_n = \sum_{i=1}^{n} O_i \cdot p_i \tag{14}$$

where there are *n* possible outcomes.

In other words, it is the sum of each monetary outcome times its corresponding probability.
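
Eq. (14) amounts to a probability-weighted sum, which the following sketch implements; the \$1 even-money bet on red in American roulette is used as an illustration (18 winning and 20 losing compartments):

```python
# Expected value (Eq. 14): sum of each monetary outcome times its probability.
def expected_value(outcomes, probabilities):
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(o * p for o, p in zip(outcomes, probabilities))

# $1 bet on red: win $1 with probability 18/38, lose $1 with probability 20/38.
print(expected_value([1, -1], [18/38, 20/38]))   # about -0.0526 per dollar bet
```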

*Example 1: A freshman-level quantitative reasoning mathematics-core class*

Assume a 67-year-old non-smoking male is charged \$1,180 for a one-year \$250,000 term life-insurance policy. Assume actuarial tables show the probability of death for such a person to be 0.003. What is the expected value of this life-insurance policy to the buyer?

A payoff table can be constructed showing the outcomes, probabilities and "net" payoffs:

| | Person dies | Person lives |
|---|---|---|
| Probability | 0.003 | 1 − 0.003 = 0.997 |
| Net payoff | \$250,000 − \$1,180 = \$248,820 | −\$1,180 |

The payoff in the case of the person living is negative since the money is spent with no return on the investment. Using these data, the expected value is calculated as

$$\text{Expected Value} = \$248{,}820 \cdot 0.003 + (-\$1{,}180) \cdot 0.997 = -\$430 \tag{15}$$

The negative sign in the expected value means the consumer should expect to lose money (while the insurance company can expect to make money). Students are asked to interpret the expected value and to explain why people nevertheless purchase term life insurance despite, in effect, "throwing their money away." What will they do when it comes time to consider term life insurance?
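
The arithmetic behind Eq. (15) can be checked directly, for example:

```python
# Verifying Example 1's expected value (Eq. 15).
p_die, p_live = 0.003, 0.997
payoff_die = 250_000 - 1_180    # $248,820 net payout if the policyholder dies
payoff_live = -1_180            # the premium is lost if he lives
ev = payoff_die * p_die + payoff_live * p_live
print(round(ev, 2))             # -430.0, an expected loss of $430
```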

*Example 2: Mathematics of Gambling class*

Students are asked to research the rules of various games of chance and the meaning of various payoffs (for example, 35 to 1 versus 35 for 1), and then to calculate and interpret the **house edge** in gambling. This is defined by the formula

$$\text{House Edge} = \frac{\text{Expected Value of the Bet}}{\text{Size of the Bet}} \tag{16}$$

Since the expected value of a casino bet is negative, the house edge is typically reported as the magnitude of this ratio. By asking different students to evaluate the house edge of different gambling bets, students can analyze and decide which bet is safest if they do choose to gamble.

For instance, consider the following two bets. Which has the lower house edge, and why?

Bet #1 – Placing a \$10 bet in American roulette on the "row" 25–27.

Bet #2 – Placing a \$5 bet in Craps on rolling the sum of 11.

Students must research each game of chance and determine the important information to use, which is recorded as follows:

| | Bet #1: roulette row 25–27 (\$10) | Bet #2: craps sum of 11 (\$5) |
|---|---|---|
| Probability of winning | 3/38 | 2/36 |
| Payoff odds | 11 to 1 | 15 to 1 |
| Expected value of the bet | −\$0.526 | −\$0.556 |
| House edge | 5.26% | 11.11% |

The roulette bet has the lower house edge and is financially safer in the long run for the gambler. Students are then asked to compute the house edge using the shortcut method based on the theory of odds. The house edge is the difference between the true odds (denoted *a:b*) and the payoff odds the casino pays, expressed as a fraction of the total of the true odds (*a* + *b*).
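
The figures in the table can be reproduced from the expected value, as sketched below. The 11-to-1 payoff assumed for a three-number row bet is standard but is not restated in the chapter; the 15-to-1 payoff for the craps bet follows from the students' research noted below. Because the expected value of a casino bet is negative, the house edge is reported here as the magnitude of the expected loss per dollar wagered.

```python
# House edge from the expected value of a bet (Eq. 16), reported as the
# casino's take per dollar wagered.
def house_edge(win_amounts, probabilities, bet_size):
    ev = sum(w * p for w, p in zip(win_amounts, probabilities))
    return -ev / bet_size   # magnitude of the expected loss per dollar bet

# Bet #1: $10 on the row 25-27 in American roulette, paid 11 to 1
# (3 winning numbers out of 38).
print(house_edge([110, -10], [3/38, 35/38], 10))   # about 0.0526

# Bet #2: $5 on a sum of 11 in craps, paid 15 to 1 (2 winning rolls out of 36).
print(house_edge([75, -5], [2/36, 34/36], 5))      # about 0.1111
```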

In the example involving craps, the true odds against a sum of 11 are 34:2, which reduces to 17:1. The difference between the true odds and the payoff odds (15 to 1; see Example 2) is 17 − 15 = 2. Expressing this difference as a fraction of (*a* + *b*), the house edge is calculated as 2 ÷ (17 + 1) = 2 ÷ 18 = 1/9 ≈ 0.1111, which is the same answer found using the expected value.
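
The shortcut can also be written generically. For true odds of *a*:*b* against the event and a payoff of *r* to 1, the rule above amounts to (*a* − *rb*)/(*a* + *b*); this generalized form is an inference from the worked example rather than a formula stated in the chapter.

```python
# Odds-based shortcut: true odds a:b against the event, casino pays `payoff` to 1.
from fractions import Fraction

def house_edge_from_odds(a, b, payoff):
    return Fraction(a - payoff * b, a + b)

# Craps bet on a sum of 11: true odds 34:2 against (i.e. 17:1), paid 15 to 1.
edge = house_edge_from_odds(34, 2, 15)
print(edge, float(edge))   # 1/9, approximately 0.1111
```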

Due to the concept of the house edge, casinos know that in the long run, every time a bet is made in roulette, the house averages a profit of \$0.0526 for each dollar bet. Yes, gamblers do win at the roulette table and large amounts of money are paid out. But in the long run, the game is a money maker for the casino.

## **Acknowledgements**

This work was made possible by grants from the National Institute of General Medical Sciences —NIGMS (P20GM103446) from the National Institutes of Health (DE-INBRE IDeA program), a National Science Foundation (NSF) EPSCoR grant IIA-1301765 (DE-EPSCoR program) and the Delaware (DE) Economic Development Office (DEDO program). The undergraduates acknowledge tuition scholarship support from Wesley's NSF S-STEM Cannon Scholar Program (NSF DUE 1355554) and RB acknowledges further support from the NASA DE-Space Grant Consortium (DESGC) program (NASA NNX15AI19H). The DE-INBRE, the DE-EPSCoR and the DESGC grants were obtained through the leadership of the University of Delaware and the authors sincerely appreciate their efforts.

## **Author contributions**

Drs. D'Souza, Wentzien and Nwogbaga served as undergraduate research mentors to Brandenburg, Bautista and Miller, respectively. Professor Olsen has developed and taught the probability and expected value examples in his freshman-level mathematics core courses. The findings and conclusions drawn within the chapter in no way reflect the interpretations and/or views of any other federal or state agency.

## **Conflicts of interest**

The authors declare no conflict of interest.

## **Author details**

Malcolm J. D'Souza<sup>1</sup>\*, Edward A. Brandenburg<sup>1</sup>, Derald E. Wentzien<sup>2</sup>, Riza C. Bautista<sup>2</sup>, Agashi P. Nwogbaga<sup>2</sup>, Rebecca G. Miller<sup>2</sup> and Paul E. Olsen<sup>2</sup>

\*Address all correspondence to: malcolm.dsouza@wesley.edu

1 Department of Chemistry, Wesley College, Dover, Delaware, USA

2 Department of Mathematics and Data Science, Wesley College, Dover, Delaware, USA

