**2. Spatial statistical models**

Spatial models have a simple structure that is flexible enough to handle a variety of problems. The data may be continuous or discrete, and may be spatial aggregations or point observations in space. The spatial locations may be regular or irregular. A spatial model is usually used to predict values at sites where the phenomenon under study was not observed.

Let $x \in A \subset \mathbb{R}^d$ and let $S(x)$ be the datum observed at location $x$; this gives rise to the stochastic process

$$\{S(x) : x \in A\} \tag{1}$$

Structure (1) makes it possible to distinguish between problems with a continuous spatial index, lattices, and point patterns, giving rise to three types of data: geostatistical data, lattice data, and point patterns. In geostatistical data, $A$ is a fixed subset of $\mathbb{R}^d$ containing a $d$-dimensional rectangle of positive volume; $S(x)$ is a random vector at location $x \in A$. These data arise in areas such as atmospheric sciences, mining, and public health. In point patterns, $A$ is a point process in $\mathbb{R}^d$ or in a subset of $\mathbb{R}^d$; $S(x)$ is a random vector at location $x \in A$. In its most general form this yields a marked spatial point process; when $S(x) = 1$ for all $x \in A$, the ordinary (unmarked) spatial point process results. Point patterns arise when the variable to be analyzed is the location of "events".

Finally, in lattice data, also known as areal data, $A$ is a fixed regular or irregular collection of sites in $\mathbb{R}^d$ (supplemented with information on the surrounding neighborhood); $S(x)$ is a random vector at location $x \in A$. When the locations lie on a regular grid, this is the closest analogue to a time series observed at equally spaced time points. For lattice data, based on the general spatial process (1), it is assumed that $A$ is a countable collection of spatial sites at which the data are observed. The most common lattice data models are the Conditional Autoregressive (CAR) model and the Simultaneous Autoregressive (SAR) model. CAR models form the basis of Gaussian Markov random fields and of Integrated Nested Laplace Approximation (INLA) methods. SAR models are popular in geographic information systems. Another class is the spatial autoregressive moving average (ARMA) model [24, 25].

#### **2.1 Gaussian spatial processes**

Knowing the type of variables being analyzed, and taking their spatial dependence into account, helps to determine the regression technique that best fits the characteristics of the data [21]. Spatial data can be studied with Gaussian processes: stochastic processes, that is, collections of random variables, in which any finite subset of the variables has a multivariate Gaussian distribution. Gaussian processes can thus be thought of as distributions over random vectors or random functions [26]. Gaussian processes began to be studied in the 1940s, but it was not until the 1970s that they were used in geostatistics and meteorology; in the 1990s Cressie [24] began to apply them in spatial statistics. The term "model-based geostatistics" was first used to describe an approach to geostatistical problems based on formal statistical models and inference procedures [27].

Gaussian stochastic processes are widely used as models for geostatistical data. Transforming the original response variable extends the scope of Gaussian models, and with this extra flexibility the model often provides a good empirical fit to the data.

A Gaussian process, $\{S(x) : x \in \mathbb{R}^2\}$, is a stochastic process with the property that for any collection of locations $x_1, \ldots, x_n$, $x_i \in \mathbb{R}^2$, the joint distribution of $\mathbf{S} = \{S(x_1), \ldots, S(x_n)\}$ is multivariate Gaussian.

Any such process is fully specified by its mean function $\mu(x) = E[S(x)]$ and its covariance function $\text{Cov}\{S(x), S(x')\}$. Given an arbitrary set of locations $x_1, \ldots, x_n$, with $\mu = (\mu(x_1), \ldots, \mu(x_n))$ and $\mathbf{G}$ an $n \times n$ matrix with elements $G_{ij} = \text{Cov}\{S(x_i), S(x_j)\}$, $\mathbf{S}$ has a multivariate normal (MN) distribution,

$$\mathbf{S} \sim \text{MN}(\mu, \mathbf{G}) \tag{2}$$
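As a rough illustration of Eq. (2), the sketch below builds $\mathbf{G}$ for a handful of locations and draws one realization of $\mathbf{S}$ through a Cholesky factor. The exponential covariance $\sigma^2 \exp(-u/\phi)$, the parameter values, the constant mean, the site coordinates, and all function names are assumptions made for this example; they are not taken from the chapter's analysis.

```python
import math
import random

def exp_cov(u, sigma2=1.0, phi=1.0):
    # Assumed covariance for this example: sigma^2 * exp(-u / phi)
    return sigma2 * math.exp(-u / phi)

def cov_matrix(locs, sigma2=1.0, phi=1.0):
    # G_ij = Cov{S(x_i), S(x_j)}, evaluated at the Euclidean distance
    return [[exp_cov(math.dist(a, b), sigma2, phi) for b in locs] for a in locs]

def cholesky(G):
    # Lower-triangular L with L L^T = G (G must be positive definite)
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def sample_mvn(mu, G, rng):
    # If z ~ N(0, I), then mu + L z has distribution MN(mu, G)
    L = cholesky(G)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]

# Hypothetical sites in R^2 and hypothetical parameter values
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
G = cov_matrix(locs, sigma2=2.0, phi=1.5)
mu = [10.0] * len(locs)  # constant mean
S = sample_mvn(mu, G, random.Random(1))
```

Nearby sites receive correlated values because their entries in $\mathbf{G}$ are large; distant sites are nearly independent.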

A spatial Gaussian process is stationary if $\mu(x)$ is constant, $\mu(x) = \mu$ for all $x$, and $\text{Cov}\{S(x), S(x')\} = \text{Cov}(u)$, where $u = \lVert x - x' \rVert$ is the Euclidean distance. A stationary process is isotropic if the covariance between the values of $S(x)$ at any two locations depends only on the distance between them. The term stationary is often used as shorthand for stationary and isotropic. A process for which $S(x) - \mu(x)$ is stationary is called covariance stationary. Processes of this type are widely used in practice as models for geostatistical data [28].

*Spatial Statistics in Vector-Borne Diseases DOI: http://dx.doi.org/10.5772/intechopen.104953*

Among the parametric families for the covariance function [29] are the following:

Exponential:

$$Cov(u) = \sigma^2 \left[ \exp\left(\frac{-u}{\phi}\right) \right] \tag{3}$$

Gaussian:

$$\text{Cov}(u) = \sigma^2 \left[ \exp\left(-\left(\frac{u}{\phi}\right)^2\right) \right] \tag{4}$$

Matérn:

$$Cov(u) = \sigma^2 \left[ \frac{1}{2^{\kappa - 1} \Gamma(\kappa)} \left( \frac{u}{\phi} \right)^{\kappa} K_{\kappa} \left( \frac{u}{\phi} \right) \right] \tag{5}$$

In these covariance functions (Eqs. (3)–(5)), $u > 0$, $\phi > 0$, and $\kappa > 0$; $K_{\kappa}$ denotes the modified Bessel function of the second kind of order $\kappa$, and $\Gamma(\cdot)$ denotes the gamma function.
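The three families can be sketched in a few lines of code. The version below is illustrative only: the Gaussian family is written as $\sigma^2 \exp\{-(u/\phi)^2\}$, the Matérn family uses the normalizing constant $1/(2^{\kappa-1}\Gamma(\kappa))$ so that the covariance tends to $\sigma^2$ as $u \to 0$, and the Bessel function is approximated with a simple trapezoid rule on its integral representation rather than a library routine. Function names and the number of quadrature steps are assumptions of this example.

```python
import math

def bessel_k(nu, z, steps=4000):
    # Modified Bessel function of the second kind, from the integral
    # K_nu(z) = int_0^inf exp(-z cosh t) cosh(nu t) dt, by the trapezoid rule.
    t_max = math.log(1490.0 / z + 2.0)  # integrand is numerically zero beyond t_max
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def cov_exponential(u, sigma2, phi):
    return sigma2 * math.exp(-u / phi)

def cov_gaussian(u, sigma2, phi):
    return sigma2 * math.exp(-((u / phi) ** 2))

def cov_matern(u, sigma2, phi, kappa):
    if u == 0.0:
        return sigma2  # limiting value as u -> 0
    z = u / phi
    scale = 2.0 ** (kappa - 1.0) * math.gamma(kappa)
    return sigma2 * z ** kappa * bessel_k(kappa, z) / scale
```

With $\kappa = 0.5$ the Matérn function in this parameterization coincides with the exponential one, which gives a convenient numerical check.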

#### **2.2 Criteria for evaluating the covariance structure of the Gaussian process**

There are several criteria in the literature for validating the covariance structure of the Gaussian process in Eq. (2). Among the most used are the Mean Error (ME), Mean Square Error (MSE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Square Normalized Error (MSNE) (**Table 1**). ME and MSE should tend to zero when the covariance structure of the Gaussian process has been correctly estimated. MAE and RMSE are considered the most efficient criteria for validating the covariance structure. The RMSE, like the MAE, is expected to be small, while the MSNE is expected to be close to 1 [29, 30].


#### **Table 1.**

*Criteria for evaluating the covariance structure of the Gaussian process.*
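As a sketch of how such criteria might be computed from cross-validation output, the function below uses the usual textbook definitions, which may differ in detail from those in Table 1. The sign convention for the error (predicted minus observed), the use of the predictive standard deviation to normalize the MSNE, and the toy numbers are all assumptions of this example.

```python
import math

def validation_criteria(observed, predicted, pred_sd):
    # Prediction errors (assumed sign convention: predicted - observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    me = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    # Squared errors scaled by the predictive standard deviation at each site
    msne = sum((e / s) ** 2 for e, s in zip(errors, pred_sd)) / n
    return {"ME": me, "MSE": mse, "MAE": mae, "RMSE": rmse, "MSNE": msne}

# Toy values, for illustration only
crit = validation_criteria(
    observed=[1.0, 2.0, 3.0, 4.0],
    predicted=[1.5, 1.5, 3.5, 3.5],
    pred_sd=[0.5, 0.5, 0.5, 0.5],
)
```

In this toy case the errors cancel on average (ME of 0) and match the stated predictive standard deviations exactly (MSNE of 1).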

#### **2.3 Generalized linear spatial models**

Generalized linear spatial models (GLSM) were introduced by Diggle et al. in 1998 [31]. If the response variable $Y$ has a Poisson distribution, then

$$Y_i \mid S(\cdot) \sim \text{Poisson}(\mu_i) \tag{6}$$

where

$$\mathbf{S} \sim \text{MN}(\mathbf{D}\boldsymbol{\beta}, \mathbf{G})$$

It is assumed that $\{Y_i : i = 1, \ldots, n\}$, conditional on $\mathbf{S}$, are independent with $E[Y_i \mid S(\cdot)] = \mu_i$, and that $g$ is a known link function such that $g(\mu_i) = \eta_i$, so that $\mu_i = g^{-1}(\eta_i)$, $i = 1, \ldots, n$. $\mathbf{D} = (\mathbf{1}, \mathbf{d}_1, \ldots, \mathbf{d}_p)$ is an $n \times (p + 1)$ design matrix of full rank, $\mathbf{1}$ is an $n \times 1$ vector of ones, and $\mathbf{d}_j = (d_j(x_1), \ldots, d_j(x_n))'$, where $d_j(x_i)$ is the value of the $j$-th covariate at the $i$-th location; $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$ are the regression parameters.
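The two layers of the model can be made concrete with a small sketch. It assumes a log link, so that $\mu_i = \exp(\eta_i)$ with $\eta_i = S(x_i)$; the source only states that $g$ is a known link function, so this choice, the design matrix, the coefficient values, and the function names are assumptions of the example.

```python
import math

def linear_mean(D, beta):
    # Mean vector D beta of the latent Gaussian process S
    return [sum(d * b for d, b in zip(row, beta)) for row in D]

def poisson_loglik_given_s(y, s):
    # log prod_i f(y_i; mu_i), with mu_i = g^{-1}(s_i) = exp(s_i) (log link assumed).
    # Conditional independence given S turns the product into a sum of logs.
    total = 0.0
    for yi, si in zip(y, s):
        mu = math.exp(si)
        total += yi * si - mu - math.lgamma(yi + 1)
    return total

# Hypothetical design matrix: a column of ones plus one covariate
D = [[1.0, 0.2], [1.0, 1.5], [1.0, 3.0]]
beta = [0.5, 0.3]
mean_S = linear_mean(D, beta)
```

Here `mean_S` plays the role of $\mathbf{D}\boldsymbol{\beta}$, and `poisson_loglik_given_s` evaluates the Poisson part of the model for a given realization of the latent field.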

#### **2.4 Moran's index for spatial autocorrelation**

To test for the existence of spatial dependence in a variable $Y$, the Moran index [32, 33] is used, given by

$$IM = \frac{n\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}\left(Y_{i} - \overline{Y}\right)\left(Y_{j} - \overline{Y}\right)}{\sum_{i \neq j}^{n}w_{ij}\sum_{i=1}^{n}\left(Y_{i} - \overline{Y}\right)^{2}}\tag{7}$$

where $\mathbf{W} = (w_{ij})$ is the weights matrix that defines the relationships between the regions under study. Here $w_{ij} = 1$ denotes areas that share a border and $w_{ij} = 0$ otherwise. $Y_i$ and $Y_j$ are the values observed in regions $i$ and $j$, respectively, $\overline{Y}$ is the average incidence over the districts studied, and $n$ is the total number of localities.
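Equation (7) can be computed directly from the observed values and a binary contiguity matrix. The sketch below reads the double sum as running over pairs of distinct regions $i \neq j$; the four-region layout and the function name are hypothetical.

```python
def morans_i(y, W):
    # Moran's index for values y and weights matrix W (w_ij = 1 for neighbors)
    n = len(y)
    ybar = sum(y) / n
    dev = [v - ybar for v in y]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(W[i][j] * dev[i] * dev[j] for i, j in pairs)
    w_total = sum(W[i][j] for i, j in pairs)
    return n * cross / (w_total * sum(d * d for d in dev))

# Four hypothetical regions in a row; consecutive regions share a border
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = morans_i([1.0, 2.0, 3.0, 4.0], W)          # similar neighbors
alternating = morans_i([1.0, -1.0, 1.0, -1.0], W)  # dissimilar neighbors
```

A smooth trend gives a positive index, while a pattern in which every region differs from its neighbors gives a negative one.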

#### **2.5 R statistical software packages for spatial data**

Several packages are available in the statistical software R [34] for spatial modeling.

The *geoR* package performs geostatistical data analysis and spatial prediction, expanding the set of methods and tools available for spatial data analysis in R. The package implements methods for Gaussian and transformed Gaussian models; it includes functions and methods for reading and preparing data, exploratory analysis, inference on model parameters, and spatial interpolation, as well as functions for parameter estimation under Bayesian methods [35].

The *geoRglm* package implements the generalized linear spatial model (GLSM). Posterior and predictive inference is based on Markov chain Monte Carlo (MCMC) methods. This package, an extension of *geoR*, supports conditional simulation and prediction for the GLSM, and Bayesian inference, for the Poisson (*pois.krige*) and binomial (*binom.krige*) models [35, 36]. A Langevin-Hastings algorithm is used to obtain the MCMC simulations. In the *pois.krige* and *binom.krige* functions, the user can provide the proposal variance, *S.scale*; an initial value, *S.start*; the thinning interval, *thin*; the burn-in length, *burn.in*; and the number of iterations, *n.iter* [35].

#### *2.5.1 Inference for the generalized linear spatial model*

The Gaussian geostatistical model assumes a Gaussian response variable, which may be unrealistic for some data sets. The GLSM provides a framework for analyzing binomial and Poisson distributed data. In general, the likelihood for such a model cannot be written in closed form, since it is a high-dimensional integral

$$L(\boldsymbol{\beta}, \sigma^2, \phi) = \int \prod_{i=1}^n f(y_i; g^{-1}(s_i))\, p(\mathbf{s}; \boldsymbol{\beta}, \sigma^2, \phi)\, d\mathbf{s} \tag{8}$$

where $f(y; \mu)$ denotes the density of the distribution with mean $\mu$, $p(\mathbf{s}; \boldsymbol{\beta}, \sigma^2, \phi)$ is the multivariate Gaussian density for the vector $\mathbf{s}$ of random effects at the data locations, and $g(\cdot)$ is the link function. In practice, the high dimensionality of this integral precludes direct computation, so inference is based on MCMC.
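To make the structure of the integral in Eq. (8) concrete, the sketch below approximates it by plain Monte Carlo: it draws $\mathbf{s}$ from its Gaussian density and averages the conditional Poisson likelihood. This is not the Langevin-Hastings MCMC scheme used by *geoRglm*, and it becomes very inefficient as $n$ grows; it is only meant to show what is being integrated. The log link, the constant mean `beta0`, the exponential covariance, and every name in the code are assumptions of this example.

```python
import math
import random

def cholesky(G):
    # Lower-triangular L with L L^T = G (G must be positive definite)
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def mc_likelihood(y, locs, beta0, sigma2, phi, n_draws=2000, seed=0):
    # Naive Monte Carlo estimate of L(beta0, sigma2, phi) for Poisson counts y
    rng = random.Random(seed)
    n = len(y)
    # Assumed exponential covariance between the latent values at the data sites
    G = [[sigma2 * math.exp(-math.dist(a, b) / phi) for b in locs] for a in locs]
    L = cholesky(G)
    total = 0.0
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = [beta0 + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        # Conditional likelihood prod_i f(y_i; exp(s_i)), computed on the log scale
        log_f = sum(yi * si - math.exp(si) - math.lgamma(yi + 1)
                    for yi, si in zip(y, s))
        total += math.exp(log_f)
    return total / n_draws

# Two hypothetical sites with counts 1 and 0
sites = [(0.0, 0.0), (1.0, 0.0)]
estimate = mc_likelihood([1, 0], sites, beta0=0.0, sigma2=1.0, phi=1.0)
```

When `sigma2` is made very small the latent field collapses to its mean and the estimate approaches the ordinary Poisson likelihood, which is a simple way to check the code.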
