**2. GIS and geostatistical tools**

Environmental risk management involves the integrated use of geostatistical techniques and several tools provided in a GIS.

A GIS is a repository of tools for capturing, storing, updating, manipulating, analyzing, and displaying all geographic information [9]. The data are stored as collections of relational tables logically associated with each other by shared attributes. On the other hand, Geostatistics provides advanced techniques to predict a variable of interest at unsampled points and to support decisions concerning monitoring, sampling, planning and re-qualification of a territory [10].

Fig. 1 shows a flowchart regarding the steps which characterize the connection between GIS and geostatistical tools.

The first step is based on the creation of a geodatabase, that is a relational database which stores a series of tables holding feature classes, raster datasets, attributes and geographic information represented by shapefiles (i.e. geospatial vector data format which can spatially describe vector features, such as polygon, line or point, with defined shape and geometric location). Then, the next step concerns the georeferencing, that is the process of assigning real-world coordinates to each pixel of the raster data. In this case, these coordinates are often obtained through field surveys, by collecting coordinates with a GPS device, or by using Google Earth software. On the other hand, the shapefile data do not need to be georeferenced, but simply loaded in a geodatabase by specifying the coordinate system associated to the datasets; hence, the vector features in the shapefile will be represented by using the given coordinate system.

The third step involves the implementation of the geostatistical analysis which is organized in various phases, such as the exploratory spatial data analysis (i.e. mapping the spatial distribution using thematic maps), the estimation and modeling the spatial correlation structure exhibited by the data, predicting the variable of interest at unsampled locations by using spatial interpolation techniques (such as Simple, Ordinary and Universal Kriging, Indicator Kriging, Cokriging) and computing the corresponding prediction errors. Finally, the last step consists in the visualization of geostatistical results by producing reliable prediction maps, probability maps, and other useful cartographic representations.

**Figure 1.** Integration between GIS and geostatistical tools.

2 Current Air Quality Issues

its implementation have been detailed.

**2. GIS and geostatistical tools**

several tools provided in a GIS.

re-qualification of a territory [10].

by using the given coordinate system.

and geostatistical tools.

Thus, the usefulness of geostatistical techniques for monitoring Rn concentrations in soil gas and the integration of these results in an *ad hoc* WebGIS, named RnWebGIS, have been described. This innovative tool offers dynamic scenarios for monitoring and analyzing different environmental variables [7, 39]. The proposed RnWebGIS is based on an Open Source Software, which enables users to get online spatial and environmental information. Hence, after a discussion on the integration of GIS and geostatistical tools (Section 2), a brief review of some geostatistical techniques for modeling and prediction purposes has been presented (Section 3). A thorough case study concerning the Rn concentration in soil gas measured in several sites over Lecce district has been developed (Section 4). Finally, the RnWebGIS, where all the available geo-lithological information of the area under study, as well as the Rn prediction results have been integrated, has been proposed and the steps for

Environmental risk management involves the integrated use of geostatistical techniques and

A GIS is a repository of tools for capturing, storing, updating, manipulating, analyzing, and displaying all geographic information [9]. The data are stored as collections of relational tables logically associated with each other by shared attributes. On the other hand, Geostatistics provides advanced techniques to predict a variable of interest at unsampled points and to support decisions concerning monitoring, sampling, planning and

Fig. 1 shows a flowchart regarding the steps which characterize the connection between GIS

The first step is based on the creation of a geodatabase, that is a relational database which stores a series of tables holding feature classes, raster datasets, attributes and geographic information represented by shapefiles (i.e. geospatial vector data format which can spatially describe vector features, such as polygon, line or point, with defined shape and geometric location). Then, the next step concerns the georeferencing, that is the process of assigning real-world coordinates to each pixel of the raster data. In this case, these coordinates are often obtained through field surveys, by collecting coordinates with a GPS device, or by using Google Earth software. On the other hand, the shapefile data do not need to be georeferenced, but simply loaded in a geodatabase by specifying the coordinate system associated to the datasets; hence, the vector features in the shapefile will be represented

The third step involves the implementation of the geostatistical analysis which is organized in various phases, such as the exploratory spatial data analysis (i.e. mapping the spatial distribution using thematic maps), the estimation and modeling the spatial correlation structure exhibited by the data, predicting the variable of interest at unsampled locations by using spatial interpolation techniques (such as Simple, Ordinary and Universal Kriging, Indicator Kriging, Cokriging) and computing the corresponding prediction errors. Finally, the last step consists in the visualization of geostatistical results by producing reliable

prediction maps, probability maps, and other useful cartographic representations.

The integration between GIS and Geostatistics has been of interest since the early 1990s, when [20] and [8] illustrated the potential benefits of a close link between information systems and Spatial Analysis. Many contributions, concerning the integration of spatial statistical methods and a GIS environment, have been proposed in the literature [2, 3, 6, 9, 14, 37] over the years.

In order to improve the integration between GIS and Geostatistics, different methods of spatial analysis have been implemented in several kinds of GIS softwares [1, 3]. For example, among the most widely used commercial GIS packages, it is worth citing *ArcGIS*, *Geomedia* and *Mapinfo*, which offer extensions for geostatistical analysis; on the other hand, among the *Open Source* packages it is worth mentioning *GRASS* and *Quantum GIS* which use R, Java and Python in order to implement the geostatistical tools. But, the collection of spatial data analysis softwares now available does not completely include all the computational aspects. Therefore, in order to satisfy the specific requirements of the user it is often necessary to develop *ad hoc routines* for sharing GIS data, implementing macros and customized libraries of geostatistical functions in GIS environment, creating suitable script [36].

## **3. Geostatistical basic notions**

Geostatistical methods [15, 23, 28] are useful applied in many areas of environmental sciences (Geography, Geology, Environmental Sciences, Ecology, Meteorology, Agronomy), for studying phenomena which vary continuously over the survey domain. Usually, the final aim of geostatistical analysis consists of obtaining predictions of the variable of interest at unsampled points of the study area. In this case, the use of stochastic models to describe the spatial behaviour of the phenomenon under study, as well as its spatial uncertainty, is reasonable.

In the following sections, the most applied geostatistical techniques, developed for modeling and prediction purposes, both in the univariate and multivariate case, will be briefly described.

### **3.1. Modeling and prediction in the univariate case**

In Geostatistics, the available spatial data concerning the measurements of a single variable, at *n* points located over the area of interest, are considered as a particular realization of the random variable *<sup>Z</sup>*(**u***α*), *<sup>α</sup>* <sup>=</sup> 1, 2, . . . , *<sup>n</sup>*, where **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>* <sup>⊆</sup> *<sup>R</sup>d*, *<sup>d</sup>* <sup>∈</sup> *<sup>N</sup>*+. This random variable is called in [25] *regionalized variable*, in order to highlight the feature of the spatial dependency for the variable itself. The spatial distribution of the random variable *Z* is usually interpreted as the result of two overlapping structures: a systematic structure at the macroscopic level, which describes the global features of the same natural process, and another random structure at the microscopic level, which represents the local irregularity of *Z*.

In particular, by assuming the existence of the first and the second-order moments of the random variable *Z*, the basic stochastic model for *Z* is defined as follows:

$$Z(\mathbf{u}) = m(\mathbf{u}) + \mathcal{Y}(\mathbf{u}), \qquad \qquad \mathbf{u} \in D \subseteq \mathbb{R}^d, \ d \in \mathcal{N}\_{+}, \tag{1}$$

where *m*(**u**), usually called *trend*, describes the large-scale variations and *Y*(**u**) is a second-order stationary random field which represents the local fluctuations of *Z*.

Under the second-order stationary hypotheses, the expected value of *Z* exists and does not depend on **u** ∈ *D*; moreover the variogram exists and can be defined as follows:

$$\gamma(\mathbf{h}) = 0.5 \, E[Z(\mathbf{u}) - Z(\mathbf{u} + \mathbf{h})]^2,\tag{2}$$

where **h** = **u** − **u** is the separation vector between two support points **u** and **u** of the spatial domain *D*.

The variogram, by definition, is a measure of dissimilarity, in the sense that it increases when the difference between *Z*(**u**) and *Z*(**u** + **h**) increases; thus, it is usually characterized by small values for short spatial distances.

After computing the sample variogram [32], as a measure of the spatial correlation of the variable of interest, a theoretical admissible model has to be fitted to the estimated variogram.

It is worth noting that the condition for a function to be a variogram is that it be conditionally negative definite [11]. However, given the difficulties to verify this condition, it is advisable for users to look for the best model among the wide parametric families whose members are known to be conditionally negative definite functions. This condition ensures a unique solution in the kriging system (system of linear equations based on the variogram matrix of observations in different spatial points of the domain, as it will be discussed hereafter) used to tackle the general problem of predicting the process of interest, starting from the observed values.

Various classes of variogram models have been proposed in the literature; however the selection of an appropriate class of models is in general based on its geometric features and theoretical properties [10, 12, 13].

Once the spatial correlation of the variable under study has been properly modelled, some well-known based-variogram prediction techniques can be used in order to estimate the analyzed variable at unsampled points over the domain of interest.

The most applied spatial predictor is the Ordinary Kriging (OK) [10, 23] which is based on the following estimator:

$$
\hat{Z}^{OK}(\mathbf{u}) = \sum\_{a=1}^{n} \lambda\_a^{OK}(\mathbf{u}) Z(\mathbf{u}\_a),
\tag{3}
$$

where the kriging weights *λOK <sup>α</sup>* , *α* = 1, 2, . . . , *n*, depend on the variogram model and the sampling configuration.

The predictor *<sup>Z</sup>OK*(**u**) satisfies both the unbiasedness and the efficiency condition.

In particular, the first property is satisfied if the weights *λOK <sup>α</sup>* , *α* = 1, 2, . . . , *n*, of the linear combination are such that

$$\sum\_{i=1}^{n} \lambda\_{\alpha}^{OK}(\mathbf{u}) = 1. \tag{4}$$

On the other hand, *<sup>Z</sup>OK*(**u**) is efficient, if the weights are chosen in order to minimize the error variance *σ*<sup>2</sup> *<sup>E</sup>* <sup>=</sup> *<sup>E</sup>*[*R*(**u**)]<sup>2</sup> <sup>=</sup> *<sup>E</sup>*[*<sup>Z</sup>*(**u**) <sup>−</sup> *<sup>Z</sup>*(**u**)]2, under the unbiasedness condition.

Finally, it is important to point out that, if the variable under study is characterized by a high coefficient of variation and/or by different levels of variability, traditional prediction methods, such as OK, might not be appropriate. In particular, when the experimental data shown a skewed distribution, it is possible to apply a non-linear transformation at the observed data, such that the distribution of the transformed data is symmetric. In many cases, the logarithmic transformation is useful for this aim. Hence, structural analysis and kriging are performed by considering the log-transformed data.

#### **3.2. Modeling and prediction in the multivariate case**

In many environmental applications, the phenomenon under study is the result of the simultaneous behavior of several variables, which are correlated each other and measured at some points of the area of interest.

In this case, a multivariate spatial structure [34] characterizes the observed data which can be considered as a realization of a multivariate random field (*MRF*) {**Z**(**u**), **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>* <sup>⊆</sup> **<sup>R</sup>***d*}, with

$$\mathbf{Z(u)} = [Z\_1(\mathbf{u}), Z\_2(\mathbf{u}), \dots, Z\_p(\mathbf{u})]^T,\tag{5}$$

where *p* ≥ 2.

4 Current Air Quality Issues

described.

*Z*.

domain *D*.

values.

values for short spatial distances.

In the following sections, the most applied geostatistical techniques, developed for modeling and prediction purposes, both in the univariate and multivariate case, will be briefly

In Geostatistics, the available spatial data concerning the measurements of a single variable, at *n* points located over the area of interest, are considered as a particular realization of the random variable *<sup>Z</sup>*(**u***α*), *<sup>α</sup>* <sup>=</sup> 1, 2, . . . , *<sup>n</sup>*, where **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>* <sup>⊆</sup> *<sup>R</sup>d*, *<sup>d</sup>* <sup>∈</sup> *<sup>N</sup>*+. This random variable is called in [25] *regionalized variable*, in order to highlight the feature of the spatial dependency for the variable itself. The spatial distribution of the random variable *Z* is usually interpreted as the result of two overlapping structures: a systematic structure at the macroscopic level, which describes the global features of the same natural process, and another random structure at the microscopic level, which represents the local irregularity of

In particular, by assuming the existence of the first and the second-order moments of the

where *m*(**u**), usually called *trend*, describes the large-scale variations and *Y*(**u**) is a

Under the second-order stationary hypotheses, the expected value of *Z* exists and does not

where **h** = **u** − **u** is the separation vector between two support points **u** and **u** of the spatial

The variogram, by definition, is a measure of dissimilarity, in the sense that it increases when the difference between *Z*(**u**) and *Z*(**u** + **h**) increases; thus, it is usually characterized by small

After computing the sample variogram [32], as a measure of the spatial correlation of the variable of interest, a theoretical admissible model has to be fitted to the estimated variogram. It is worth noting that the condition for a function to be a variogram is that it be conditionally negative definite [11]. However, given the difficulties to verify this condition, it is advisable for users to look for the best model among the wide parametric families whose members are known to be conditionally negative definite functions. This condition ensures a unique solution in the kriging system (system of linear equations based on the variogram matrix of observations in different spatial points of the domain, as it will be discussed hereafter) used to tackle the general problem of predicting the process of interest, starting from the observed

second-order stationary random field which represents the local fluctuations of *Z*.

depend on **u** ∈ *D*; moreover the variogram exists and can be defined as follows:

*<sup>Z</sup>*(**u**) = *<sup>m</sup>*(**u**) + *<sup>Y</sup>*(**u**), **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>* <sup>⊆</sup> *<sup>R</sup>d*, *<sup>d</sup>* <sup>∈</sup> *<sup>N</sup>*+, (1)

*<sup>γ</sup>*(**h**) = 0.5 *<sup>E</sup>*[*Z*(**u**) <sup>−</sup> *<sup>Z</sup>*(**<sup>u</sup>** <sup>+</sup> **<sup>h</sup>**)]2, (2)

random variable *Z*, the basic stochastic model for *Z* is defined as follows:

**3.1. Modeling and prediction in the univariate case**

Under second-order stationarity, the mean vector of **Z** exists and does not depend on **u**, i.e.:

$$E[\mathbf{Z}(\mathbf{u})] = \mathbf{M} = [m\_1, m\_2, \dots, m\_p]^T, \qquad \mathbf{u} \in D,\tag{6}$$

and the (*p* × *p*) variogram matrix **Γ** exists and depends on the spatial separation vector **h** between two support points (**u** and **u** ), i.e.:

$$
\Gamma[\mathbf{Z}(\mathbf{u}), \mathbf{Z}(\mathbf{u}')] = E\left\{ [\mathbf{Z}(\mathbf{u}) - \mathbf{Z}(\mathbf{u}')] [\mathbf{Z}(\mathbf{u}) - \mathbf{Z}(\mathbf{u}')]^T \right\} = 
$$

$$
= \Gamma(\mathbf{h}) = [\gamma\_{ij}(\mathbf{h})]\_\prime \tag{7}
$$

where *γij*(**h**), *i*, *j* = 1, 2, . . . , *p*, is the cross-variogram between *Zi*(**u**) and *Zj*(**u** + **h**), when *i* = *j*, and the direct variogram of the *Zi*(**u**) , when *i* = *j*.

In multivariate Geostatistics, *Ordinary Cokriging* (OCK) [34] is a well-known technique used to obtain prediction of one or all the variables under study, at unsampled points of the spatial domain. This technique considers the following linear predictor of the spatial random vector defined in (5):

$$\hat{\mathbf{Z}}^{\text{OC}}(\mathbf{u}) = \sum\_{\alpha=1}^{n} \Lambda\_{\alpha}^{\text{OC}}(\mathbf{u}) \mathbf{Z}(\mathbf{u}\_{\alpha})\_{\prime} \tag{8}$$

where **u** ∈ *D* is an unsampled point in the domain *D*, **u***<sup>α</sup>* ∈ *D*, *α* = 1, 2, . . . , *n*, are the sampled points in the domain *D* and **Λ***OC <sup>α</sup>* (**u**) are the (*p* × *p*) matrices of weights, which are determined by ensuring unbiasedness and efficiency conditions.

The cokriging predictor requires a model for the variogram matrix (7). The *Linear Model of Coregionalization* (*LMC*), widely discussed in [23] and [34], is the most commonly model used in the applications. In particular, the *LMC* can be written as

$$\Gamma(\mathbf{h}) = \sum\_{l=1}^{L} \mathbf{B}\_l \, \mathcal{g}\_l(\mathbf{h}), \tag{9}$$

where **B***l*, *l* = 1, 2, . . . , *L*, are (*p* × *p*) positive definite matrices, called *coregionalization matrices*, while *gl*(**h**), *l* = 1, 2, . . . , *L*, are basic variograms which correspond to different scales of spatial variability.

A particular case of cokriging, which allows non-parametric, non-linear predictions [22], is the *Cokriging with Indicator Variable* (ICK), where the values observed for one and/or all the variables under study, are replaced by indicator values (that is 1 or 0, depending on a condition specified by the analyst which is of interest for the phenomenon under study).

Another very useful prediction technique, developed in the context of multivariate Geostatistics, is the *Kriging with Varying Means* (KVM), which corresponds to an extension of the *simple kriging* used in the univariate case. In the context of two or more variables, the

stationary mean of the primary variable is replaced with a known varying mean according to the secondary information [21]. Hence, the KVM predictor is computed as follows:

$$\hat{Z}^{KVM}(\mathbf{u}) = \sum\_{a=1}^{n} \lambda\_a^{KVM}(\mathbf{u}) [Z(\mathbf{u}\_a) - m^{KVM}(\mathbf{u}\_a)] \tag{10}$$

where **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>* is the unsampled point, while *<sup>λ</sup>KVM <sup>α</sup>* , *α* = 1, 2, . . . , *n*, are the weights of the predictor. Note that if the secondary variable is categorical, the primary local mean is the mean of the primary variable within a specific category of the secondary variable.

In the proposed case study, the above mentioned prediction techniques will be performed and the main prediction results will be discussed.

#### **4. Case study: the survey area and the data set**

6 Current Air Quality Issues

defined in (5):

spatial variability.

between two support points (**u** and **u**

**Γ**[**Z**(**u**), **Z**(**u**

*i* = *j*, and the direct variogram of the *Zi*(**u**) , when *i* = *j*.

sampled points in the domain *D* and **Λ***OC*

*E*[**Z**(**u**)] = **M** = [*m*1, *m*2,..., *mp*]

), i.e.:

)] = *E* 

**<sup>Z</sup>***OC*(**u**) =

**Γ**(**h**) =

determined by ensuring unbiasedness and efficiency conditions.

in the applications. In particular, the *LMC* can be written as

and the (*p* × *p*) variogram matrix **Γ** exists and depends on the spatial separation vector **h**

[**Z**(**u**) − **Z**(**u**

where *γij*(**h**), *i*, *j* = 1, 2, . . . , *p*, is the cross-variogram between *Zi*(**u**) and *Zj*(**u** + **h**), when

In multivariate Geostatistics, *Ordinary Cokriging* (OCK) [34] is a well-known technique used to obtain prediction of one or all the variables under study, at unsampled points of the spatial domain. This technique considers the following linear predictor of the spatial random vector

> *n* ∑ *α*=1

**Λ***OC*

where **u** ∈ *D* is an unsampled point in the domain *D*, **u***<sup>α</sup>* ∈ *D*, *α* = 1, 2, . . . , *n*, are the

The cokriging predictor requires a model for the variogram matrix (7). The *Linear Model of Coregionalization* (*LMC*), widely discussed in [23] and [34], is the most commonly model used

> *L* ∑ *l*=1

where **B***l*, *l* = 1, 2, . . . , *L*, are (*p* × *p*) positive definite matrices, called *coregionalization matrices*, while *gl*(**h**), *l* = 1, 2, . . . , *L*, are basic variograms which correspond to different scales of

A particular case of cokriging, which allows non-parametric, non-linear predictions [22], is the *Cokriging with Indicator Variable* (ICK), where the values observed for one and/or all the variables under study, are replaced by indicator values (that is 1 or 0, depending on a condition specified by the analyst which is of interest for the phenomenon under study). Another very useful prediction technique, developed in the context of multivariate Geostatistics, is the *Kriging with Varying Means* (KVM), which corresponds to an extension of the *simple kriging* used in the univariate case. In the context of two or more variables, the

)][**Z**(**u**) − **Z**(**u**

)]*T* =

*<sup>α</sup>* (**u**)**Z**(**u***α*), (8)

**B***<sup>l</sup> gl*(**h**), (9)

*<sup>α</sup>* (**u**) are the (*p* × *p*) matrices of weights, which are

= **Γ**(**h**)=[*γij*(**h**)], (7)

*<sup>T</sup>*, **<sup>u</sup>** <sup>∈</sup> *<sup>D</sup>*, (6)

The study area is located in the Southern part of Italy (Lecce district in the Apulian region), at 199 meters above the sea level (Fig. 2).

**Figure 2.** Location map of survey stations in Lecce district.

It is characterized by a limestone substrate and the presence of sink-holes, caves and underground drainages. Moreover, the area is characterized by 6 different lithotypes, with predominance of sandstone (45.7%), limestone and dolomitic limestone (29.4%), as well as siltstone and sandstone (23.9%). Because of this particular lithology, a medium-high level of underground permeability marks the area out. Fig. 3 illustrates the maps of geo-lithological features and permeability levels.

The available geo-lithological information about the survey area represents the *soft* information which can better explain the outdoor Rn concentrations and the corresponding exhalation rate from soil. In order to use the soft information in the modeling and prediction stages of the geostatistical analysis, a regular georeferenced grid covering the area under study (Fig. 4-a)) has been defined. The node separation distance (along the horizontal and vertical directions) has been set equal to 2 km. Thus, the soft data

**Figure 3.** Maps of geo-lithological features in Lecce district. a) Map of geo-lithological features. b) Map of underground permeability levels. c) Map of tectonics features. d) Map of karst forms.

have been stored in a GIS project by using Arcmap of the ArcGIS software and the shapefile (data sources: http://www.sit.puglia.it) concerning the geo-lithological features, the underground permeability level, the tectonic (such as distance from tectonic elements) and karst features (such as distance from karst forms). In particular, by using the *spatial join* tool (a geoprocessing tool of Arcmap), all the soft data have been assigned to each grid point (for example, for each grid point it is possible to identify the type of geo-lithology composition, the distance grid point-faults, the distance grid point-karst forms).

On the other hand, the available Rn concentrations in soil gas have been considered as *hard* information. In particular, the in soil gas Rn levels have been measured at 32 locations over

Lecce district (Fig. 2), during a campaign conducted in June-July 2012 on the basis of a suitable spatial sampling.

**Figure 4.** a) Regular grid of cell size 2 × 2 km and grid points with node separation distance equal to 2 km. b) Survey stations for Rn concentrations in soil gas, together with the sampling template.

For each selected site, the Rn concentrations have been detected by using an active sampling-based system, in which the Rn and its decay products are conveyed in proximity of the radiation detector through a mechanical pumping. A special instrument (AlphaGUARD radon monitor by Saphymo GmbH), equipped with an ionization chamber has been used. The chamber has been connected to a probe, introduced into the ground up to a maximum of 1 meter deep, and to a pump for the aspiration of Rn gas at fixed intervals of time. In particular, at each sample location, the measurements of the Rn levels in soil gas have been performed, every minute for 20 minutes, by aspirating the air through the probe introduced into the ground at 5 points according to a regular template (sampling template in Fig. 4). The template consists of a square with a side of 4 meters; thus the Rn measurements have been made both in the center and the corresponding vertices. Hence, both the center and the vertices of the square have been numbered, starting from the top of the North-West and proceeding clockwise, as considered in the sampling protocol.

Finally, the average of the observed Rn concentrations for each site has been assigned and georeferenced at the central point of the sampling template (number 5 in Fig. 4).

All the 32 sample values (hard data) have been georeferenced and stored in a GIS project; then they have been integrated with the soft information.

#### **4.1. Geostatistical analysis: main results**

8 Current Air Quality Issues

**Figure 3.** Maps of geo-lithological features in Lecce district. a) Map of geo-lithological features. b) Map of underground

have been stored in a GIS project by using Arcmap of the ArcGIS software and the shapefile (data sources: http://www.sit.puglia.it) concerning the geo-lithological features, the underground permeability level, the tectonic (such as distance from tectonic elements) and karst features (such as distance from karst forms). In particular, by using the *spatial join* tool (a geoprocessing tool of Arcmap), all the soft data have been assigned to each grid point (for example, for each grid point it is possible to identify the type of geo-lithology

On the other hand, the available Rn concentrations in soil gas have been considered as *hard* information. In particular, the in soil gas Rn levels have been measured at 32 locations over

composition, the distance grid point-faults, the distance grid point-karst forms).

permeability levels. c) Map of tectonics features. d) Map of karst forms.

In the following the spatial concentrations of the Rn in soil gas have been analyzed and predicted at unsampled points of the study area by applying different techniques, such as OK, LK, ICK and KVM.

Note that, the above mentioned multivariate interpolation methods utilize both image-derived (soft) data which are more exhaustive and higher accurate than hard data, typically sparse over the domain. Using ICK and KVM is justified when the measurements of environmental variables can be characterized by a high coefficient of variation and/or by different level of variability; indeed in such cases traditional prediction methods, such as OK and LK, are not appropriate because they do not account for these features.

#### *Ordinary kriging predictions*

In order to predict Rn concentrations in soil gas by using OK method, the structural analysis has been performed on the measured values. After computing the sample variogram, the following Gaussian variogram model has been fitted:

$$\gamma(\mathbf{h})^{OK} = 80 + 263 \left[ 1 - \exp\left( -3 ||\mathbf{h}|| / 20 \right)^2 \right] \tag{11}$$

where nugget and sill correspond to 80 and 263, respectively, while range is equal to 20 km. Fig. 5-b) shows the estimated Rn concentrations in soil gas obtained by using OK.

**Figure 5.** a) Sample variogram and fitted model for Rn concentrations. b) Contour map of the estimated Rn concentrations using Ordinary kriging.

In particular, the prediction map in Fig. 5-b) highlights three Rn prone area, characterized by high Rn concentration values, in the North-Western, South-West and South-South-West of the domain of interest.

#### *Log-Normal kriging predictions*

10 Current Air Quality Issues

*Ordinary kriging predictions*

Note that, the above mentioned multivariate interpolation methods utilize both image-derived (soft) data which are more exhaustive and higher accurate than hard data, typically sparse over the domain. Using ICK and KVM is justified when the measurements of environmental variables can be characterized by a high coefficient of variation and/or by different level of variability; indeed in such cases traditional prediction methods, such as OK

In order to predict Rn concentrations in soil gas by using OK method, the structural analysis has been performed on the measured values. After computing the sample variogram, the

where nugget and sill correspond to 80 and 263, respectively, while range is equal to 20 km.

**Figure 5.** a) Sample variogram and fitted model for Rn concentrations. b) Contour map of the estimated Rn

In particular, the prediction map in Fig. 5-b) highlights three Rn prone area, characterized by high Rn concentration values, in the North-Western, South-West and South-South-West

Fig. 5-b) shows the estimated Rn concentrations in soil gas obtained by using OK.

1 − exp (−3||**h**||/20)

2 

(11)

and LK, are not appropriate because they do not account for these features.

*OK* = 80 + 263

a) b)

concentrations using Ordinary kriging.

of the domain of interest.

following Gaussian variogram model has been fitted:

*γ*(**h**)

In this case, the Rn measured values have been first processed by applying the logarithmic transformation. Hence, the structural analysis has been performed on the log-transformed data. In Fig. 6-a) the sample variogram computed for the log-transformed data is illustrated together with the fitted model. In particular, the variogram model considered for the LK prediction technique is given below:

$$\gamma(\mathbf{h})^{LK} = 0.1 + 1.2 \left[ 1 - \exp\left( -3||\mathbf{h}||/20 \right)^2 \right] \tag{12}$$

where nugget and sill correspond to 0.1 and 1.2, respectively, while range is equal to 20 km.

Fig. 6-b) shows the predictions for Rn concentrations obtaining by using LK, after a back-transformation on the data has been applied.

**Figure 6.** a) Sample variogram and fitted model for Log-Normal Rn concentrations. b) Contour map of estimated Rn concentrations using LK, after a back-transformation on the data.

Despite the previous case (Fig. 5-b)), the prediction map illustrated in Fig. 6-b) highlights only one area with high Rn concentrations, located in the North-Western part of the domain of interest. From the prediction LK results, it is evident the significant *smoothing effect* produced by the interpolation process based on LK.

#### *Cokriging predictions with indicator variable*

This method has been applied by using a double source of information, i.e. the sampled Rn concentrations (primary variable) and the underground permeability levels (secondary or auxiliary variable). Evidently the auxiliary information is richer over the study area than the primary variable, hence it has been used to improve predictions of the Rn concentrations in soil gas. In particular, the auxiliary data, which represent the three permeability classes for any node of the grid, have been derived by the software Arcmap of ArcGIS and the shapefile concerning the underground permeability level in Lecce district. A value has been assigned to each point of the grid and to each survey station, according to the different permeability levels (the value 1 for low permeability, the value 2 for medium permeability and the value 3 for high level of permeability).

In this case, the following *LMC* [33, 34] has been fitted to the sample variogram matrix:

$$\mathbf{T} = \mathbf{B}\_1 \,\, \mathbf{g}\_1(\mathbf{h}) + \mathbf{B}\_2 \,\, \mathbf{g}\_2(\mathbf{h})\tag{13}$$

where *g*<sup>1</sup> is a nugget model and *g*<sup>2</sup> is a gaussian model [23] with unit sill and range equal to 20 km; while **B***l*, *l* = 1, 2, are the following coregionalization matrices:

$$\mathbf{B}\_1 = \begin{bmatrix} 80 & 2\\ 2 & 0.125 \end{bmatrix}, \qquad \mathbf{B}\_2 = \begin{bmatrix} 263 & 4.8\\ 4.8 & 0.088 \end{bmatrix}. \tag{14}$$

The sample direct and cross variograms, shown in Fig. 7-a), have been modelled as follows:

$$\gamma\_{11}(\mathbf{h}) = 80 \mathbf{g}\_1(\mathbf{h}) + 26 \mathbf{3} \mathbf{g}\_2(\mathbf{h}),$$

$$\gamma\_{22}(\mathbf{h}) = 0.125 \mathcal{g}\_1(\mathbf{h}) + 0.088 \mathcal{g}\_2(\mathbf{h})\_{\prime\prime}$$

$$\gamma\_{12}(\mathbf{h}) = 2\mathbf{g}\_1(\mathbf{h}) + 4.8\mathbf{g}\_2(\mathbf{h}) .$$

Finally, cokriging predictions of Rn concentrations in soil gas have been obtained as shown in Fig. 7-b).

In particular, the prediction map shown in Fig. 7-b) highlights three Rn prone areas, characterized by high Rn concentration values, in the North-Western, South-West and South-South-West of the domain of interest.

#### *Kriging predictions with Varying Means*

In order to apply this interpolation technique to the available data set, the Rn concentrations have been grouped by the three permeability classes, then the hypothesis of the presence of different mean levels of Rn concentrations in soil gas has been assumed.

After the structural analysis, the following variogram model (Fig. 8-a)) has been considered:

$$\gamma(\mathbf{h})^{KVM} = 170 \left[ 1 - \exp\left(-3||\mathbf{h}||/15\right) \right] \tag{15}$$

where sill is equal to 170 and range equals to 15 km.

12 Current Air Quality Issues

in Fig. 7-b).

3 for high level of permeability).

primary variable, hence it has been used to improve predictions of the Rn concentrations in soil gas. In particular, the auxiliary data, which represent the three permeability classes for any node of the grid, have been derived by the software Arcmap of ArcGIS and the shapefile concerning the underground permeability level in Lecce district. A value has been assigned to each point of the grid and to each survey station, according to the different permeability levels (the value 1 for low permeability, the value 2 for medium permeability and the value

In this case, the following *LMC* [33, 34] has been fitted to the sample variogram matrix:

where *g*<sup>1</sup> is a nugget model and *g*<sup>2</sup> is a gaussian model [23] with unit sill and range equal to

, **B**<sup>2</sup> =

The sample direct and cross variograms, shown in Fig. 7-a), have been modelled as follows:

*γ*11(**h**) = 80*g*1(**h**) + 263*g*2(**h**),

*γ*22(**h**) = 0.125*g*1(**h**) + 0.088*g*2(**h**),

*γ*12(**h**) = 2*g*1(**h**) + 4.8*g*2(**h**). Finally, cokriging predictions of Rn concentrations in soil gas have been obtained as shown

In particular, the prediction map shown in Fig. 7-b) highlights three Rn prone areas, characterized by high Rn concentration values, in the North-Western, South-West and

In order to apply this interpolation technique to the available data set, the Rn concentrations have been grouped by the three permeability classes, then the hypothesis of the presence of

After the structural analysis, the following variogram model (Fig. 8-a)) has been considered:

*<sup>γ</sup>*(**h**)*KVM* <sup>=</sup> <sup>170</sup> [<sup>1</sup> <sup>−</sup> exp (−3||**h**||/15)] (15)

different mean levels of Rn concentrations in soil gas has been assumed.

20 km; while **B***l*, *l* = 1, 2, are the following coregionalization matrices:

80 2 2 0.125

**B**<sup>1</sup> = 

South-South-West of the domain of interest.

where sill is equal to 170 and range equals to 15 km.

*Kriging predictions with Varying Means*

**Γ** = **B**<sup>1</sup> *g*1(**h**) + **B**<sup>2</sup> *g*2(**h**) (13)

263 4.8 4.8 0.088 . (14)

**Figure 7.** a) Sample direct and cross variograms and fitted model for Rn and indicator data. b) Contour map of estimated Rn concentrations using cokriging with indicator variable.

**Figure 8.** a) Sample variogram and fitted model for residual hard data. b) Contour map of estimated Rn concentrations using KVM.

Hence, predictions for Rn concentrations in soil gas have been obtained for the study area (Fig. 8-b)) by using KVM.

In particular, the prediction map in Fig. 8-b) confirms that high Rn concentration values are located in the North-Western, South-West and South-South-West of the domain of interest.

#### **4.2. A comparison among the prediction results**

All the four contour maps illustrated in Figg. 5-b)-8-b) reproduce almost a similar behaviour, with high levels of Rn concentrations located in the Western part of the domain of interest. However, the predictions maps obtained from the KVM and ICK techniques, exploit the auxiliary information and identify the Rn prone areas more accurately than the ones computed by using the OK and LK techniques.

Moreover, the four interpolation techniques have highlighted the existence of a direction (135*<sup>o</sup>* from North to South, close to the Western coast of the study area), along with the Rn concentrations in soil gas present high levels.

The accuracy of the predictions has been measured through the *Root Mean Square Error* (RMSE) index [35, 38], defined as follows:

$$RMSE = \sqrt{\frac{1}{n} \sum\_{\alpha=1}^{n} [z(\mathbf{u}\_{\alpha}) - \hat{z}(\mathbf{u}\_{\alpha})]^2} \,\tag{16}$$

where *<sup>z</sup>*(**u***α*) represent the measured values, *<sup>z</sup>*(**u***α*) correspond to the estimated values and *<sup>n</sup>* is the number of sample data.

By analyzing the RMSE shown in Fig. 9 it is evident that the accuracy of predictions is lower for OK and LK than the other two interpolation techniques.

**Figure 9.** RMSE computed for the OK, LK, KVM and ICK predictions.

14 Current Air Quality Issues

using KVM.

(Fig. 8-b)) by using KVM.

a) b)

**4.2. A comparison among the prediction results**

computed by using the OK and LK techniques.

concentrations in soil gas present high levels.

*RMSE* =

 1 *n*

(RMSE) index [35, 38], defined as follows:

**Figure 8.** a) Sample variogram and fitted model for residual hard data. b) Contour map of estimated Rn concentrations

Hence, predictions for Rn concentrations in soil gas have been obtained for the study area

In particular, the prediction map in Fig. 8-b) confirms that high Rn concentration values are located in the North-Western, South-West and South-South-West of the domain of interest.

All the four contour maps illustrated in Figg. 5-b)-8-b) reproduce almost a similar behaviour, with high levels of Rn concentrations located in the Western part of the domain of interest. However, the predictions maps obtained from the KVM and ICK techniques, exploit the auxiliary information and identify the Rn prone areas more accurately than the ones

Moreover, the four interpolation techniques have highlighted the existence of a direction (135*<sup>o</sup>* from North to South, close to the Western coast of the study area), along with the Rn

The accuracy of the predictions has been measured through the *Root Mean Square Error*

*n* ∑ *α*=1

[*z*(**u***α*) <sup>−</sup> *<sup>z</sup>*(**u***α*)]2, (16)

However, if the comparison is expressed in terms of the *relative percentage changes* of RMSE index:


The better performance of the multivariate prediction techniques, compared with the univariate ones (OK and LK), is due to the appropriate use of the information given by the auxiliary variables (soft data). Moreover, among the univariate prediction techniques, the LK has shown a better performance since the variable under study has a skewed distribution.
