#### 2.10 Process model

The results of eight separate ordinary least squares linear regression models of salinity make up the rows of Table 1. The first five consist of an intercept and a single explanatory variable: $\text{closest\_inlet\_dist}_{it}$, $\text{category}_t$, $\text{inverse\_days\_survey}_t$, $\text{num\_storms}_t$, and $\text{month}_t$. The sixth and seventh contain an intercept plus, respectively, the sets of four short-term and long-term freshwater influx indices, $\{\text{1wkFWII\_r}_{it}\},\ r = 1, \dots, 4$, and $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$. We treated the short-term and long-term sets of indices as groups, assuming that if an index evaluated for one river is meaningful, then it is also meaningful for the other rivers. We discuss the eighth row in Section 4.2.

Adjusted R<sup>2</sup> is a modification of R<sup>2</sup> that penalizes the number of explanatory variables. While R<sup>2</sup> never decreases as more variables are added to a model, adjusted R<sup>2</sup> increases only if the added variable decreases the error sum of squares enough to offset the loss in error degrees of freedom.
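As a minimal sketch of this penalty, the standard definition of adjusted R<sup>2</sup> for a model with an intercept is $1 - (1 - R^2)\,(N - 1)/(N - P - 1)$, where $N$ is the number of observations and $P$ the number of explanatory variables. The function name and the numbers below are illustrative assumptions, not values from the analysis.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 for a model with an intercept and n_predictors slopes."""
    error_df = n_obs - n_predictors - 1  # error degrees of freedom
    if error_df <= 0:
        raise ValueError("need more observations than fitted coefficients")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / error_df


# Hypothetical comparison: four extra variables raise R^2 only slightly,
# so the loss of error degrees of freedom outweighs the gain.
small_model = adjusted_r2(r2=0.500, n_obs=100, n_predictors=4)
large_model = adjusted_r2(r2=0.505, n_obs=100, n_predictors=8)
keep_extra_variables = large_model > small_model
```

In this made-up case `keep_extra_variables` is false: R<sup>2</sup> went up, but adjusted R<sup>2</sup> went down, so the added variables would not be retained under the rule used in this section.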

The model with the long-term freshwater influx indices had the largest adjusted R<sup>2</sup> at 0.38, followed by the model with the distance from the nearest inlet (0.34), and the model with the short-term FWI indices (0.27). None of the other four models explained more than 5% of the variability in salinity. We chose the model with the long-term freshwater influx indices as the base upon which to build the mean function.
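A comparison of this kind can be sketched as follows: fit one OLS model per variable (or group of variables) and rank the candidates by adjusted R<sup>2</sup>. The data are synthetic stand-ins generated for illustration only; the group names echo the variables above, but the coefficients, sample size, and helper name `fit_adjusted_r2` are assumptions.

```python
import numpy as np


def fit_adjusted_r2(columns: np.ndarray, y: np.ndarray) -> float:
    """Fit y on an intercept plus `columns` by OLS and return adjusted R^2."""
    n_obs = y.shape[0]
    design = np.column_stack([np.ones(n_obs), columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sse = float(resid @ resid)                      # error sum of squares
    sst = float(((y - y.mean()) ** 2).sum())        # total sum of squares
    n_slopes = design.shape[1] - 1
    return 1.0 - (sse / (n_obs - n_slopes - 1)) / (sst / (n_obs - 1))


rng = np.random.default_rng(0)
n_obs = 2000

# Synthetic candidate groups (hypothetical values, not the survey data).
groups = {
    "2moFWII": rng.normal(size=(n_obs, 4)),            # four river indices
    "closest_inlet_dist": rng.normal(size=(n_obs, 1)),
    "num_storms": rng.normal(size=(n_obs, 1)),         # unrelated to response
}
salinity = (
    30.0
    - 2.0 * groups["2moFWII"].sum(axis=1)
    - 3.0 * groups["closest_inlet_dist"][:, 0]
    + rng.normal(scale=4.0, size=n_obs)
)

# One regression per group, ranked from best to worst adjusted R^2.
scores = {name: fit_adjusted_r2(cols, salinity) for name, cols in groups.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The first entry of `ranking` plays the role of the base model, and the remaining entries give the order in which other groups would be tried.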

To this base model we added the variable $\text{closest\_inlet\_dist}_{it}$, since the model containing this variable had the second-best performance, thus beginning a forward-selection process. Each time we added a variable or set of variables to the model, we kept it in the model if the new adjusted R<sup>2</sup> exceeded the old. Variables from the seven initial models were then added in order of decreasing adjusted R<sup>2</sup>. Following this procedure, the mean trend model grew to contain 10 variables, namely $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$; $\text{closest\_inlet\_dist}_{it}$; $\{\text{1wkFWII\_r}_{it}\},\ r = 1, \dots, 4$; and $\text{inverse\_days\_survey}_t$, with an adjusted R<sup>2</sup> of 0.57.

#### Table 1.

Adjusted R<sup>2</sup> for the eight initial linear regression models. All regressions include an intercept plus the variables listed.

Because the effect of FWI from one river on a given location in PS could change based on the FWI from another river during the same time period, we evaluated the addition of the six pairwise interactions among the four $\text{1wkFWII\_r}_{it}$, the six pairwise interactions among the four $\text{2moFWII\_r}_{it}$, and the twelve interactions between the $\text{1wkFWII\_r}_{it}$ and the $\text{2moFWII\_r}_{it}$, excluding interactions of one river's $\text{1wkFWII\_r}_{it}$ with its own $\text{2moFWII\_r}_{it}$. Despite a decrease of 24 in error degrees of freedom, adjusted R<sup>2</sup> rose from 0.57 to 0.66, so the set was retained.

Spatial coordinate variables were evaluated last in groups according to their polynomial order, with squared and cubic terms added before interactions. We considered these variables last because we wanted to include them only if they explained additional variability in the response after more interpretable variables were included. We determined that including all variables except $\text{northing}^2_{it} \ast \text{easting}^2_{it}$ increased the adjusted R<sup>2</sup>. The final process model mean function thus had an adjusted R<sup>2</sup> of 0.73 and included the following: $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$; $\text{closest\_inlet\_dist}_{it}$; $\{\text{1wkFWII\_r}_{it}\},\ r = 1, \dots, 4$; $\{\text{1wkFWII\_r}_{it} \ast \text{1wkFWII\_q}_{it}\},\ r \neq q$; $\{\text{2moFWII\_r}_{it} \ast \text{2moFWII\_q}_{it}\},\ r \neq q$; $\{\text{1wkFWII\_r}_{it} \ast \text{2moFWII\_q}_{it}\},\ r \neq q$; $\text{inverse\_days\_survey}_t$; $\text{easting}_{it}$, $\text{easting}^2_{it}$, $\text{northing}_{it}$, $\text{northing}^2_{it}$; and the interactions $\text{northing}_{it} \ast \text{easting}_{it}$, $\text{northing}^2_{it} \ast \text{easting}_{it}$, and $\text{northing}_{it} \ast \text{easting}^2_{it}$.

#### 2.11 Time model

To build the time model, we followed the same procedure described above, selecting for the base of the mean function a set of time period indicator variables because a linear regression of $\text{sal}_{it}$ on these variables had an adjusted R<sup>2</sup> of 0.41 (Table 1). (Note that such a model is equivalent to fitting an ANOVA model using the time periods as groups.) Again, we added other sets of explanatory variables in order of decreasing adjusted R<sup>2</sup>. Before evaluating interactions, the mean trend time model had an adjusted R<sup>2</sup> of 0.78 and contained 48 variables: $\{\text{timeper\_}\tau_t\},\ \tau = 1, \dots, 39$; $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$; $\text{closest\_inlet\_dist}_{it}$; and $\{\text{1wkFWII\_r}_{it}\},\ r = 1, \dots, 4$. When interactions among the $\{\text{timeper\_}\tau_t\},\ \tau = 1, \dots, 39$, and the $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$, were added, the model was not full rank (not all columns in the design matrix were linearly independent). Because we created this second model to evaluate these interactions, we removed the $\{\text{1wkFWII\_r}_{it}\},\ r = 1, \dots, 4$, the most recent variable addition, to include them. This new model, including the interactions, became the base since its adjusted R<sup>2</sup> (0.89) was larger than that of the previous mean trend time model (0.78). After investigating spatial coordinate variables, the final mean trend time model had an adjusted R<sup>2</sup> of 0.91 and included 204 variables: $\{\text{timeper\_}\tau_t\},\ \tau = 1, \dots, 39$; $\{\text{2moFWII\_r}_{it}\},\ r = 1, \dots, 4$; $\text{closest\_inlet\_dist}_{it}$; $\{\text{timeper\_}\tau_t \ast \text{2moFWII\_r}_{it}\},\ \tau = 1, \dots, 39,\ r = 1, \dots, 4$; $\text{easting}_{it}$, $\text{easting}^2_{it}$, $\text{northing}_{it}$, and $\text{northing}^2_{it}$. To avoid confusion later, note that the adjusted R<sup>2</sup> of 0.73 for the process model and 0.91 for the time model were based on fitting each model to the full dataset. In the next section, we report R<sup>2</sup> (not adjusted R<sup>2</sup>) based on a cross-validation dataset.

Process-Based Statistical Models Predict Dynamic Estuarine Salinity

DOI: http://dx.doi.org/10.5772/intechopen.89911

#### 2.12 Modeling spatially correlated error

The variable selection analyses above used ordinary least squares (OLS) regression to model salinity as a function of explanatory variables. That model can be written as

$$\text{sal}_{it} = \beta_0 + \beta_1 x_{1it} + \beta_2 x_{2it} + \cdots + \beta_P x_{Pit} + \varepsilon_{it}, \quad t = 1, \dots, 40, \quad i = 1, \dots, n_t \tag{2}$$

where $x_{pit}$ represents the value of the $p$th explanatory variable at space–time location $it$, for $p = 1, \dots, P$, where $P$ is the total number of explanatory variables. $\beta_0, \beta_1, \dots, \beta_P$ represent the intercept and regression coefficients, and deviations from the mean trend $\varepsilon_{it}$ are assumed to be independent and identically distributed, $\varepsilon_{it} \sim N(0, \sigma^2)$, with mean 0 and variance $\sigma^2$. The model can be equivalently written as $\text{sal}_{it} = \mathbf{x}_{it}^T \boldsymbol{\beta} + \varepsilon_{it}$, where $\mathbf{x}_{it}$ is the $(P + 1) \times 1$ vector containing the values of the explanatory variables at space–time location $it$, and $\boldsymbol{\beta}$ represents the $(P + 1) \times 1$ vector of regression coefficients. The same model written in matrix form is

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}), \tag{3}$$

where bold print indicates vectors and matrices, so that $\mathbf{Y}$, $\boldsymbol{\varepsilon}$, and $\mathbf{0}$ are $N \times 1$ vectors containing, respectively, all observations of salinity in the space–time domain, all deviations from the mean function, and all zeros. $\mathbf{X}$ is the $N \times (P + 1)$ design matrix whose rows represent space–time locations and whose columns contain the values of the explanatory variables (with a column of ones for the intercept), and $\mathbf{I}$ is the $N \times N$ identity matrix. Since a histogram of salinity observations is somewhat symmetric and bell-shaped, use of the normal distribution is justified.

Rarely, however, does the assumption of independent and identically distributed errors hold for observations of natural phenomena associated with locations in space and time. While it is intuitive that values of salinity located close together in space should be similar, it is also generally the case that the deviations from the mean function of observations located close together are similar. That similarity is referred to as spatial covariance, and the spatial covariance between deviations from the mean trend at two locations within the same time period can be modeled as a function of the distance separating them. Including in the overall model both a deterministic mean function and a spatial covariance function allowed predictions of salinity at locations where there were no observations.
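To illustrate the idea in Section 2.12 that covariance between deviations can be modeled as a function of separation distance, the sketch below contrasts the independent-error covariance $\sigma^2 \mathbf{I}$ of Eq. (3) with a distance-based covariance matrix. The exponential form, the parameter names `sill` and `range_param`, and the site coordinates are assumptions chosen for illustration; the text here does not state which covariance function was used.

```python
import numpy as np


def exponential_covariance(coords: np.ndarray, sill: float,
                           range_param: float) -> np.ndarray:
    """Covariance matrix with entries sill * exp(-distance / range_param)."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return sill * np.exp(-dists / range_param)


# Hypothetical easting/northing coordinates (km) of four sites in one time period.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [20.0, 20.0]])
sigma2 = 4.0  # assumed error variance

# Independent errors: off-diagonal covariances are all zero.
iid_cov = sigma2 * np.eye(len(coords))

# Spatially correlated errors: covariance decays as sites move farther apart.
spatial_cov = exponential_covariance(coords, sill=sigma2, range_param=10.0)
```

Both matrices share the same variance on the diagonal. Only `spatial_cov` lets the deviation observed at one site carry information about nearby unobserved sites, which is what makes prediction at new locations possible.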
