For the space

$$\mathcal{H}_1 = \left\{ f : \int_0^1 f^{(\nu)}(x)\,dx = 0,\ \nu = 0, 1, \dots, m-1,\ f^{(m)} \in L_2[0,1] \right\},$$

one can check that the reproducing kernel in $\mathcal{H}_1$ is

$$R_1(x, y) = k_m(x) k_m(y) + (-1)^{m-1} k_{2m}(x - y).$$

(See details in [11], Section 2.3.)

#### 2.5. SSANOVA models on product domains

A natural way to construct a reproducing kernel Hilbert space on the product domain $\prod_{j=1}^{d} \mathcal{X}_j$ is to take the tensor product of spaces constructed on the marginal domains $\mathcal{X}_j$'s. According to the Moore-Aronszajn theorem, every nonnegative definite function $R$ corresponds to a reproducing kernel Hilbert space with $R$ as its reproducing kernel. Therefore, the construction of a tensor product reproducing kernel Hilbert space is induced by constructing its reproducing kernel.

Theorem 2.4. Suppose that $R_{\langle 1\rangle}(x_{\langle 1\rangle}, y_{\langle 1\rangle})$ is nonnegative definite on $\mathcal{X}_1$ and $R_{\langle 2\rangle}(x_{\langle 2\rangle}, y_{\langle 2\rangle})$ is nonnegative definite on $\mathcal{X}_2$. Then, $R(x, y) = R_{\langle 1\rangle}(x_{\langle 1\rangle}, y_{\langle 1\rangle}) R_{\langle 2\rangle}(x_{\langle 2\rangle}, y_{\langle 2\rangle})$ is nonnegative definite on $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$.

Theorem 2.4 implies that a reproducing kernel $R$ on a tensor product reproducing kernel Hilbert space can be derived from the reproducing kernels on the marginal domains. Indeed, let $\mathcal{H}_{\langle j\rangle}$ be the space on $\mathcal{X}_j$ with reproducing kernel $R_{\langle j\rangle}$, where $j = 1, 2$. Then, $R = R_{\langle 1\rangle} R_{\langle 2\rangle}$ is nonnegative definite on $\mathcal{X}_1 \times \mathcal{X}_2$. The space $\mathcal{H}$ corresponding to $R(\cdot, \cdot)$ is called the tensor product space of $\mathcal{H}_{\langle 1\rangle}$ and $\mathcal{H}_{\langle 2\rangle}$, denoted by $\mathcal{H} = \mathcal{H}_{\langle 1\rangle} \otimes \mathcal{H}_{\langle 2\rangle}$.

One can decompose each marginal space $\mathcal{H}_{\langle j\rangle}$ into $\mathcal{H}_{\langle j\rangle} = \mathcal{H}_{\langle j\rangle 0} \oplus \mathcal{H}_{\langle j\rangle 1}$, where $\mathcal{H}_{\langle j\rangle 0}$ denotes the averaging space and $\mathcal{H}_{\langle j\rangle 1}$ denotes its orthogonal complement. Then, by the discussion in Section 2.4, the one-way ANOVA decomposition on each marginal space can be generalized to the multidimensional space $\mathcal{H} = \otimes_{j=1}^{d} \mathcal{H}_{\langle j\rangle}$ as

$$\mathcal{H} = \otimes_{j=1}^{d} \left( \mathcal{H}_{\langle j\rangle 0} \oplus \mathcal{H}_{\langle j\rangle 1} \right) = \oplus_{S} \left\{ \left( \otimes_{j \in S} \mathcal{H}_{\langle j\rangle 1} \right) \otimes \left( \otimes_{j \notin S} \mathcal{H}_{\langle j\rangle 0} \right) \right\} = \oplus_{S} \mathcal{H}^{S}, \tag{11}$$

where $S$ denotes all the subsets of $\{1, \dots, d\}$. The component $f^S$ in (8) is in the space $\mathcal{H}^S$. Based on this decomposition, the minimizer of (2) is called a tensor product smoothing spline. One can construct a tensor product smoothing spline following Theorem 2.3, in which the reproducing kernel term may be calculated in the same way as the tensor product (11).

In the following, we give some examples of tensor product smoothing splines on product domains.

#### 2.5.1. Smoothing splines on {1, …, K} × [0, 1]

We construct the reproducing kernel Hilbert space by using

$$R_{\langle 1\rangle 0} = 1/K \quad \text{and} \quad R_{\langle 1\rangle 1} = I_{\left[x_{\langle 1\rangle} = y_{\langle 1\rangle}\right]}$$

on $\{1, \dots, K\}$. On $[0, 1]$, assume that $m = 2$; then we have

$$R_{\langle 2\rangle 0} = 1 + k_1(x_{\langle 2\rangle}) k_1(y_{\langle 2\rangle})$$

and
$$R_{\langle 2\rangle 1} = k_2(x_{\langle 2\rangle}) k_2(y_{\langle 2\rangle}) - k_4(x_{\langle 2\rangle} - y_{\langle 2\rangle}).$$

In this case, the space H can be further decomposed as

$$\mathcal{H} = \left(\mathcal{H}_{\langle 1\rangle 0} \oplus \mathcal{H}_{\langle 1\rangle 1}\right) \otimes \left(\mathcal{H}_{\langle 2\rangle 00} \oplus \mathcal{H}_{\langle 2\rangle 01} \oplus \mathcal{H}_{\langle 2\rangle 1}\right). \tag{12}$$

The reproducing kernels of the tensor product cubic spline on $\{1, \dots, K\} \times [0, 1]$ are listed in Table 1.

Table 1. Reproducing kernels of (12) on {1, …, K} × [0, 1] when m = 2.
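By Theorem 2.4, the reproducing kernels of the subspaces in (12), i.e., the entries of Table 1, are obtained as pairwise products of the marginal kernels above. The sketch below evaluates them in Python, assuming (as in the standard construction in [11]) that $k_\nu$ denotes the scaled Bernoulli polynomial $B_\nu(\cdot)/\nu!$; the function and component names are ours, introduced only for illustration.

```python
import numpy as np

# Scaled Bernoulli polynomials, k_nu = B_nu / nu! (assumed definition of k_1, k_2, k_4).
def k1(t): return t - 0.5
def k2(t): return 0.5 * (k1(t) ** 2 - 1.0 / 12.0)
def k4(t): return (k1(t) ** 4 - 0.5 * k1(t) ** 2 + 7.0 / 240.0) / 24.0

K = 5  # number of levels of the discrete marginal domain {1, ..., K}

# Marginal kernels on {1, ..., K}, as given above.
def R1_0(u, v): return 1.0 / K
def R1_1(u, v): return float(u == v)

# Marginal kernels on [0, 1] for m = 2; R2_0 = R2_00 + R2_01 splits as in (12).
def R2_00(s, t): return 1.0
def R2_01(s, t): return k1(s) * k1(t)
def R2_1(s, t):  return k2(s) * k2(t) - k4(abs(s - t))   # k_4 evaluated at |s - t|

# Tensor product kernels of the subspaces in (12): products of the marginal kernels (Theorem 2.4).
COMPONENTS = {
    "H<1>0 x H<2>00": lambda x, y: R1_0(x[0], y[0]) * R2_00(x[1], y[1]),
    "H<1>0 x H<2>01": lambda x, y: R1_0(x[0], y[0]) * R2_01(x[1], y[1]),
    "H<1>0 x H<2>1":  lambda x, y: R1_0(x[0], y[0]) * R2_1(x[1], y[1]),
    "H<1>1 x H<2>00": lambda x, y: R1_1(x[0], y[0]) * R2_00(x[1], y[1]),
    "H<1>1 x H<2>01": lambda x, y: R1_1(x[0], y[0]) * R2_01(x[1], y[1]),
    "H<1>1 x H<2>1":  lambda x, y: R1_1(x[0], y[0]) * R2_1(x[1], y[1]),
}

# Example: evaluate each component kernel at a pair of points x = (u, s), y = (v, t).
x, y = (2, 0.3), (2, 0.7)
for name, kernel in COMPONENTS.items():
    print(name, kernel(x, y))
```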

On other product domains, for example $[0, 1]^2$, the tensor product reproducing kernel Hilbert space can be decomposed in a similar way. More examples are available in [11], Section 2.4.

#### 2.5.1.1. General form

In general, a tensor product reproducing kernel Hilbert space can be specified as $\mathcal{H} = \oplus_j \mathcal{H}_j$, where $j \in B$ is a generic index. Suppose that $\mathcal{H}_j$ is equipped with a reproducing kernel $R_j$ and an inner product $\langle f, g \rangle_j$. Denote $f_j$ as the projection of $f$ onto $\mathcal{H}_j$. Then, an inner product in $\mathcal{H}$ can be defined as



$$J(f, g) = \sum_{j} \theta_{j}^{-1} \left\langle f_{j}, g_{j} \right\rangle_{j}, \tag{13}$$


where $\theta_j \geq 0$ are the tuning parameters. If a penalty $J$ in (2) has the form (13), the SSANOVA models can be defined on the space $\mathcal{H} = \oplus_j \mathcal{H}_j$ with the reproducing kernel:

$$R = \sum_{j} \theta_{j} R_{j}. \tag{14}$$
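As a minimal sketch of (14), assuming the component Gram matrices $R_j(\mathbf{x}_i, \mathbf{x}_k)$ have already been evaluated at the data points (the arrays below are hypothetical placeholders):

```python
import numpy as np

def combined_kernel_matrix(component_grams, thetas):
    """Return R = sum_j theta_j * R_j as in (14).
    component_grams: list of n x n Gram matrices of the components R_j;
    thetas: nonnegative weights theta_j."""
    R = np.zeros_like(component_grams[0], dtype=float)
    for theta_j, R_j in zip(thetas, component_grams):
        R += theta_j * R_j
    return R

# Toy usage with two placeholder 3 x 3 component Gram matrices.
G1 = np.eye(3)
G2 = np.full((3, 3), 0.5)
print(combined_kernel_matrix([G1, G2], thetas=[1.0, 2.0]))
```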

#### 2.6. Estimation

In this section, we show the procedure of estimating the minimizer $\hat{\eta}$ of (2) under the Gaussian assumption and selecting the smoothing parameters.

#### 2.6.1. Penalized least squares

We consider the same model as in (1); then, $\eta$ can be estimated by minimizing the penalized least squares:

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \eta(\mathbf{x}_i)\right)^2 + \lambda J(\eta). \tag{15}$$

Let $\mathbf{S}$ denote the $n \times M$ matrix with the $(i, j)$th entry $\xi_j(\mathbf{x}_i)$ as in (6) and $\mathbf{R}$ denote the $n \times n$ matrix with the $(i, j)$th entry $R(\mathbf{x}_i, \mathbf{x}_j)$ of the form (14). Then, based on Theorem 2.3, $\eta$ can be expressed as

$$
\boldsymbol{\eta} = \mathbf{S} \mathbf{d} + \mathbf{R} \mathbf{c},
$$

where $\boldsymbol{\eta} = (\eta(\mathbf{x}_1), \dots, \eta(\mathbf{x}_n))^T$, $\mathbf{d} = (d_1, \dots, d_M)^T$, and $\mathbf{c} = (c_1, \dots, c_n)^T$. The least squares term in (15) becomes

$$\frac{1}{n}\sum\_{i=1}^{n}\left(y\_i - \eta(\mathbf{x}\_i)\right)^2 = \frac{1}{n}(\mathbf{y} - \mathbf{Sd} - \mathbf{Rc})^T(\mathbf{y} - \mathbf{Sd} - \mathbf{Rc}).$$

where $\mathbf{y} = (y_1, \dots, y_n)^T$.

By the reproducing property (5), the roughness penalty term can be expressed as

$$J(\eta) = \sum\_{i=1}^{n} \sum\_{j=1}^{n} c\_i \mathbf{R}(\mathbf{x}\_i, \mathbf{x}\_j) c\_j = \mathbf{c}^T \mathbf{R} \mathbf{c}.$$

Therefore, the penalized least squares criterion (15) becomes

$$\frac{1}{n}(\mathbf{y} - \mathbf{S}\mathbf{d} - \mathbf{R}\mathbf{c})^T(\mathbf{y} - \mathbf{S}\mathbf{d} - \mathbf{R}\mathbf{c}) + \lambda\mathbf{c}^T\mathbf{R}\mathbf{c}. \tag{16}$$

The penalized least squares (16) is a quadratic form in both $\mathbf{d}$ and $\mathbf{c}$. By differentiating (16), one can obtain the linear system:

$$
\begin{pmatrix} \mathbf{S}^T \mathbf{S} & \mathbf{S}^T \mathbf{R} \\ \mathbf{R}^T \mathbf{S} & \mathbf{R}^T \mathbf{R} + n\lambda \mathbf{R} \end{pmatrix} \begin{pmatrix} \mathbf{d} \\ \mathbf{c} \end{pmatrix} = \begin{pmatrix} \mathbf{S}^T \mathbf{y} \\ \mathbf{R}^T \mathbf{y} \end{pmatrix}. \tag{17}
$$

Note that (17) applies only to the penalized least squares (15); hence, a normality assumption on the errors is needed in this case.
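To make the algebra concrete, here is a small numpy sketch that solves (17) for $\mathbf{d}$ and $\mathbf{c}$, assuming the basis matrix $\mathbf{S}$ and the kernel matrix $\mathbf{R}$ have already been evaluated at the data; the toy kernel and variable names below are ours, chosen only for illustration.

```python
import numpy as np

def fit_ssanova_coefficients(S, R, y, lam):
    """Solve the linear system (17) for d (unpenalized coefficients) and c (kernel coefficients).
    S: n x M basis matrix; R: n x n kernel matrix; y: responses; lam: smoothing parameter."""
    n, M = S.shape
    lhs = np.block([
        [S.T @ S, S.T @ R],
        [R.T @ S, R.T @ R + n * lam * R],
    ])
    rhs = np.concatenate([S.T @ y, R.T @ y])
    # The system can be ill-conditioned, so solve it in the least squares sense.
    sol, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return sol[:M], sol[M:]

# Toy usage: fitted values are eta_hat = S d + R c.
rng = np.random.default_rng(0)
n = 20
x = np.sort(rng.uniform(size=n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
S = np.column_stack([np.ones(n), x])                     # a simple two-column null-space basis
R = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)      # a generic nonnegative definite kernel
d, c = fit_ssanova_coefficients(S, R, y, lam=1e-3)
print(S @ d + R @ c)
```

In practice, the kernel matrix $\mathbf{R}$ would be built from the component kernels of Section 2.5 weighted as in (14), rather than the toy kernel used here.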

#### 2.6.2. Selection of smoothing parameters


In SSANOVA models, properly selecting the smoothing parameters is important for estimating $\eta$ [9, 27, 28], as shown in Figure 1. Here, we introduce the generalized cross validation (GCV) method for smoothing parameter selection.

For the multivariate predictors, the penalty term in (15) has the form

$$\lambda J(f) = \lambda \sum_{j=1}^{S} \theta_j^{-1} \left\langle f_j, f_j \right\rangle_{j},$$

where $S$ is the number of smoothing parameters, which is related to the functional ANOVA decomposition, and the $\theta_j$'s are the extra smoothing parameters. To avoid overparameterization, we treat $\boldsymbol{\lambda} = (\lambda/\theta_1, \cdots, \lambda/\theta_S)^T$ as the effective smoothing parameters.

A GCV score is defined as

$$V(\lambda) = \frac{n^{-1}\mathbf{y}^T(I - A(\lambda))^2\mathbf{y}}{\left[n^{-1}\text{tr}(I - A(\lambda))\right]^2},$$

where Að Þ λ is a symmetric matrix similar to the hat matrix in linear regression. We can select a proper λ by minimizing the GCV score [21].
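A brief sketch of this selection step, assuming one can compute the smoothing matrix $A(\lambda)$ that maps $\mathbf{y}$ to the fitted values (here built by brute force from the solver sketched after (17)); the grid of candidate $\lambda$ values is an arbitrary illustration.

```python
import numpy as np

def smoothing_matrix(S, R, lam, solver):
    """Brute-force A(lambda): column j holds the fitted values when y is the j-th unit vector."""
    n = S.shape[0]
    A = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        d, c = solver(S, R, e, lam)
        A[:, j] = S @ d + R @ c
    return A

def gcv_score(y, A):
    """V(lambda) = n^{-1} y^T (I - A(lambda))^2 y / [n^{-1} tr(I - A(lambda))]^2."""
    n = len(y)
    IA = np.eye(n) - A
    return (y @ (IA @ (IA @ y)) / n) / (np.trace(IA) / n) ** 2

# Usage sketch (reusing S, R, y, and fit_ssanova_coefficients from the previous example):
# lambdas = 10.0 ** np.arange(-8.0, 0.0)
# best_lam = min(lambdas, key=lambda lam: gcv_score(y, smoothing_matrix(S, R, lam, fit_ssanova_coefficients)))
```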

#### 2.7. Case study: Twitter data

Tweets in the contiguous United States were collected over five weekdays in January 2014. The dataset contains information on time, GPS location, and tweet counts (see Figure 2). To illustrate the application of SSANOVA models, we study the temporal and spatial patterns in these data.

The bivariate function $\eta(x_{\langle 1\rangle}, x_{\langle 2\rangle})$ is a function of time and location, where $x_{\langle 1\rangle}$ denotes the time and $x_{\langle 2\rangle}$ represents the longitude and latitude coordinates. We use a thin-plate spline for the spatial variable and a cubic spline for the time variable. As a rotation-free method, the thin-plate spline is popular for modeling spatial data [29–31]. For better interpretation, we decompose the function $\eta$ as

$$
\eta(\mathbf{x}\_{\langle 1\rangle}, \mathbf{x}\_{\langle 2\rangle}) = \eta\_c + \eta\_1(\mathbf{x}\_{\langle 1\rangle}) + \eta\_2(\mathbf{x}\_{\langle 2\rangle}) + \eta\_{12}(\mathbf{x}\_{\langle 1\rangle}, \mathbf{x}\_{\langle 2\rangle}),
$$

where $\eta_c$ is a constant function; $\eta_1$ and $\eta_2$ are the main effects of time and location, respectively; and $\eta_{12}$ is the spatial-time interaction effect.

Figure 2. Heatmaps of tweet counts in the contiguous United States. (a) Tweet counts at 2:00 a.m. (b) Tweet counts at 6:00 p.m.

The main effects of time and location are shown in Figure 3. In panel (a), the number of tweets exhibits a clear periodic effect, attaining its maximum at 8:00 p.m. and its minimum at 5:00 a.m. The main effect of time shows the variation of Twitter usage in the United States. In addition, we can infer from panel (b) of Figure 3 how the tweet counts vary across locations. There tend to be more tweets in the east than in the west, and more tweets in the coastal zones than inland. We use the scaled dot product

$$\pi = \left(\widehat{\eta}_{12}\right)^{T} \widehat{\mathbf{y}} / \|\widehat{\mathbf{y}}\|^{2}$$

to quantify the percentage decomposition of the sum of squares of $\widehat{\mathbf{y}}$ [11], where $\widehat{\mathbf{y}} = (\widehat{y}_1, \dots, \widehat{y}_n)^T$ is the vector of predicted values of $\log(\#\text{Tweets})$ and $\widehat{\eta}_{12} = (\widehat{\eta}_{12}(\mathbf{x}_1), \dots, \widehat{\eta}_{12}(\mathbf{x}_n))^T$ is the estimated interaction effect term, with $\eta_{12}(\mathbf{x}) = \eta_{12}(x_{\langle 1\rangle}, x_{\langle 2\rangle})$. In our fitted model, $\pi = 3 \times 10^{-16}$, which is so small that the interaction term is negligible. This indicates that the temporal pattern of Twitter usage does not differ significantly across locations in the contiguous United States.

Figure 3. (a) The main effect function of time (hours). (b) The main effect function of location.
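As a toy illustration of this diagnostic, the snippet below computes $\pi$ from a fitted interaction component and fitted values; the arrays are random placeholders standing in for $\widehat{\eta}_{12}(\mathbf{x}_i)$ and $\widehat{y}_i$ from an actual SSANOVA fit.

```python
import numpy as np

def interaction_share(eta12_hat, y_hat):
    """pi = eta12_hat^T y_hat / ||y_hat||^2: the share of the fitted sum of squares
    attributed to the interaction component."""
    return float(eta12_hat @ y_hat) / float(y_hat @ y_hat)

# Placeholders for the fitted interaction term and the fitted log(#Tweets) values.
rng = np.random.default_rng(1)
y_hat = rng.standard_normal(100)
eta12_hat = 1e-8 * rng.standard_normal(100)   # a nearly negligible interaction, as in the case study
print(interaction_share(eta12_hat, y_hat))    # close to 0, so the interaction term can be dropped
```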

### 3. Efficient approximation algorithm in massive datasets

In this section, we consider SSANOVA models under big data settings. The computational cost of solving (17) is of the order $O(n^3)$, which poses a challenge for applying SSANOVA models when the volume of data grows. To reduce the computational load, an obvious way is to select a subset of the basis functions randomly. However, it is hard to retain the data features by uniform sampling. In the following section, we present an adaptive basis selection method and show its advantages over uniform sampling [14]. Instead of selecting basis functions, another approach to reducing the computational cost is shrinking the original sample size by a rounding algorithm [15].

#### 3.1. Adaptive basis selection

A natural way to select the basis functions is through uniform sampling. Suppose that we randomly select a subset $\mathbf{x}^* = \{\mathbf{x}^*_1, \dots, \mathbf{x}^*_{n^*}\}$ from $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, where $n^*$ is the subsample size. Thus, the kernel matrix would be $R(\mathbf{x}^*_i, \mathbf{x})$, $i = 1, \dots, n^*$. Then, one minimizes (17) in the effective model space

$$\mathcal{H}_E = \mathcal{N} \oplus \mathrm{span}\left\{ R(\mathbf{x}^*_i, \cdot),\ i = 1, 2, \dots, n^* \right\}.$$

The computational cost is reduced significantly to $O(n n^{*2})$ if $n^*$ is much smaller than $n$. Furthermore, it can be proven that the minimizer of (2) obtained by uniform sampling basis selection has the same asymptotic convergence rate as the full basis minimizer $\widehat{\eta}$.

Although uniform basis selection reduces the computational cost and the corresponding estimator achieves the optimal asymptotic convergence rate, it may occasionally fail to retain the data features. For example, when the data are not evenly distributed, it is hard for uniform sampling to capture the features of regions containing only a few data points. In [14], an adaptive basis selection method is proposed. The main idea is to sample more basis functions where the response function changes rapidly and fewer basis functions on flat regions. The details of the adaptive basis selection method are given in the following procedure:

Step 1. Divide the range of the responses $\{y_i\}_{i=1}^{n}$ into $K$ disjoint intervals, $S_1, \dots, S_K$. Denote $|S_k|$ as the number of observations in $S_k$.

Step 2. For each $S_k$, draw a random sample of size $n_k$ from this collection. Let $\mathbf{x}^{*(k)} = \{\mathbf{x}^{*(k)}_1, \dots, \mathbf{x}^{*(k)}_{n_k}\}$ be the corresponding predictor values.

Step 3. Combine $\mathbf{x}^{*(1)}, \dots, \mathbf{x}^{*(K)}$ together to form the set of sampled predictor values $\mathbf{x}^* = \{\mathbf{x}^*_1, \dots, \mathbf{x}^*_{n^*}\}$, where $n^* = \sum_{k=1}^{K} n_k$.
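The three steps above amount to a stratified subsample of the predictors, stratified by the response values. Below is a compact sketch under those assumptions; the equal per-stratum allocation of $n_k$ and all variable names are our own illustrative choices, not prescriptions from [14].

```python
import numpy as np

def adaptive_basis_sample(x, y, K, n_star, seed=None):
    """Adaptive basis selection as in Steps 1-3: stratify on the response, then sample
    predictors within each stratum.
    x: (n, d) array of predictors; y: (n,) responses; n_star: target subsample size."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(y.min(), y.max(), K + 1)
    strata = np.clip(np.digitize(y, edges[1:-1]), 0, K - 1)      # Step 1: interval index of each y_i
    n_k = int(np.ceil(n_star / K))                               # equal allocation: one simple choice of n_k
    chosen = []
    for k in range(K):
        idx = np.flatnonzero(strata == k)
        if idx.size:                                             # Step 2: sample within S_k
            chosen.append(rng.choice(idx, size=min(n_k, idx.size), replace=False))
    return x[np.concatenate(chosen)]                             # Step 3: combined x*

# Toy usage: the sampled rows then index the kernel sections R(x_i^*, .) spanning the effective space H_E.
rng = np.random.default_rng(2)
x = rng.uniform(size=(1000, 2))
y = np.sin(6 * x[:, 0]) + 0.1 * rng.standard_normal(1000)
x_star = adaptive_basis_sample(x, y, K=10, n_star=50)
print(x_star.shape)
```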

