## 2. Individual forecasting models and combination methods

In this section, we briefly introduce the notation, formulation, and estimation methods of the individual forecasting models; we then introduce and discuss the various combining methods.

#### 2.1. Individual forecasting models

#### 2.1.1. The dynamic factor model and the estimation of factors

This subsection describes the dynamic factor model (DFM), which is used to extract common components from a large set of variables; these common components are then used to predict the variables of interest.

Suppose that we have a panel of observations: let $X_t$ denote the $N$ stationary time series variables observed at times $t = 1, \dots, T$, where the series are assumed to have zero mean. The factor model assumes that most of the variation in the dataset can be explained by a small number $q \ll N$ of factors collected in the vector $f_t$. We can express the dynamic factor model representation as follows:

$$X\_t = \chi\_t + \xi\_t = \lambda(L)' f\_t + \xi\_t \tag{1}$$

where $\chi_t$ denotes the common components driven by the factors $f_t$, and $\xi_t$ denotes the idiosyncratic component of each variable, that is, the portion of $X_t$ that cannot be explained by the common components. $\chi_t$ is a function of the $q \times 1$ vector $\lambda(L)' f_t$; the operator $\lambda(L) = 1 + \lambda_1 L + \dots + \lambda_s L^s$ is a lag polynomial with positive powers of the lag operator $L$, where $L f_t = f_{t-1}$. The static representation of the model can be rewritten as

$$X\_t = \Lambda^\prime F\_t + \xi\_t \tag{2}$$

where $F_t$ is a vector of $r \ge q$ static factors, comprising the dynamic factors $f_t$ and all lags of the factors. There are three different methods of estimating the factors in $F_t$ from a dataset; these methods were developed by Stock and Watson [39] (hereafter SW), [30], and Forni, Hallin, Lippi, and Reichlin [20] (hereafter FHLR)<sup>1</sup>. In the current chapter, we employ the estimation method developed by FHLR; for more details of dynamic factor model estimation, see Babikir and Mwambi [2]. The estimated factors are then used to forecast the variables of interest. The forecasting model is specified and estimated as a linear projection of an $h$-step-ahead transformed variable $y_{t+h}$ onto $t$-dated dynamic factors. The forecasting model follows the setup in [3, 21, 41] with the form

$$y\_{t+h} = \beta(L)\hat{f}\_t + \gamma(L)y\_t + u\_{t+h} \tag{3}$$

where $\hat{f}_t$ represents the dynamic factors estimated using the FHLR method, $\beta(L)$ and $\gamma(L)$ are lag polynomials whose orders are determined by the Schwarz information criterion (SIC), and $u_{t+h}$ is an error term. The coefficient matrices for the factors and the autoregressive terms are estimated by ordinary least squares (OLS) for each forecasting horizon $h$. To obtain the estimate and forecast of the AR benchmark, we impose the restriction $\beta(L) = 0$ in Eq. (3).

<sup>1</sup> For further technical details on this type of factor model, see [35].
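To make these steps concrete, the following Python sketch illustrates the forecasting regression in Eq. (3). It uses static principal components as a simple stand-in for the FHLR dynamic factor estimates actually employed in the chapter, fixes the lag orders rather than selecting them by SIC, and all function and variable names are illustrative rather than taken from any particular library:

```python
import numpy as np

def estimate_factors_pca(X, q):
    """Extract q factors from a T x N panel X via principal components --
    a simple stand-in for the FHLR estimator used in the text."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # zero mean, unit variance
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:q].T                             # T x q factor estimates

def factor_forecast(y, F, h, p_y=2, p_f=2):
    """OLS projection of y_{t+h} on t-dated factors and lags of y (Eq. 3).
    In the chapter the lag orders are chosen by SIC; here they are fixed."""
    T = len(y)
    rows, target = [], []
    start = max(p_y, p_f, 1) - 1
    for t in range(start, T - h):
        reg = [1.0] + [y[t - i] for i in range(p_y)]
        for i in range(p_f):
            reg.extend(F[t - i])                    # t-dated factors and lags
        rows.append(reg)
        target.append(y[t + h])                     # h-step-ahead target
    coefs, *_ = np.linalg.lstsq(np.array(rows), np.array(target), rcond=None)
    last = [1.0] + [y[T - 1 - i] for i in range(p_y)]
    for i in range(p_f):
        last.extend(F[T - 1 - i])
    return float(np.array(last) @ coefs)            # forecast of y_{T-1+h}
```

Setting `p_f = 0` drops the factor block from the regression and so reproduces the AR benchmark obtained by imposing $\beta(L) = 0$.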

#### 2.1.2. The artificial neural network model


The ANN is one of the most popular and successful biologically inspired forecasting methods; it emulates the structure of the human brain, and ANNs have therefore gradually gained immense importance in forecasting, among other fields. The ANN model is one of the generalized nonlinear nonparametric models (GNLNPMs). Compared to traditional econometric models, the advantage of ANNs is that they can handle complex, nonlinear relationships without any prior assumptions about the underlying data-generating process (see [28]; Figure 1).

These properties make the ANN model an attractive alternative to traditional forecasting models. Most importantly, ANN models address the limitations of traditional forecasting methods, including misspecification, bias from outliers, and the assumption of linearity [27]. One of the most widely used ANN structures in time series forecasting problems is the multilayer perceptron (MLP). An MLP is basically a feedforward architecture of an input layer, one or more hidden layers, and an output layer. The network structure illustrated in this chapter is a feedforward network with a linear activation function at the output neuron. Basically, the input nodes are connected forward to all nodes in the hidden layer, and these hidden nodes are joined to the single node in the output layer, as shown in Figure 1. The inputs in this model serve as the independent variables in a multiple regression model and are joined, through the hidden layer, to the output node, which plays the role of the dependent variable. We follow [33] in describing the network model. Thus, the model can be specified as follows:

$$n\_{k,t} = w\_0 + \sum\_{i=1}^{p} w\_i y\_{t-i} + \sum\_{j=1}^{l} \phi\_j N\_{t-1,j} \tag{4}$$

$$N\_{k,t} = f(n\_{k,t}) \tag{5}$$

$$y\_t = \alpha\_{i,0} + \sum\_{k=1}^{K} \alpha\_{i,k} N\_{k,t} + \sum\_{i=1}^{p} \beta\_i y\_{t-i} \tag{6}$$

Figure 1. A p × h × 1 structure of a feedforward neural network.

where the inputs $y_{t-i}$ represent the lagged values of the variable of interest and the output $y_t$ is their forecast. The $w_0$ and $\alpha_{i,0}$ are the biases, and $w_i$ and $\alpha_{i,k}$ denote the weights that link the inputs to the hidden layer and the hidden layer to the output, respectively. The $\phi_j$ and $\beta_i$ connect the input to the output via the hidden layer. The $p$ independent variables are combined linearly to form $K$ neurons, which are then combined linearly to produce the prediction or output. Eqs. (4)–(6) link the inputs $y_{t-i}$ to the output $y_t$ through the hidden layer. The function $f$ is a logistic function, meaning that $N_{k,t} = f(n_{k,t}) = \frac{1}{1 + e^{-n_{k,t}}}$. The second summation in Eq. (6) shows that we also have a jump connection or skip-layer network that directly links the inputs $y_{t-i}$ to the output $y_t$. The beauty of this ANN structure is that the model combines the true linear model and the nonlinear feedforward neural network. So, if the association between inputs and output is truly linear, the skip-layer coefficient set $\beta$ should be significant; in contrast, if the association is nonlinear in nature, the jump-connection coefficients $\beta$ should be insignificant, while the coefficient sets $w$ and $\alpha$ should be highly significant. Naturally, if the association between input and output is mixed, then we expect all coefficient sets to be significant. For the best network selection in this chapter, besides the minimum error, we use the Bayesian information criterion (BIC), which is usually preferred over the other three criteria because it penalizes extra parameters more severely; mathematically, the BIC is given by the following, as described in [31]:

$$BIC = N\_{p,h} + N\_{p,h} \ln\left(n\right) + n \ln\left(\frac{S(W)}{n}\right) \tag{7}$$

where $N_{p,h} = h(p+2) + 1$ is the total number of parameters in the network, $n = N_{train} - p$ is the number of effective observations, $N_{train}$ is the number of in-sample observations, $S(W)$ is the network misfit function, and $W$ is the space of all weights and biases in the network. The in-sample sum of squared errors (SSE) is usually used as the function $S(W)$. Eventually, the optimal model is the one with the minimum BIC value.
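To fix ideas, here is a minimal sketch of the forward pass of Eqs. (4)–(6) and the BIC of Eq. (7). The weights are taken as given rather than trained, the recurrent $\phi$-term of Eq. (4) is omitted for simplicity, and all names are illustrative:

```python
import numpy as np

def logistic(n):
    """Eq. (5): logistic activation N_{k,t} = 1 / (1 + exp(-n_{k,t}))."""
    return 1.0 / (1.0 + np.exp(-n))

def skip_layer_output(y_lags, w0, w, alpha0, alpha, beta):
    """One output of the p-h-1 jump-connection network, Eqs. (4)-(6),
    omitting the recurrent phi-term of Eq. (4).
    y_lags : (p,)  lagged inputs y_{t-1}, ..., y_{t-p}
    w0, w  : (h,) hidden biases and (h, p) input-to-hidden weights
    alpha0, alpha : output bias and (h,) hidden-to-output weights
    beta   : (p,)  skip-layer (direct input-to-output) weights"""
    n = w0 + w @ y_lags                         # Eq. (4): hidden-node inputs
    N = logistic(n)                             # Eq. (5): hidden activations
    return alpha0 + alpha @ N + beta @ y_lags   # Eq. (6): output + skip layer

def network_bic(sse, p, h, n_train):
    """Eq. (7) with N_{p,h} = h(p+2)+1 parameters and n = n_train - p
    effective observations; sse plays the role of S(W)."""
    n_params = h * (p + 2) + 1
    n_eff = n_train - p
    return n_params + n_params * np.log(n_eff) + n_eff * np.log(sse / n_eff)
```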

#### 2.1.3. Factor-augmented artificial neural networks (FAANN)

The FAANN model is a hybrid of the artificial neural network and the factor model; it combines the information in the factors with the lagged values of the variable to be forecasted in order to obtain more accurate forecasts. The FAANN model is formulated as a nonlinear function of the series, its lags, and the factors, defined as follows:

$$y\_t = f\left[\left(y\_{t-1}, y\_{t-2}, \dots, y\_{t-p}\right), \left(F\_1, F\_2, F\_3, F\_4, F\_5\right)\right] \tag{8}$$

where $f$ is the nonlinear functional form determined via the ANN. In the first stage, the factor model is used to extract factors from a large related dataset. In the second stage, a neural network model is used to capture the nonlinear and linear relationships existing in the factors and the original data. Thus, based on the model structure depicted in Figure 2,

$$y\_{t+h} = \alpha\_0 + \sum\_{j=1}^{h} \alpha\_j \text{g}\left(\beta\_{0j} + \sum\_{i=1}^{p} \beta\_{ij} y\_{t-i} + \sum\_{i=p+1}^{p+5} \beta\_{ij} F\_{t,i}\right) + \varepsilon\_t \tag{9}$$


Figure 2. The FAANN model architecture (N(p + 5, h, 1)).


As previously noted, the $\alpha_j$ ($j = 0, 1, \dots, h$) and $\beta_{ij}$ ($i = 0, 1, \dots, p$; $j = 1, 2, \dots, h$) are the parameters of the model, called the connection weights. As stated earlier, $p$ and $h$ are the numbers of input and hidden nodes, respectively, and $\varepsilon_t$ is the error term. Figure 2 shows the FAANN model structure used.
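A minimal sketch of the forward pass in Eq. (9), under the same caveats as before (weights taken as given rather than trained, names illustrative): the FAANN simply augments the $p$ lagged inputs with the five estimated factors, giving an N($p$ + 5, $h$, 1) network:

```python
import numpy as np

def faann_forecast(y_lags, factors, alpha0, alpha, beta0, beta):
    """Forward pass of Eq. (9): an N(p+5, h, 1) network whose inputs are
    the p lags of the target plus the 5 estimated factors F_{t,1..5}.
    y_lags : (p,) lagged values; factors : (5,) factor estimates at t
    beta0  : (h,) hidden biases; beta : (h, p+5) hidden weights
    alpha0 : output bias; alpha : (h,) hidden-to-output weights"""
    x = np.concatenate([y_lags, factors])            # extended input vector
    g = 1.0 / (1.0 + np.exp(-(beta0 + beta @ x)))    # logistic hidden layer
    return alpha0 + alpha @ g                        # error term omitted
```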

#### 2.2. Forecast combining methods

To combine the individual forecasts produced by the DFM and ANN models, we use four combination methods: three linear combining methods (the mean, VACO, and discounted MSFE-based methods) and one nonlinear combining method (ANN). Since some of the combining methods need a holdout period to calculate the weights used to combine the individual forecasts, we use the first 24 months of the out-of-sample period as holdout observations. For all combining methods, we form combination forecasts over the post-holdout out-of-sample period. Brief details about these combining methods are given below.
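As a concrete illustration of this timing convention, the short sketch below (names are illustrative) separates the first 24 out-of-sample months, on which combination weights are computed, from the post-holdout months on which the combined forecasts are evaluated:

```python
def split_holdout(forecasts, actuals, holdout=24):
    """forecasts: T_oos x m array of individual out-of-sample forecasts;
    actuals: length-T_oos array of realized values. The first `holdout`
    months serve only to estimate combination weights; combined forecasts
    are formed and evaluated on the remaining months."""
    return ((forecasts[:holdout], actuals[:holdout]),   # weight estimation
            (forecasts[holdout:], actuals[holdout:]))   # evaluation
```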

#### 2.2.1. Mean combination method

The mean serves as a convenient benchmark, as it has been shown to achieve better results than more sophisticated methods; for instance, see [10, 21, 32]. Compared to single forecasts, the performance of the simple average combination method is found to be superior (see [18]). The simple average combination method can be expressed as

$$
\widehat{y}\_t^c = \sum\_{i=1}^m w\_i \widehat{y}\_t^i \tag{10}
$$

where $\widehat{y}_t^c$ is the combined forecast at time $t$, $\widehat{y}_t^i$ is the forecast from the $i$th individual forecasting model, $w_i = \frac{1}{m}$ is the weight of individual model $i$, and $m$ is the number of individual models. There are different forms of weights, but in general the weights have to satisfy the condition $\sum_{i=1}^{m} w_i = 1$.
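In code, Eq. (10) with equal weights $w_i = 1/m$ reduces to a row mean; a minimal sketch, assuming the individual forecasts are stacked as the columns of a $T \times m$ matrix:

```python
import numpy as np

def mean_combination(forecasts):
    """Eq. (10) with w_i = 1/m: average the m individual forecasts
    (the columns of the T x m matrix) at each date."""
    return np.asarray(forecasts).mean(axis=1)
```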

#### 2.2.2. Variance-covariance (VACO) combination method

This method uses the historical performance of the individual forecasts to compute the weights. According to the VACO method, the weights are determined as follows:

$$w\_i = \frac{\left[\sum\_{j=1}^{T} \left(y\_j - \widehat{y}\_j^i\right)^2\right]^{-1}}{\sum\_{i=1}^{m} \left[\sum\_{j=1}^{T} \left(y\_j - \widehat{y}\_j^i\right)^2\right]^{-1}} \tag{11}$$

Then, the combined forecast is given by $\widehat{y}_t^c = \sum_{i=1}^{m} w_i \widehat{y}_t^i$, where $y_j$ is the $j$th actual value, $\widehat{y}_j^i$ is the $j$th forecast from the $i$th individual forecasting model, and $T$ is the total number of out-of-sample points. The weight in Eq. (11) has the inverse of the sum of squared deviations of model $i$ as its numerator, while the denominator is the sum of these inverse contributions over all models. This guarantees that $\sum_{i=1}^{m} w_i = 1$.
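A minimal sketch of Eq. (11), computing the VACO weights from holdout-period errors (array names are illustrative):

```python
import numpy as np

def vaco_weights(actuals, forecasts):
    """Eq. (11): weight of each model is the inverse of its sum of squared
    holdout errors, normalized so the weights sum to one.
    actuals : (T,) realized values; forecasts : (T, m) individual forecasts."""
    inv_sse = 1.0 / ((actuals[:, None] - forecasts) ** 2).sum(axis=0)
    return inv_sse / inv_sse.sum()
```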

#### 2.2.3. Discounted mean square forecast error (DMSFE) combination method

The DMSFE method weights recent forecasts more heavily than distant ones. [32] suggest that the weights can be calculated as

$$w\_i = \frac{\left[\sum\_{j=1}^T \delta^{T-j+1} \left(y\_j - \hat{y}\_j^i\right)^2\right]^{-1}}{\sum\_{i=1}^m \left[\sum\_{j=1}^T \delta^{T-j+1} \left(y\_j - \hat{y}\_j^i\right)^2\right]^{-1}} \tag{12}$$

where $\delta$ is the discount factor with $0 < \delta \le 1$. If $\delta = 1$, the DMSFE and VACO methods coincide, which means that VACO is a special case of DMSFE. Note that, as mentioned above, the sum of all weights equals one.
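The discounted analogue of the previous sketch implements Eq. (12); setting `delta = 1` reproduces the VACO weights, and the default value below is only an illustrative choice:

```python
import numpy as np

def dmsfe_weights(actuals, forecasts, delta=0.95):
    """Eq. (12): squared errors are discounted by delta^(T-j+1), so recent
    errors count more; delta = 1 reproduces the VACO weights of Eq. (11)."""
    T = len(actuals)
    disc = delta ** (T - np.arange(1, T + 1) + 1)        # delta^{T-j+1}
    sq_err = (actuals[:, None] - forecasts) ** 2
    inv = 1.0 / (disc[:, None] * sq_err).sum(axis=0)
    return inv / inv.sum()
```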

#### 2.2.4. Artificial neural network (ANN) combination method

Linearity of the combination of individual forecasts is the cornerstone of the linear combining methods; however, if the individual forecasts are produced by nonlinear methods, or if the true relationship is nonlinear, such linear combinations may be insufficient. For the success of the ANN as a combination method over the linear methods, see, among others, [15, 25]. Here, we use the same setup as in subsection 2.1.2; the output $\widehat{y}_t^c$ of the combined forecasts can be given by

$$
\widehat{y}\_t^c = \alpha\_{i,0} + \sum\_{k=1}^K \alpha\_{i,k} \mathbf{N}\_{k,t} + \sum\_{i=1}^m \beta\_i \widehat{y}\_t^i \tag{13}
$$

where $\widehat{y}_t^i$ is the forecast from the $i$th individual forecasting model.
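A sketch of Eq. (13) under the same caveats as in subsection 2.1.2: the combined forecast is the output of the skip-layer network with the $m$ individual forecasts as inputs, with weights that would in practice be trained on the holdout block:

```python
import numpy as np

def ann_combination(indiv_forecasts, w0, w, alpha0, alpha, beta):
    """Eq. (13): nonlinear combination of the m individual forecasts using
    the skip-layer structure of subsection 2.1.2, with the forecasts as
    inputs. indiv_forecasts : (m,) forecasts for one date; weight shapes
    follow the earlier sketch with p replaced by m."""
    N = 1.0 / (1.0 + np.exp(-(w0 + w @ indiv_forecasts)))   # hidden layer
    return alpha0 + alpha @ N + beta @ indiv_forecasts      # skip layer
```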
