### *2.3.1 ANN learning process*

The learning process of an ANN consists of modeling past observations with the objective of estimating the underlying temporal relationships [25]. An artificial neural network learns through a procedure called backpropagation, which makes the network update its parameters after each training epoch. In detail, it evaluates the gradient of an error function by propagating the errors backwards through the network. The resulting derivatives are then used to compute new values for the network's weights. These adjustments aim to minimize the objective error function, and can lead to significant improvements in its optimization [5, 30, 33].

In deep learning solutions, when a model converges to a local minimum, that result is accepted, since the loss function is approximately minimized [3]. This characteristic makes an artificial neural network an approximator of almost any objective function.

Most deep learning optimization algorithms are based on stochastic gradient descent (SGD). SGD is an optimization algorithm that aims to minimize (or maximize) an objective function, in this research an error function, also called a loss function [3].

When the algorithm operates on a training set of examples, it follows the estimated gradient downhill towards a local or global minimum of the objective function [3, 30].
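To make the procedure concrete, the following minimal numpy sketch performs one SGD step for a linear model on a squared-error loss; all names and values are illustrative, not taken from this research:

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One stochastic gradient descent step for a linear model on one example."""
    y_hat = np.dot(w, x)            # forward pass: prediction
    grad = 2 * (y_hat - y) * x      # gradient of the squared error (y_hat - y)^2 w.r.t. w
    return w - lr * grad            # step downhill along the estimated gradient

w = np.zeros(3)
w = sgd_step(w, x=np.array([1.0, 2.0, 3.0]), y=4.0)
```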

In the predictions produced by a neural network there is always an element of randomness. Therefore the network is trained multiple times, where each training cycle is called an epoch. An epoch can be defined as one pass through the training set, where a pass includes both a forward and a backward pass through the neural network; during one epoch the network is trained on all the training data for one cycle [22, 34]. The number of epochs denotes how many such passes were required to best train the model. After a fixed number of epochs, training of the network stops, and the average or best result over all epochs becomes the resulting output of the neural network [6].

In the time series forecasting domain, different deep learning models can be applied. The most commonly used deep learning models in time series forecasting are the multilayer perceptron (MLP), the recurrent neural network (RNN), and the convolutional neural network (CNN) [31]. For this regression-type problem, RNN and CNN networks are promising solutions [10].

### *2.3.2 Multilayer perceptron*

A multilayer perceptron is a relatively simple artificial neural network that is used to approximate a mapping function from input variables to output variables. The network is more commonly known as a "feedforward neural network". It can be applied as a deep learning model to almost any time series forecasting problem, since the network is robust to noise in the input data. It does not make strong assumptions about the mapping function, and is capable of learning complex and high-dimensional mappings, both linear and nonlinear [35]. MLPs are memory-less and unidirectional, with neurons grouped in two or more layers [36], and use the feedforward neural network architecture with backpropagation [20]. The network aims to generalize over data samples, so that it produces sensible outputs beyond what is known by the model itself. It can therefore make accurate and valuable forecasts.

One key limitation is that an MLP has to specify the temporal dependence upfront, during the design of the model [4].
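As an illustration of such a fixed temporal dependence, the sketch below defines a small Keras MLP whose input window is set upfront to seven lagged observations; the layer sizes and window length are assumptions, not the configuration of this research:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

mlp = Sequential([
    Dense(64, activation="relu", input_shape=(7,)),  # temporal dependence fixed upfront: 7 lags
    Dense(32, activation="relu"),
    Dense(1),                                        # one-step-ahead forecast
])
mlp.compile(optimizer="adam", loss="mse")
```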

### *2.3.3 Recurrent neural networks*

Recurrent neural networks (RNNs) are a type of ANN designed to process sequential data by maintaining an internal state across time steps.


In the RNN model architecture, both a mapping from inputs to outputs and the context from the input sequence useful for that mapping are learned [4]. Each RNN cell contains an internal memory state that serves as a summary of past information, and it is repeatedly updated with new observations at every time step [19].

Besides their overall better performance, flexibility, and improved memory capabilities compared to the MLP, RNNs are computationally more expensive; the overall process takes significant computation time [10, 17]. Standard RNNs also have difficulty learning long-term dependencies [26], and can make poor forecasts on longer sequences because of the vanishing gradient problem. This means that standard RNNs are not capable of carrying long-term dependencies [5, 22].

There are two specific variants of the recurrent neural network, namely the long short-term memory (LSTM) and the gated recurrent unit (GRU) networks [22].

An LSTM network model has special LSTM units that are composed of cells, each having an input gate, an output gate, and a forget gate [31]. The input gate and forget gate determine how much of the past information is retained in the current cell state for each LSTM cell, and how much of the current information is propagated forward [22].

The model learns a function that maps a sequence of past observations as input to an output observation. It reads one time step of the sequence at a time and builds up an internal state representation that can be used as a learned context for making predictions [4]. It is a special RNN variant, since the model is able to learn long-term dependencies [8, 22], by replacing the hidden layers of an RNN with memory cells. Each cell in the LSTM network remembers the desired values over arbitrary time intervals [31]. Furthermore, it is able to overcome the most common limitation of standard RNNs, the vanishing gradient problem [2, 10, 19, 26].

Another RNN variant is the gated recurrent unit (GRU), a more recent development first introduced in 2014. It is an artificial neural network that uses two gates, an update gate and a reset gate. Each gate is a vector that decides what information should be passed to the output. The update gate decides how much of the last memory to keep, while the reset gate defines how to combine the new input with the previous memory [8]. The GRU is on average not the most successful model in forecasting, but it is less complicated to build, and computations made by a GRU are faster than those of the LSTM. It also often shows competitive performance compared to ARIMA and RNN models.
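A hedged Keras sketch of both recurrent variants is shown below; the unit counts and the 7-step input window are illustrative assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense

# LSTM: memory cells with input, output, and forget gates
lstm = Sequential([LSTM(50, input_shape=(7, 1)), Dense(1)])

# GRU: the lighter variant with update and reset gates
gru = Sequential([GRU(50, input_shape=(7, 1)), Dense(1)])

for model in (lstm, gru):
    model.compile(optimizer="adam", loss="mse")
```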


### *2.3.4 Convolutional neural network*

A convolutional neural network (CNN) is a specialized kind of neural network for processing data that has a known grid-like topology. It differs from other neural networks in that it uses convolutions instead of matrix multiplications in at least one layer. The input of a convolutional layer can be a matrix or a sequence. Typical CNNs also do not need medium or large sized datasets to perform well, require only a small set of parameters, and are able to make connections even when data is sparse. Time series data can be considered a type of 1-D grid, taking samples at regular time intervals [3].

A convolutional neural network combines three architectural ideas: local receptive fields, shared weights, and spatial or temporal subsampling [35]. In a CNN, a sequence of observations can be treated like a one-dimensional (1-D) image, which the CNN can read and distill into its most pertinent elements. In a 1-D CNN, the network uses inputs within its local receptive field to make forecasts [19]. CNNs support both univariate and multivariate input data and support efficient feature learning [4].

Layers in a typical CNN model are the convolutional layer, the hidden layer, a pooling layer, a flatten layer, and a dense layer [4]. The convolutional layer has the ability to extract useful knowledge. The pooling layer distills and subsamples the output of the convolutional layer to its most salient elements [10], thereby reducing the size of the convolved feature, in this research the input sequence. A flatten layer is implemented between the convolutional and dense layers to reduce the feature maps to a single one-dimensional vector. The dense layer is a fully connected layer, similar to an MLP, which at the final stage of the CNN interprets the features extracted by the convolutional part of the model [3, 37]. **Figure 1** illustrates the one-dimensional sequential CNN architecture, and how input data is transformed by the convolutional operations into a certain output.
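The layer stack just described can be sketched in Keras as follows; the filter count, kernel size, input window, and output horizon are illustrative placeholders, not the tuned values of this research:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

cnn = Sequential([
    Conv1D(filters=64, kernel_size=2, activation="relu", input_shape=(7, 1)),  # extract features
    MaxPooling1D(pool_size=2),   # distill the convolved output to its most salient elements
    Flatten(),                   # reduce the feature maps to a single 1-D vector
    Dense(50, activation="relu"),
    Dense(3),                    # multi-step vector output, e.g. a 3-day forecast
])
cnn.compile(optimizer="adam", loss="mse")
```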

A convolutional neural network has a kernel, which can be considered a tiny window. The kernel slides over the input sequence or matrix and applies the convolution operation on each subregion, called a patch, that the kernel meets across the input data. It functions as a filter that extracts features from any 1-D sequence or higher-dimensional image. This results in a convolved matrix, which is more useful than the original features of the input data, and often improves modeling performance [10].
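In plain numpy, the sliding kernel amounts to the short sketch below (a valid cross-correlation, which is the operation deep learning frameworks actually compute under the name convolution):

```python
import numpy as np

def conv1d(sequence, kernel):
    """Slide the kernel over the sequence and take a dot product on each patch."""
    n = len(sequence) - len(kernel) + 1
    return np.array([np.dot(sequence[i:i + len(kernel)], kernel) for i in range(n)])

print(conv1d(np.array([1, 2, 3, 4, 5]), np.array([0.5, 0.5])))  # -> [1.5 2.5 3.5 4.5]
```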

In training a convolutional network, a forward pass executes the entire network, from the initial layer to the final dense layer. The loss is then calculated, and during the backward pass the backpropagation procedure takes place by computing local gradients for each CNN gate, δoutput/δinput for each input/output combination, also with the use of convolutions. As in the forward phase, one matrix slides over the other, which results in the computation of a local gradient. These local gradients are then combined with the use of the chain rule, which propagates all the gradients back through the convolutional network [38]. Afterwards, the network's parameters in each layer are updated [37].

**Figure 1.**
*1-D sequential CNN model architecture.*

A CNN can be very effective at automatically extracting and learning features from one-dimensional sequence data, such as univariate time series data, and can directly output a multi-step forecast vector [4]. Pooling operations in the network can also significantly reduce the number of required network parameters, and make the model more robust [10]. This can result in faster training and less overfitting on the training data [37].

### **2.4 Automatic machine learning**

Automatic machine learning (AutoML) is a research area in the AI field that focuses on the automatic optimization of ML and CI hyperparameters, stages, and pipelines [39]. This results in the further development of methods that allow complex data preparation, feature extraction, and CI modeling in fewer lines of code, without the need to build whole machine learning and data science frameworks from scratch [32].

It therefore becomes easier for novices in machine learning to build competitive models, and for machine learning experts to build complex models faster [29]. This is because two main barriers, structured programming and higher mathematics, are bypassed by the progress made in AutoML. Examples of AutoML applications used in this research are hyperparameter tuning of the ARIMA model, and feature engineering in the pre-modeling phase.
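For the ARIMA tuning case, the pmdarima package used later in this research offers such an AutoML-style search. A sketch is given below, where `train` stands for an assumed pandas Series of daily case counts and the search bounds are illustrative:

```python
import pmdarima as pm

# `train` is assumed to be a pandas Series of daily case counts
auto_model = pm.auto_arima(train,
                           start_p=1, max_p=5, start_q=1, max_q=5,
                           d=None,           # let the routine test for the differencing order
                           seasonal=False, stepwise=True, trace=True)
print(auto_model.order)  # the selected (p, d, q)
```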

## **3. Methodology**

The objective of time series forecasting is the development of one or more mathematical or advanced deep learning models that can explain the observed behavior of a time series, and possibly forecast future states of the series.

The actual time series research was subdivided into the following tasks:

- exploratory data analysis;
- data wrangling;
- time series analysis;
- model construction and implementation.

### **3.1 Exploratory data analysis**

Exploratory data analysis can be considered the set of techniques that try to maximize insight into the data, uncover its underlying structure, and extract important variables and features [7]. For time series analysis, plotting the data is very useful to understand the underlying data and its characteristics. When plotting time series data, there are always two variables: the time scale on the x-axis, and the numerical variable on the y-axis. The most commonly used plots in time series data analysis are run sequence plots, lag plots, autocorrelation plots, partial autocorrelation plots, histograms, and box plots. These time series plots can determine what models would be appropriate to model the time series.

### *3.1.1 Run sequence plot*

The run sequence plot is, in time series analysis, another name for the line plot. It shows the development of the coronavirus infections over time in line graph format. In this particular plot, the 7-day moving average per day is also included. **Figure 2** shows the development of the number of COVID-19 cases over the period April 2020 to April 2021.
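A minimal matplotlib sketch of such a run sequence plot, assuming the daily cases are loaded into a pandas Series from a hypothetical CSV file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical file and layout: a datetime index and one column of daily cases
series = pd.read_csv("covid_daily_cases.csv", index_col=0, parse_dates=True).squeeze()

ax = series.plot(label="daily cases")
series.rolling(window=7).mean().plot(ax=ax, label="7-day moving average")
ax.legend()
plt.show()
```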

### *3.1.2 Lag plot*

A lag plot can check for randomness in data. If data is random, then it should not show any identifiable structure in the lag plot, such as linearity [7]. Plotting a lag plot can be very efficient, since it quickly shows whether time series data is random or not. If the data is not random, then a random walk model for forecasting would not be appropriate.
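Pandas produces this plot in a single call; a roughly linear point cloud indicates non-random, forecastable data (`series` as in the earlier sketch):

```python
from pandas.plotting import lag_plot
import matplotlib.pyplot as plt

lag_plot(series, lag=1)  # scatter of y(t) against y(t + 1)
plt.show()
```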

### *3.1.3 Autocorrelation and partial autocorrelation plot*

For any collection of time series data samples, the autocorrelation is the most important internal structure to analyze, besides trend and seasonality.

The autocorrelation function (ACF) shows how similar the previous term and the current term are. In fact, autocorrelation shows the correlation coefficients between current and previous values [9].

Autocorrelation can be defined as the second-order moment $E(x_t x_{t+h}) = g(h)$, which is a function of only the time lag h, and independent of the actual time index t.

It measures the degree of linear dependency between the time series at index t and the time series at indices t-h or t+h. A positive autocorrelation indicates that the present and future values of the time series move in the same direction, while a negative autocorrelation indicates present and future values moving in opposite directions [40]. If the ACF is close to 1, there is an upward trend, and an increasing value in the time series is often followed by another increase. Likewise, when the ACF is close to -1, a decrease will probably be followed by another decrease [9].

**Figure 2.** *Run sequence plot of the number of corona infections per day.*

The autocorrelation function is used to determine whether the time series data is stationary or non-stationary. The function can be plotted into a graph, a correlogram called the ACF plot, which plots the autocorrelation of a time series by its lag. It is used in the model identification stage for various Box-Jenkins (ARIMA) models. If the data is truly stationary, the ACF plot will drop to zero very quickly after a few lags, while the line graph in the ACF plot of non-stationary data will converge to zero very slowly.

A partial autocorrelation function (PACF) in time series analysis defines the correlation between $x_t$ and $x_{t+h}$ that is not accounted for by the lags $t+1$ through $t+h-1$. It actually measures the correlation between the time series and a lagged version of itself, after eliminating the variations already explained by the intervening comparisons [18].
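Both correlograms can be drawn with statsmodels, as sketched below (`series` as before; the number of lags shown is an illustrative choice):

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plot_acf(series, lags=40)
plot_pacf(series, lags=40)
plt.show()
```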

The ACF and PACF plots can tell whether an autoregressive (AR) or moving average (MA) model, or both, would be appropriate. If the PACF plot shows a few significant spikes in the early lags but cuts off afterwards, while the ACF tails off slowly, then it is recommended to model the time series data with an AR model, and use the AR component (p > 0) in the ARIMA model. If the ACF plot shows significant spikes only in the early lags while the PACF tails off, then it is recommended to use the MA model, and make use of the MA component (q > 0) in the ARIMA model. In cases where both charts show many significant spikes, it is recommended to model the data with an autoregressive moving average (ARMA) or autoregressive integrated moving average (ARIMA) model [40].

ACF and PACF plots are also useful for hyperparameter tuning of ARIMA models; both plots can indicate upper and lower bound values in the grid search for the p and q parameters [40].

### *3.1.4 Other plots*

Other visualization plots that can be used to understand the time series data are commonly used data plots in statistics, such as the QQ-plot, histogram, and boxplot.

A QQ-plot helps to determine whether or not data points are normally distributed [9, 11]. A histogram represents the distribution of numerical data, and shows the shape of the variable's distribution. The boxplot displays how data is distributed, based on minimum value, first quartile, median, third quartile, and maximum value. The smaller the boxplot, the less variability in the data values [41]. These plots can easily be created with Python's matplotlib and seaborn libraries.

### **3.2 Data wrangling**

The data wrangling process applied in this research contains data pre-processing techniques such as data preparation, feature selection, feature engineering, and data aggregation.

Data pre-processing is an important but also time-consuming process in the field of data science, which has gained importance over the past decade. This is because most CI algorithms were not designed for time series data [42].

Also, most CI models rely on high-quality data in order to achieve good modeling performance in real-world environments [43]. Thus, to effectively run a model and yield results, pre-processing of real-time and time series data is necessary.

In any large dataset, only the most relevant features are selected, and irrelevant information, which does not have any influence on the desired output, is removed in a data pre-processing step called feature subset selection. This consequently leads to dimensionality reduction, and to a learning algorithm that operates faster and more effectively on simpler input data [41].

Feature engineering and data aggregation are also applied, since the cumulative values from the original dataset are transformed into daily values by means of differencing, groupby mechanisms, and resampling the data to daily sums. This creates new features at the day scale, which makes the COVID-19 data more suitable for analytical and modeling purposes [32, 39, 44].
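A sketch of this transformation with pandas is shown below; the file and column names are assumptions about the dataset, not its actual schema:

```python
import pandas as pd

raw_df = pd.read_csv("covid_cases.csv", parse_dates=["date"])  # hypothetical file/columns
daily = (raw_df.groupby("date")["total_reported"].sum()  # aggregate all records per date
               .resample("D").sum()                      # regular daily frequency
               .diff())                                  # cumulative counts -> daily new cases
```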

These data transformation techniques are also able to reduce noise in the time series data, and produce a smoothing of the original time series [11].

For example, in the run sequence plots, a 7-day moving average calculated over the daily sums is plotted in the same plot as the daily sums, which has a smoothing effect on the data. This statistic is computed by sliding a window, in this research seven days, over the daily series, aggregating the data for every window [11].

In the later modeling stage, the data from the used dataset is divided into a smaller data subset, by a process called instance reduction [43]. For modeling simplicity, only months that contain data for every day of the month are considered. Since the used dataset starts recording coronavirus infections from 13 March 2020, March 2020 is not included in the used data. The input data for every model contains the number of coronavirus infections from April 2020 up to and including April 2021.

### **3.3 Time series analysis**

Time series data can be considered a series of measurements, or a sequence of observations, that are indexed in time order [45]. An important aspect of time series work is fitting models on historical data, and using these models to predict future observations.

Not every chunk of data can be considered and treated in a similar way. Data is only considered time series data when it has a datetime measurement, also called a time component. This makes time series data different from other types of data, but also more difficult to interpret. Time series data specifically adds a time dimension, which implies an explicit order dependence between observations: each data point in a time series depends on previous data points from that same series [13].

Time series analysis is about the use of statistical methods and machine learning algorithms to extract information and characteristics from data, in order to predict future values based on stored past time series data [23]. Time series analysis differs from other analyses in supervised machine learning. A time series cannot be treated as a standard linear regression problem, since the assumption that observations are independent does not hold [6]. In time series data, each value depends on the previous value and is so-called lag dependent. Therefore time series analysis cannot be solved with simple linear regression.

Time series analysis is the phase in the whole time series process that follows right after the exploratory data analysis. In a typical time series plot, the x-axis carries the time variable, showing the value of a numerical variable plotted on the y-axis at specific datetime points and time intervals [46].

Time series data can be univariate or multivariate. Univariate time series data are datasets containing a single series of observations with a temporal ordering; a model is required to learn a function from the series of past observations to predict one or more new output values. In multivariate time series data there is more than one variable observed at each time step [4, 15].

### *3.3.1 Stationarity*

One very important characteristic of time series data is that the data needs to be stationary before making forecasts with many classical ML models. Time series data that are "stationary" have values that fluctuate around a constant mean, or have a constant variance that does not change over time [11, 28]. This means that a change in time, for example taking the time series of a newer year or month, does not change the shape of the distribution when the data is stationary [9]. Also, stationary time series have no predictable pattern in the long term [23].

Non-stationarity is in many cases caused by fluctuations in trend or seasonality [6, 7]. When the data is non-stationary, it can be made stationary with differencing, or analyzed with a method called time series decomposition [11, 15].

### *3.3.2 Differencing*

A time series can be made stationary with first-order or higher-order differencing [20]. Differencing transforms the data in such a way that a previous observation is subtracted from the current observation, thereby removing trend and seasonality structure from the time series data [18, 23].

The differencing approach is used as one of the main parameters in the Box-Jenkins ARIMA model [45]. First-order differencing (d = 1) is mathematically formulated as $x_t' = x_t - x_{t-1}$, and second-order differencing (d = 2) as $x_t'' = x_t - 2x_{t-1} + x_{t-2}$ [11].

In this research, first-order differencing is applied, which can easily be computed with the .diff() function built into pandas.

### *3.3.3 Time series decomposition*

Data can often consist of multiple patterns, and can show linear or cyclic behavior. Therefore splitting the data into multiple components can be very beneficial to improve understanding, and to discover irregularities or white noise. The process of splitting or dividing time series data into multiple components is called time series decomposition [18].

Any time series model can be decomposed into trend, seasonality, and irregular components. The trend component is the pattern and behavior of the data in the long term: a certain pattern of growth of the data, and a description of the variable over a certain period of time. The seasonal component is a particular pattern in the time series data that is repeated at specific time periods. Irregular components can be considered data that is far off-trend; they contain abnormal values, sometimes called residuals or outliers, and are also referred to as "white noise". Time series data can also contain cyclical components. These can be considered movements observed after every few units of time, but they occur less frequently than seasonal fluctuations [11].

The objective of time series decomposition is to model the long-term trend and seasonality, and to estimate the overall time series as a combination of them [11]. A time series decomposition model can be additive or multiplicative. When the time series data appears to have any sort of changing seasonality pattern, the multiplicative decomposition model is recommended; in other cases the additive model is endorsed [7].
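With statsmodels this decomposition is one call; the additive model and the weekly period below are illustrative choices:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(series, model="additive", period=7)
decomposition.plot()  # trend, seasonal, and residual (irregular) components
```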

### *3.1.4 Augmented Dickey-Fuller test*

The augmented Dickey-Fuller (ADF) test is a statistical unit root test that is used to determine the stationarity of a time series, and the magnitude of the trend component [8, 11]. It is a hypothesis test: a time series can be considered stationary, with 95% confidence, when the ADF test's p-value is less than 0.05, rejecting the null hypothesis H0 of a unit root [18].

The ADF unit root test should first be applied to the original, undifferenced time series, to check whether the data is already stationary. If not, the data should be made stationary with first-order or higher-order differencing [11, 15, 20].
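A sketch of this workflow with the statsmodels implementation of the test (`series` as before):

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(series)[:2]
if p_value >= 0.05:                      # cannot reject the unit root: not stationary
    series = series.diff().dropna()      # first-order differencing, then retest
```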

### *3.1.5 Smoothing with moving averages*

Smoothing can be defined as the removal of noise from data, and can be applied in both regression and clustering problems [44]. In time series data, smoothing can be applied as a rolling moving average over a number of time steps [18]. In this case a one-week (7-day) or one-month (30-day) moving average can be computed. The moving average changes for each day, since the method makes use of a sliding window, taking the average over different days [21].

Smoothing is also applied as a modeling technique that assigns weights to observations, where the most recent observations receive more weight than observations further in the past. Examples of these techniques are single exponential smoothing and triple (Holt-Winters) exponential smoothing.
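Both smoothing models are available in statsmodels; in the sketch below the smoothing factor and the weekly seasonal period are assumptions:

```python
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

ses = SimpleExpSmoothing(series).fit(smoothing_level=0.2, optimized=False)
hwes = ExponentialSmoothing(series, trend="add", seasonal="add",
                            seasonal_periods=7).fit()  # triple (Holt-Winters) smoothing
```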

### **3.4 Model construction and implementation**

The final objective of time series analysis is the development of one or more mathematical or advanced deep learning models that can explain the observed behavior of a time series, and possibly forecast future states of the series.

The construction and implementation of multiple time series forecasting models can be divided into the following parts:

- data pre-processing and data plotting;
- model building;
- model evaluation;
- model improvement.

One of the first steps in performing time series research is to determine what data will be used for training a computational intelligence model, and what data will be used to test the performance of that model. The input sequence is divided into a training set and a test set. The sequence contains thirteen months of data, measured and aggregated to a total of 395 observations, where each observation is one day. Of those 395 days, 306 days are taken as the training set, and the remaining 89 days form the test set. Of the total time series data, around 77.5% is used as training data, and approximately 22.5% as testing data.
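As a sketch, the split reduces to simple slicing of the daily series (`series` as before):

```python
train, test = series[:306], series[306:]
assert len(train) == 306 and len(test) == 89  # ~77.5% / ~22.5%
```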

### *3.4.1 Data pre-processing and data plotting*

Before data is used as input for a CI model, it is pre-processed first. For the dataset, only the most relevant features are selected, in a process called feature selection. Secondly, all relevant data is transformed into a shape that is useful for modeling with data aggregation, which includes rescaling and resampling the time series data.

Before data is fed to a CI model, the data is plotted to recognize its underlying structure and to determine which models may be suitable for modeling the time series. For the classical time series models, the run sequence plot is drawn with a seven-day moving average applied as a method for smoothing the data. The ARIMA model involves multiple data plots in its pre-modeling phase, including the run sequence plot, lag plot, QQ plot, ACF plot, and PACF plot. Before constructing the CNN, it is very beneficial to have a sufficient understanding of the underlying data patterns. Therefore, before building the CNN model, the data was plotted and interpreted with run sequence, ACF, and first-order differences plots.

### *3.4.2 Model building*

Since the input time series data is univariate and contains one column, it is best practice to extract the whole time series and store it in a variable [9].

The baseline model constructed in this research simply calculates the average of all the training examples, and takes that average as its forecast.

Any AR model can be defined as an ARIMA(x, 0, 0), where x is the autoregressive parameter, a positive integer typically ranging between 1 and 5. The MA model is defined as an ARIMA(0, 0, x), where x is the moving average parameter, also a positive integer that in many cases ranges between 1 and 5. In the ARIMA model, all three parameters of the ARIMA(p, d, q) model need to be defined, including the differencing parameter d, because the data is non-stationary. In both exponential smoothing models, the smoothing factor needs to be determined by manual input. An optimal search would be to run a smoothing model twice, with smoothing factors of 0.1 and 0.9, and check the performance metrics. It can then be determined whether the smoothing factor needs a high value close to 1, or a low value close to 0.
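A statsmodels sketch of these model definitions follows; the orders shown are placeholders, not the tuned values of this research:

```python
from statsmodels.tsa.arima.model import ARIMA

ar = ARIMA(train, order=(2, 0, 0)).fit()     # AR model: ARIMA(x, 0, 0)
ma = ARIMA(train, order=(0, 0, 2)).fit()     # MA model: ARIMA(0, 0, x)
arima = ARIMA(train, order=(2, 1, 2)).fit()  # full ARIMA(p, d, q) with d >= 1
```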

In the CNN model, the training and testing data can be split with the split\_sequence function. After reshaping the input data, the CNN operations transform the sequence into a 1-D output vector. The model is fitted, and the best predictions are made after training for two thousand epochs.
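A common form of such a split\_sequence helper is sketched below, following the widely used pattern for framing a univariate series as supervised samples; the exact implementation used in the research may differ:

```python
import numpy as np

def split_sequence(sequence, n_steps_in, n_steps_out):
    """Frame a univariate sequence as input windows and multi-step output vectors."""
    X, y = [], []
    for i in range(len(sequence) - n_steps_in - n_steps_out + 1):
        X.append(sequence[i:i + n_steps_in])
        y.append(sequence[i + n_steps_in:i + n_steps_in + n_steps_out])
    return np.array(X), np.array(y)

X, y = split_sequence(train.values, n_steps_in=7, n_steps_out=3)
X = X.reshape((X.shape[0], X.shape[1], 1))  # [samples, timesteps, features] for Conv1D
```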

When running and testing any model, it is run against the testing set to predict data it has not seen before.

### *3.4.3 Model evaluation*

To evaluate a model's performance, some intuition about its characteristics is required first, before making any judgments. This can be done by summarizing or describing its outcomes.

The baseline model is summarized with the .describe() function, which gives the count, mean, standard deviation, and all the boxplot values [11].

All other models are summarized with the .summary() function provided by statsmodels.

The essential aspect in model evaluation is determining how good its forecasting results are. Performance measures such as the (root) mean squared error are a way of measuring the performance of a model [1, 3]. The following two performance measures determine the effectiveness of each model: the root mean squared error (RMSE) and the mean absolute error (MAE).
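A sketch of both metrics with scikit-learn, where `test` and `predictions` are the assumed test set and model forecasts:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(test, predictions))
mae = mean_absolute_error(test, predictions)
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}")
```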



### *3.4.4 Model improvement*

When a model performs less accurately than expected, the model can be improved. The model improvement step is however not mandatory, and is only needed when a model performs more poorly than was originally intended.

In any AR, MA, ARMA, or ARIMA model, the modeling performance can be improved with hyperparameter tuning. It is considered important in CI, since it evaluates the model under different configurations, in order to find the set of hyperparameters that yields the best predictive performance [39].

Grid search is a search method that can perform hyperparameter tuning. It is a brute-force, semi-automatic search method that explores all possible model configurations within a user-specified parameter range, and is considered an exhaustive search that often takes a relatively long runtime [22, 39].

One important metric used in tuning hyperparameters is the Akaike information criterion (AIC). It measures the relative quality of the model being considered for the description of the phenomenon, and is proven to be fast and efficient. Its value gives an estimate of the information lost when a specific order of the model is considered. The smaller the value of the AIC, the less information is lost, and the more accurate the model is considered to be [9].
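An illustrative AIC-driven grid search over ARIMA orders, with assumed parameter ranges, could look as follows:

```python
import itertools
from statsmodels.tsa.arima.model import ARIMA

best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(6), range(2), range(6)):  # assumed search bounds
    try:
        aic = ARIMA(train, order=(p, d, q)).fit().aic
    except Exception:
        continue                      # skip configurations that fail to converge
    if aic < best_aic:
        best_aic, best_order = aic, (p, d, q)
print(best_order, best_aic)           # the order losing the least information
```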

## **4. Used models**

### **4.1 Baseline model**

The research started with the setup of a simple and consistent baseline model. The purpose of a baseline model lies in its simplicity and in setting the boundaries for all the other models. It also helps in understanding the data better, and can determine whether or not more data preparation or feature engineering is necessary [48]. If a more complex model performs worse than the baseline model, it can be considered a poor model for forecasting the specific dataset.

The simple average model discussed in the previous section serves in this research as the baseline model. A simple average can be implemented with the formula [21]:

$$
\hat{y}_{t+1} = \frac{1}{t} \sum_{i=1}^{t} y_i \tag{6}
$$
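As code, the baseline is a one-liner: every test day is forecast with the mean of the training set (`train` and `test` as in the earlier split sketch):

```python
import numpy as np

baseline_forecast = np.full(len(test), train.mean())  # constant simple-average forecast
```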

Most baseline models serve as a benchmark in forecasting research for comparing new methods to this simple method [6].

Since pandemic infections have the tendency to suddenly increase, but in some cases also to decrease very quickly, a naive or persistence model, which predicts the last seen value, would not be a suitable baseline model.

### **4.2 Classical machine learning models**

The classical machine learning, or statistical, models that are implemented assume that observations are continuous, that time is discrete and equally spaced, and that there are no missing observations [28].

The classical machine learning methods that are implemented are the moving average and autoregressive models, the simple exponential smoothing model, and the triple (Holt-Winters) exponential smoothing model. Special focus in the research is spent on modeling the ARIMA model, since ARIMA is a very popular and quite accurate model in time series forecasting.

### **4.3 CNN model**

In this research one promising neural network is implemented and tested on the COVID-19 data: the convolutional neural network (CNN) model. The CNN implemented in this research treats the input data as a sequence over which convolutional read operations can be performed, in a similar fashion as with one-dimensional images [49].

The CNN model is a univariate, multi-step, one-dimensional vector output forecasting model. Univariate means that one feature will be forecast, and multi-step means that the output is a sequence with multiple output values.

The CNN architecture includes an activation function and an optimization algorithm. The activation function used in the convolutional neural network is the ReLU activation function. It is a function that is able to overcome the problem of exploding and vanishing gradients, which occur in typical ANNs like the MLP and RNN [4]. It can be defined as R(z) = max(0, z) [31, 32]. When activated, negative and zero inputs produce zero output, while positive inputs pass through unchanged. The ReLU activation function therefore consistently filters out negative numbers [50].

The optimization function used in the CNN model is the Adam algorithm, a modified version of the SGD algorithm and an adaptive learning rate optimization algorithm. It uses running averages of both the gradients and the second moments of the gradients [31]. It has a built-in TensorFlow implementation, and requires the learning rate parameter to operate [22]. The learning rate, in many cases denoted as α, indicates the pace at which the weights in the neural network are updated [32]. Adam is currently the most commonly used optimization algorithm in artificial neural networks [3, 51].

### **4.4 Python packages**

The research conducted with the classical statistical models relies on a few well-known and frequently applied Python packages, such as Pandas, Numpy, and Matplotlib. The Seaborn library is used for some data visualizations. For specific time series analysis the Python library Statsmodels is often used. It is a Python package that includes basic tools and models for time series analysis and modeling, and is specifically built for time series data [43]. It also provides all the functionality required to model ARIMA and exponential smoothing models [26].

Another package that is applied is pmdarima, which is used in the hyperparameter tuning of the ARIMA model [14, 52].

Sklearn, a well-known Python library for machine learning, is used for evaluating the performance metrics of all trained CI models [43].

In the modeling phase of the 1-D sequential CNN model, the Keras library performs all CNN operations. Keras is a Python library that is extensively used in many deep learning modeling situations.

### **5. Research results**

In the time series analysis results, the lag plot as displayed in **Figure 3** indicates that the data containing the daily COVID-19 cases is non-random. Since the plot clearly shows a linear structure between y(t) and its lag y(t + 1), the data can be considered non-random and suitable for time series forecasting.

**Figure 3.**
*Lag plot of the number of corona infections per day.*

**Figure 4.**
*First-order differences of the number of Covid-19 infections.*

**Figure 5.**
*ACF plot of the number of Covid-19 infections.*

First-order differences as displayed in **Figure 4** show the daily changes, and clearly indicate upward and downward trends in the data from October 2020 up to and including April 2021. **Figure 5** shows the autocorrelation function of the data for the first 400 lags. The ACF graph moves towards zero only slowly, indicating that the data is non-stationary. Therefore, differencing and applying an ARIMA model with the differencing parameter set to a value of at least 1 is strongly advised.


**Table 1.**

*Performance of all seven implemented models.*

In the modeling phase, all models were constructed, implemented, and produced forecasts in the Jupyter Notebook environment, in Python 3 code. Each forecast made by each of the seven models was measured against the test set, resulting in a root mean squared error (RMSE) and a mean absolute error (MAE) score. **Table 1** shows the RMSE and MAE performance of all implemented models.

The results as displayed in **Table 1** show that five out of six models were able to make better forecasts than the forecasts made by the baseline model. The AR, ES, HWES, ARIMA and CNN model made forecasts that were all slightly or significantly better than the forecast made by the simple average model. Only the moving average (MA) model made forecasts that resulted in almost identical RMSE and MAE error scores, compared to the performances of the baseline model.

The research results show that the 1-D sequential CNN with vector output is the best performing model on the test data. **Figure 6** displays the forecasts made by the CNN model, which clearly show significant alignment with the test data. The model's relatively good results could indicate that convolutional operations perform well on 1-D sequence data, and are able to make better forecasts than traditional machine learning models. The 1-D sequential CNN made approximately five times more accurate forecasts than the ARIMA model: the CNN model has an RMSE of 409.86 and an MAE of 315.99, compared to an RMSE of 2172.88 and an MAE of 1721.42 for the ARIMA model. The initially set goal, a CNN model able to outperform the prominent and accurate ARIMA model, has clearly been achieved.

### **Figure 6.**

*CNN forecasting performance (green) and actual infections (blue and orange) on the number of corona infections per day, in time interval February – April 2021.*


Another favorable model is the Holt-Winters exponential smoothing (HWES) model, with an RMSE of 1611.34 and an MAE of 1379.34. It performed approximately 25 percent better on the RMSE metric, and made around 20 percent better forecasts according to the MAE metric, than the ARIMA model. The single exponential smoothing (ES) model also slightly outperformed the ARIMA model, by about 10 percent.

### **6. Conclusion**

Many sources claim that classical statistical methods, like ARIMA and exponential smoothing, achieve better performance than standard deep learning models, like MLPs and RNNs, on smaller datasets.

However, the one-dimensional CNN model with vector output made forecasts that were five times more accurate than the forecasts made by the ARIMA model, on a smaller dataset containing fewer than one thousand observations. These findings support the claim that neural networks are resistant to errors and some outliers in the underlying dataset, which makes them useful in the analysis and prediction of larger, and sometimes even smaller, time series datasets. Classical machine learning methods like ARIMA and exponential smoothing also fail to identify and capture nonlinear and complex behavior of time series. Thus, in the case of pandemics, where the data does not show a clear trend and data patterns are relatively hard to extract, neural networks can be a solution to this complexity.

Nevertheless, predicting the future with accurate forecasts remains a very difficult or even impossible task. This is because of the presence of confounding variables, for example human decision making processes, that cannot be modeled upfront in any of the models. However, the CNN model has proven its potential in predicting and forecasting new coronavirus infections. When dealing with viruses that behave similarly to COVID-19, artificial neural networks can, in some cases, simulate future values surprisingly well and close to actual future values.

### **Abbreviations**

ACF autocorrelation function
ADF augmented Dickey-Fuller
AIC Akaike information criterion
ANN artificial neural network
AR autoregressive
ARIMA autoregressive integrated moving average
AutoML automatic machine learning
CI computational intelligence
CNN convolutional neural network
ES (single) exponential smoothing
GRU gated recurrent unit
HWES Holt-Winters exponential smoothing
LSTM long short-term memory
MA moving average
MAE mean absolute error
MLP multilayer perceptron
PACF partial autocorrelation function
ReLU rectified linear unit
RMSE root mean squared error
RNN recurrent neural network
SGD stochastic gradient descent