*Prediction Analysis Based on Logistic Regression Modelling DOI: http://dx.doi.org/10.5772/intechopen.103090*


#### **Table 1.**

*Variables from the database.*

The variables are chosen with the selected method, Forward Wald. This method adds variables to the model or removes them from it by using two statistics: the score of Rao and the Wald statistic.

The score of Rao is used to test, for each independent variable Xj, the null hypothesis H0: Bj = 0, that is, that the regression coefficient B associated with the variable in the model is null. Among the proposed independent variables, the one with the smallest associated p-value is selected to enter the model, provided that this p-value is less than 0.05.

The Wald statistic is used to test the same null hypothesis, H0: Bj = 0, but in this case for the independent variables that have already been selected and have entered the model.

A variable whose Wald statistic has an associated p-value greater than 0.1 is eliminated; this threshold is the default option of the program.


#### **Table 2.**

*Variables created from existing variables.*

According to the criteria described above, the procedure runs through several steps in which independent variables are entered and eliminated.

At step 0, only the constant is introduced into the model. For this constant, the relevant quantities are B (the regression coefficient), the standard error of the estimate (SE), the Wald statistic with its degrees of freedom (df), and the associated p-value. When this p-value is less than 0.1, the constant is considered to be significant.

All the independent variables are outside the model at this step, and one of them has to be selected to enter the model in step 1. The variable with the smallest p-value associated with the score of Rao is selected, provided that this p-value is less than 0.05. Note that the variables created from a categorical variable must be considered as a whole.

If two or more variables have the same p-value, the score itself is then compared, and the variable with the largest score enters the model in step 1.
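The entry rule described above can be sketched in a few lines of Python. This is only an illustration: the variable names, scores and p-values are invented, and `P_ENTRY` and `select_entry` are hypothetical names, not part of the statistical program used in the study.

```python
# Hypothetical score-test results for variables not yet in the model:
# name -> (score of Rao, p-value). All values are invented.
candidates = {
    "X1": (12.4, 0.0004),
    "X2": (3.1, 0.0780),
    "X3": (15.9, 0.0004),
}

P_ENTRY = 0.05  # entry threshold quoted in the text


def select_entry(candidates, p_entry=P_ENTRY):
    """Return the variable that enters next, or None if none qualifies."""
    eligible = {k: v for k, v in candidates.items() if v[1] < p_entry}
    if not eligible:
        return None
    # Smallest p-value first; ties are broken by the larger score.
    return min(eligible, key=lambda k: (eligible[k][1], -eligible[k][0]))


chosen = select_entry(candidates)
print(chosen)
```

Here X1 and X3 share the smallest p-value, so the tie is resolved in favour of X3, which has the larger score; X2 is not eligible because its p-value is not below 0.05.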

Once the variable enters the model, we should study the Wald statistic, given by:

$$\text{Wald} = \left(\text{B/SE}\right)^2\tag{1}$$

If its p-value is above 0.1 (the output value, POUT), the corresponding variable is eliminated (as a whole in the case of categorical variables). Elimination is always carried out before a new variable is selected.
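As a minimal sketch of this removal check, the Wald statistic of Eq. (1) and its p-value can be computed as follows. The sketch assumes a single coefficient, so that the Wald statistic is referred to a chi-square distribution with 1 degree of freedom; categorical variables handled as a whole have more degrees of freedom and are not covered. The coefficient and standard error are invented values.

```python
import math

P_OUT = 0.1  # removal threshold (POUT) quoted in the text


def wald_statistic(b, se):
    """Wald = (B / SE)^2, as in Eq. (1)."""
    return (b / se) ** 2


def wald_p_value(wald):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom.

    For 1 degree of freedom, P(X > w) = erfc(sqrt(w / 2)).
    """
    return math.erfc(math.sqrt(wald / 2.0))


# Hypothetical coefficient and standard error (illustrative only).
b, se = 0.8, 0.5
w = wald_statistic(b, se)
p = wald_p_value(w)
remove = p > P_OUT
print(f"Wald = {w:.3f}, p-value = {p:.4f}, remove = {remove}")
```

With these illustrative values the p-value is slightly above 0.1, so the variable would be removed under the stated threshold.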

After this, another variable is selected (or not) to enter the model in the next step. If no variable can be selected because of the p-values of the score of Rao, the process terminates and the model is presented as a mathematical formula:

$$Z = B_1 X_1 + \dots + B_q X_q + B_0 \tag{2}$$

where q is the number of independent variables and B1, … , Bq are the regression coefficients of the independent variables included in the model.

This model explains the probability that the dependent variable equals 1, that is, the probability of an incident involving a loss of position. The parameters that must be estimated are the regression coefficients B0, B1, … , Bq.

The SE column gives the standard error of each estimated coefficient, which is needed to calculate the Wald statistic.

From here, the probability p of a case having an excursion is given by:

$$p = 1/\left(1 + e^{-Z}\right) \tag{3}$$

So the probability p can be obtained for each case. When p is less than 0.5, the model classifies the case in the first group (no excursion); when p is greater than 0.5, the model predicts that the case has an excursion.
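The step from the linear combination Z to the probability p and the predicted group can be illustrated with a short Python sketch. The coefficients and the case below are invented, and the function names are hypothetical. The text does not say how a case with p exactly equal to 0.5 is treated; the sketch assigns it to the first group, which is an assumption.

```python
import math


def linear_predictor(coeffs, intercept, x):
    """Z = B1*X1 + ... + Bq*Xq + B0, as in Eq. (2)."""
    return sum(b * xi for b, xi in zip(coeffs, x)) + intercept


def probability(z):
    """p = 1 / (1 + e^(-Z)), as in Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(p, cutoff=0.5):
    """Return 1 (excursion) if p exceeds the cutoff, else 0.

    Assumption: p equal to the cutoff is assigned to group 0.
    """
    return 1 if p > cutoff else 0


# Hypothetical coefficients B1, B2, constant B0, and one case (X1, X2).
coeffs, intercept = [1.2, -0.7], -0.5
case = [1.0, 2.0]

z = linear_predictor(coeffs, intercept, case)
p = probability(z)
print(f"Z = {z:.2f}, p = {p:.3f}, predicted group = {classify(p)}")
```

For this invented case Z is negative, so p falls below 0.5 and the case is assigned to the first group (no excursion).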

Moreover, the probability of not having any loss of position is:

$$q = 1 - p \tag{4}$$

Furthermore, the relative ratio is defined as:

$$\begin{split} p/q &= \frac{1 / \left( 1 + e^{-Z} \right)}{1 - 1 / \left( 1 + e^{-Z} \right)} \\ &= \frac{1 / \left( 1 + e^{-Z} \right)}{\left( 1 + e^{-Z} - 1 \right) / \left( 1 + e^{-Z} \right)} = \frac{1}{e^{-Z}} = e^{Z} \end{split} \tag{5}$$

Then, the mean relative ratio can be obtained. According to the definition of the relative ratio, the i-th incident is more likely to involve an excursion if p/q > 1.0, while an incident with p/q < 1.0 is more likely not to involve an excursion.
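The identity p/q = e^Z of Eq. (5) can be checked numerically. The following Python sketch does so for a few arbitrary values of Z, which are chosen only for illustration; `odds_from_z` is a hypothetical helper name.

```python
import math


def odds_from_z(z):
    """Relative ratio p/q computed from p and q = 1 - p."""
    p = 1.0 / (1.0 + math.exp(-z))
    q = 1.0 - p
    return p / q


for z in (-1.0, 0.0, 2.0):
    ratio = odds_from_z(z)
    # Eq. (5) states that the ratio equals e^Z.
    assert math.isclose(ratio, math.exp(z), rel_tol=1e-9)
    side = "excursion" if ratio > 1.0 else "no excursion (or tie)"
    print(f"Z = {z:+.1f}: p/q = {ratio:.4f} -> {side}")
```

A positive Z gives a ratio above 1.0 and a negative Z gives a ratio below 1.0, which matches the reading of the relative ratio given in the text.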

#### **4.2 Goodness of fit**

Presenting the model is not enough: its goodness of fit must be checked to decide whether the model is adequate.

We have estimated the probability that an incident does or does not involve an excursion, but this estimate does not necessarily match reality. According to the model, a case can have a higher probability of belonging to the first group (no excursion) and yet belong to the second group (excursion). The problem is greater when the probabilities are close to 0.5. In such a case there is an error, the difference between the observed probability and the estimated probability, Ei = p observed,i – p estimated,i, where p observed,i takes the value 1 or 0 depending on whether or not the case belongs to the second group.

Assessing the goodness of fit means checking how probable the obtained results are under the estimated model. It is based on comparing the observed number of cases that belong to the second group (excursion = yes) with the number expected if the model is valid. This expected number is the product of the total number of cases in the sample and the estimated probability of belonging to the second group.

The statistic -2 Log Likelihood (abbreviated -2LL) is used to assess this fit. Low values of -2LL indicate a high likelihood: the closer -2LL is to zero, the greater the likelihood.
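As an illustration, -2LL can be computed from the observed groups and the estimated probabilities with the usual binary log-likelihood, in which each case contributes log(p) if it belongs to the second group and log(1 − p) otherwise. The data below are invented, and the sketch assumes that every estimated probability lies strictly between 0 and 1.

```python
import math


def minus_two_log_likelihood(observed, estimated):
    """-2LL for binary outcomes (0/1) and estimated probabilities."""
    log_lik = 0.0
    for y, p in zip(observed, estimated):
        # Each case contributes log(p) if y == 1, else log(1 - p).
        log_lik += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * log_lik


# Hypothetical observed groups and two sets of estimated probabilities.
observed = [1, 0, 1, 0]
good_fit = [0.9, 0.1, 0.8, 0.2]
poor_fit = [0.6, 0.5, 0.4, 0.5]

m2ll_good = minus_two_log_likelihood(observed, good_fit)
m2ll_poor = minus_two_log_likelihood(observed, poor_fit)
print(f"-2LL good fit: {m2ll_good:.3f}, poor fit: {m2ll_poor:.3f}")
```

The set of probabilities that follows the observed groups more closely yields the smaller -2LL.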

Also, the following statistic can be used to compare the observed probabilities with those estimated from the model:

$$Z^2 = \sum_{i=1}^{n} \frac{E_i^2}{p_{\text{estimated},i} \left(1 - p_{\text{estimated},i}\right)}\tag{6}$$
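A minimal Python sketch of this statistic is given below. It reads Eq. (6) as the sum over all cases of Ei squared divided by the estimated probability times one minus the estimated probability, with Ei the difference between the observed value (1 or 0) and the estimated probability; this reading is an assumption about the intended formula. The data are invented.

```python
def z_squared(observed, estimated):
    """Sum over cases of E_i^2 / (p_i * (1 - p_i)), with E_i = y_i - p_i."""
    total = 0.0
    for y, p in zip(observed, estimated):
        e = y - p  # error for this case
        total += e * e / (p * (1.0 - p))
    return total


# Hypothetical observed groups and estimated probabilities.
observed = [1, 0, 1, 0]
estimated = [0.9, 0.1, 0.8, 0.2]

z2 = z_squared(observed, estimated)
print(f"Z^2 = {z2:.4f}")
```

Each term is small when the estimated probability is close to the observed value, so a small total points to estimated probabilities that agree well with the observations.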

Both -2LL and Z² follow a chi-square distribution with n-2 degrees of freedom under the hypothesis that the model fits the observed data. In addition, the percentage of correctly classified cases is obtained once the model has been defined.

When the percentage of correctly classified cases is high, the model is expected to give good results when predicting whether or not an incident will have an excursion.
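The percentage of correctly classified cases can be sketched as follows, assuming the 0.5 cut-off on the estimated probability. The data are invented and `percent_correct` is a hypothetical helper name.

```python
def percent_correct(observed, estimated, cutoff=0.5):
    """Percentage of cases whose predicted group matches the observed one."""
    predicted = [1 if p > cutoff else 0 for p in estimated]
    hits = sum(1 for y, y_hat in zip(observed, predicted) if y == y_hat)
    return 100.0 * hits / len(observed)


# Hypothetical observed groups and estimated probabilities.
observed = [1, 0, 1, 0, 1]
estimated = [0.9, 0.2, 0.4, 0.6, 0.7]

pct = percent_correct(observed, estimated)
print(f"Correctly classified: {pct:.1f}%")
```

In this invented sample three of the five cases are classified correctly; the two misclassified cases are those whose estimated probabilities fall on the wrong side of 0.5.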
