**2. Method**

In this section, an advanced accident prediction model is developed, which enables to rank risky LXs accurately and identify the significant impacting parameters efficiently. The model considers the average daily road traffic, the average daily railway traffic, the annual road accidents, the vertical road profile, the horizontal road alignment, the road width, the crossing length, the railway speed limit, and the geographic region. The nonlinear least-squares (NLS) method, Poisson regression method, NB regression method, ZIP regression method, and ZINB regression method are employed to estimate the respective coefficients of parameters in the prediction model.

#### **2.1 Data sources and coding**

The dataset used in our study, which cover SAL2 LXs in 21 administrative regions in mainland France from 2004 to 2013, has been provided by SNCF Réseau (the French national railway infrastructure manager). Moreover, the dataset includes 10 years of information about annual LX accident frequency, annual roadway accident statistics and railway, roadway, and LX characteristics. In total, there are 8332 public SAL2 LXs involved in our investigation. The impacting parameters relevant to LX accidents considered in our investigation can fulfill the following characteristics: (1) important in determining accident frequency, (2) more permanent in nature (e.g., sight obstruction noted as a problematic factor due to involved alterable construction topography, vegetation, and other environmental elements), and (3) not accidentdependent [18]. The statistical characterization of parameters considered in this investigation are shown in **Table 1**. It is worth noticing that the road accident factor is reflected by the ratio of the annual number of road accidents in a given year to the average number of road accidents per year over the period of 10 years considered, while the region risk factor is reflected by the general accident frequency per SAL2 in the corresponding region. Overall, the data coding is shown in **Table 2**.

#### **2.2 Advanced accident prediction model**

Here, we define that the formula of the conventional traffic moment is given as: Traffic moment = Road traffic frequency � Railway traffic frequency [19]. However, based on some previous analyses [20], we adopt a variant called "corrected moment," or CM for short. *CM* <sup>¼</sup> *<sup>V</sup><sup>a</sup>* � *<sup>T</sup><sup>b</sup>*, where *<sup>a</sup>* <sup>þ</sup> *<sup>b</sup>* <sup>¼</sup> 1 and the optimal value of *<sup>a</sup>* in terms of fitting is calculated to be *a* ¼ 0*:*354 according to the statistical analysis performed by SNCF Réseau [21]. Therefore, we consider *<sup>V</sup>*<sup>0</sup>*:*<sup>354</sup> � *<sup>T</sup>*<sup>0</sup>*:*<sup>646</sup> as an integrated parameter that reflects the combined exposure frequency of both railway and road traffic.

The developed advanced model takes into account various variables as interpreted in **Table 2**. The general form of the model is shown as follows:

$$
\lambda\_{10Y} = K \times F\_{RAac} \times \left(V^a \times T^b\right) \times \exp\left(\mathbf{C}\_{Proflle} \times I\_{Proflle} + \mathbf{C}\_{Align} \times I\_{Align} + \mathbf{C}\_{Wid} \quad \text{(3)}\right)
$$

$$
\times Wid + \mathbf{C}\_{Leng} \times Leng + \mathbf{C}\_{RSL} \times \mathbf{RSL} + \mathbf{C}\_{\text{Reg}} \times F\_{\text{Reg}}\right),
$$


#### **Table 1.**

*Statistical characterization of parameters considered.*

where *λ*10*<sup>Y</sup>* represents the annual accident frequency at a given SAL2 for a period of 10 years; *FRAcc* is the road accident factor, which is a time-dependent variable and reflects the variation of annual road accidents as time advances; *K* is the coefficient of *FRAcc*; *V* denotes the average daily road traffic; *T* denotes the average daily railway traffic; *IProfile* is the profile indicator and *CProfile* is the coefficient of *IProfile*; *IAlign* is the alignment indicator and *CAlign* is the coefficient of *IAlign*; *Wid* is the LX width and *CWid* is the coefficient of *Wid*; *Leng* is the crossing length and *CLeng* is the coefficient of *Leng*; *RSL* is the railway speed limit and *CRSL* is the coefficient of *RSL*; *FReg* is the region factor and *CReg* is the coefficient of *FReg*. Note that this model does not only rank risky LXs accurately but also allow for identifying significant parameters efficiently.

#### *2.2.1 Regression approaches*

In this section, several regression approaches are adopted to estimate the coefficients associated with the parameters of our model. The nonlinear least-squares (NLS) technique and Gauss-Newton algorithm [22] are firstly considered to estimate the variable coefficients in our model. Considering a fitting model function *y* ¼ *f x*ð Þ , *β* , where variable *x* depends on a vector of *l* parameters: *β* ¼ *β*1, *β*2, … , *β<sup>l</sup>* ð Þ. The goal is to find the vector *β* which can let the model function fit best the actual observed data in the least-squares sense. In other words, minimize the sum of residual squares *S* expressed as follows:

*Accident Prediction Modeling Approaches for European Railway Level Crossing Safety DOI: http://dx.doi.org/10.5772/intechopen.109865*


#### **Table 2.**

*Parameters considered and data coding.*

$$S = \sum\_{i=1}^{m} r\_i^2, \ m \ge l,\tag{4}$$

where *ri* is the residual between the fitting model estimation and the actual observation, *ri* ¼ *yi* � *f xi* ð Þ , *β* .

The minimum value of *S* is obtained by solving the gradient function *<sup>∂</sup>S=∂β<sup>j</sup>* <sup>¼</sup> 0, i.e.,

$$\begin{aligned} \partial \mathbb{S}/\partial \boldsymbol{\beta}\_{j} &= 2 \sum\_{i} r\_{i} \boldsymbol{\partial} r\_{i} / \partial \boldsymbol{\beta}\_{j} = \mathbf{0}, \\ \boldsymbol{\beta}\_{j} &\approx \boldsymbol{\beta}\_{j}^{k+1} = \boldsymbol{\beta}\_{j}^{k} + \Delta \boldsymbol{\beta}\_{j}, \end{aligned} \tag{5}$$

where k is the iteration number and Δ*β<sup>j</sup>* is the shift parameter.

At each iteration step, the model is linearized by approximation to the first-order Taylor series expansion about *β<sup>k</sup>* :

$$f(\boldsymbol{x}\_{i},\boldsymbol{\mathfrak{P}}) \approx \boldsymbol{f}\left(\boldsymbol{x}\_{i},\boldsymbol{\mathfrak{P}}^{k}\right) + \sum\_{j=1}^{l} \left(\boldsymbol{\beta}\_{j} - \boldsymbol{\beta}\_{j}^{k}\right) \boldsymbol{\upbeta}\left(\boldsymbol{x}\_{i},\boldsymbol{\mathfrak{P}}^{k}\right) / \partial \boldsymbol{\upbeta}\_{j} \approx \boldsymbol{f}\left(\boldsymbol{x}\_{i},\boldsymbol{\mathfrak{P}}^{k}\right) + \sum\_{j=1}^{l} I\_{ij} \Delta \boldsymbol{\upbeta}\_{j},\tag{6}$$

where *Jij* is the element of Jacobian matrix **<sup>J</sup>** and *<sup>∂</sup>ri=∂β<sup>j</sup>* ¼ �*Jij*. Therefore, *ri* can be rewritten as:

$$\begin{aligned} r\_i &= \Delta y\_i - \sum\_{s=1}^{l} J\_{is} \Delta \beta\_s, \\ \Delta y\_i &= y\_i - f\left(\mathbf{x}\_i, \boldsymbol{\theta}^k\right). \end{aligned} \tag{7}$$

By substituting the above expressions into the gradient equation in Eq. (5), we obtain the normal equation and its matrix notation:

$$\begin{aligned} \sum\_{i=1}^{m} \sum\_{s=1}^{l} J\_{i\bar{j}} J\_{i\bar{s}} \Delta \beta\_s &= \sum\_{i=1}^{m} J\_{i\bar{j}} \Delta \mathbf{y}\_i, \\ \left( \mathbf{J}^T \mathbf{J} \right) \Delta \boldsymbol{\mathcal{J}} &= \mathbf{J}^T \boldsymbol{\Delta} \mathbf{y}. \end{aligned} \tag{8}$$

For an NLS model, *S* should be modified as follows:

$$S = \sum\_{i=1}^{m} W\_{il} r\_i^2, \quad m \ge l. \tag{9}$$

Therefore, the matrix notation of normal equation for an NLS model is expressed as follows:

$$(\mathbf{J}^T \mathbf{W} \mathbf{J}) \Delta \boldsymbol{\mathfrak{J}} = \mathbf{J}^T \mathbf{W} \Delta \mathbf{y}. \tag{10}$$

These aforementioned equations form the basis of the Gauss-Newton algorithm for solving an NLS problem.

In fact, the Poisson regression model shown as Eq. (11) is a natural choice for modeling accident occurrence:

$$\operatorname{Poi}(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}, \ k = 0, 1, 2, \dots, \tag{11}$$

where *Poi X*ð Þ ¼ *k* is the probability of *k* accidents occurring, *k*∈ , and *λ* is the expectation value of the number of accidents.

However, [23] indicates that accident frequency is likely to be over-dispersed (see Eq. (12)) and suggests using the negative binomial (NB) regression model as an alternative to the Poisson model:

$$\text{VAR}(X) \begin{cases} = E(X) \\ > E(X), \text{ for over-dispersed} \\ < E(X), \text{ for under-dispersed} \end{cases} . \tag{12}$$

The NB model as a special case of Poisson-Gamma mixture model is a variant of the Poisson model designed to deal with over-dispersed data [11, 24, 25]. The over-dispersion could come from several possible sources, e.g., omitted variables, uncertainty in exposure data, covariates, or nonhomogeneous LX environment [26]. The NB model considered in this study has the following expression:

$$P\_{NB}(X=k) = \frac{\Gamma\left(k+\frac{1}{a}\right)}{\Gamma(k+1)\Gamma\left(\frac{1}{a}\right)} \left(\frac{1}{1+a\lambda}\right)^{1/a} \left(\frac{a\lambda}{1+a\lambda}\right)^k, \ k = 0, 1, 2, \dots,\tag{13}$$

*Accident Prediction Modeling Approaches for European Railway Level Crossing Safety DOI: http://dx.doi.org/10.5772/intechopen.109865*

where *PNB*ð Þ *X* is the probability of *k* accidents occurring, *k*∈ , *α* is the dispersion parameter, and *λ* is the expectation of the number of accidents.

The relationship between the mean value and the variance in the NB model is given as follows:

$$\text{VAR}(X) = aE(X)^2 + E(X),\tag{14}$$

if *α* <0, there is an under-dispersion; if *α*> 0, there is an over-dispersion; in the case where *α* ¼ 0, the NB model reduces to the Poisson model.

In practice, the count data may contain extra zeros relative to the Poisson or NB distribution. In this case, the ZIP or ZINB regression model is useful for analyzing such data [27]. The ZIP model is expressed as follows:

$$P\_{\rm ZIP}(X=k) = \begin{cases} \boldsymbol{\alpha} + (\boldsymbol{1} - \boldsymbol{\omega}) \exp(-\boldsymbol{\lambda}), & \text{for } k = \mathbf{0} \\ (\boldsymbol{1} - \boldsymbol{\omega}) \exp(-\boldsymbol{\lambda}) \boldsymbol{\lambda}^k / k!, & \text{for } k > \mathbf{0} \end{cases} \tag{15}$$

where *PZIP*ð Þ *X* ¼ *k* is the probability of *k* accidents occurring, *k*∈ , *λ* is the expectation value of the number of accidents, and *log <sup>ω</sup>* 1�*ω* � � <sup>¼</sup> *<sup>z</sup>*<sup>0</sup> *γ* is the ZI link function that *z*<sup>0</sup> is the ZI covariate and *γ* is the corresponding ZI coefficient. The mean value and variance of ZIP model are *E X*ð Þ¼ ð Þ 1 � *ω λ* and *VAR X*ð Þ¼ ð Þ 1 � *ω λ*ð Þ 1 þ *ωλ* .

The ZINB model is expressed as follows:

$$P\_{\rm ZINB}(X=k) = \begin{cases} \alpha + (1-\alpha)(1+a\lambda)^{-1/a}, & \text{for } k=0\\ \alpha - \alpha \frac{\Gamma\left(k+\frac{1}{a}\right)}{\Gamma(k+1)\Gamma\left(\frac{1}{a}\right)} \left(\frac{1}{1+a\lambda}\right)^{1/a} \left(\frac{a\lambda}{1+a\lambda}\right)^k, & \text{for } k>0 \end{cases},\tag{16}$$

where *PZINB*ð Þ *X* ¼ *k* is the probability of *k* accidents occurring, *k*∈ and *λ* is the expectation value of the number of accidents. The mean value and variance of ZINB model are *E X*ð Þ¼ ð Þ 1 � *ω λ* and *VAR X*ð Þ¼ ð Þ 1 � *ω λ*ð Þ 1 þ *ωλ* þ *αλ* . The ZINB reduces to the ZIP in the limit *α* ! 0.

However, the NB and ZINB models are limited to handling under-dispersed data (*α*< 0) [11]. That is why [13] proposed the Gamma model to handle under-dispersed samples. The Gamma model is given as follows:

$$P\_G(X=k) = \operatorname{Gamma}(\beta k, \lambda) - \operatorname{Gamma}(\beta(k+1), \lambda), \tag{17}$$

where *PG*ð Þ *X* is the probability of *k* accidents occurring, *k*∈ , *λ* is the expectation of the number of accidents, and *β* is the dispersion parameter. If *β* >1, there is an under-dispersion; while *β* <1, there is an over-dispersion and if *β* ¼ 1, the Gamma model reduces to the Poisson model. However, the Gamma model shown in Eq. (18) is limited to the time-dependent observation assumption and zero observations, since general Γð Þ *x* restricts discrete responses to positive values:

$$
gamma(\beta k, \lambda) = \left\{ \frac{\mathbf{1}}{\mathbf{1}}\_{\Gamma(\beta k)}, \mathbf{1}\_{\mathbf{u} > \mathbf{1}\_{\mathbf{u}}}, \text{for } k > \mathbf{0} \right. \tag{18}
$$

According to the above discussion, the restriction between mean value and variance can be used to identify an appropriate regression model. Therefore, we firstly make preliminary variance analysis by means of group classification. Namely, the annual accidents at a given SAL2 during the 10 years were divided into 100 groups with the same number of samples in each group. Then, the variance and mean value of accidents in each group were calculated, respectively, to analyze the relationship between the group variance and the group mean value. The variance analysis shows that the variance and mean value are very close to each other. Hence, we performed meticulous analyses to assess the NLS regression, the Poisson regression, the ZIP regression, the NB regression, and the ZINB regression methods with regard to SAL2 LXs in our accident dataset so as to identify which model is more effective.

### *2.2.2 Regression modeling results*

### NLS regression:

When applying the NLS regression, the form of *λ*10*<sup>Y</sup>* is given by Eq. (3). The estimated coefficients computed by NLS regression are provided in **Table 3**. ∣*t* � *statistic*∣>1*:*96 is introduced to identify the significant parameters corresponding to a 95% confidence level. As a result, the railway speed limit, the average daily railway traffic, the average daily road traffic, the annual road accidents, the LX-accident-prone region, the road alignment, the LX width, and the crossing length have been shown to have significant and positive influence on SAL2 accident frequency. However, the test shows that the road profile is not a significant factor (∣*t* � *statistic*∣ ¼ 0*:*635< <1*:*96); thus, the impact of road profile could be neglected. Moreover, the coefficients of the considered variables with the exponential form can reflect the sensitive degrees of the SAL2 accident frequency to these variables, respectively. According to these sensitive degrees (rank indicated in brackets), the LX-accident-prone region factor is the most sensitive contributor among these variables.

In order to assess the predictive accuracy of accident occurrence estimated by the NLS regression model *λ*10*<sup>Y</sup>* combined with the NB and ZINB distributions (see Section 3.1), we adopt the maximum likelihood estimation (MLE) method to estimate the dispersion parameter *α* of the dataset [28]. As expressed by Eq. (19) and Eq. (20), the values of *α* in NB and ZINB distributions are estimated, respectively, using R language to solve *<sup>∂</sup>l=∂<sup>α</sup>* <sup>¼</sup> 0:


#### **Table 3.** *Results of the λ*10*<sup>Y</sup> NLS regression model.*

*Accident Prediction Modeling Approaches for European Railway Level Crossing Safety DOI: http://dx.doi.org/10.5772/intechopen.109865*

$$l(a)\_{\text{NB}} = \ln\left(\prod\_{i}^{n} P\_{\text{NB}}\left(\mathbf{X}\_{i} = y\_{i}\right)\right) = \sum \left(y\_{i} \ln\left(\lambda\_{i}\right) - \left(y\_{i} + a^{-1}\right) \ln\left(\mathbf{1} + a\lambda\_{i}\right) + \sum\_{v=0}^{\eta\_{i}-1} \ln\left(\mathbf{1} + a v\right)\right), \tag{19}$$

$$\begin{split} l(\boldsymbol{\alpha})\_{\text{ZINB}} &= \ln \left( \prod\_{i}^{n} P\_{\text{ZINB}} \left( \mathbf{X}\_{i} = \boldsymbol{y}\_{i} \right) \right) \\ &= \begin{cases} \sum \ln \left( o \boldsymbol{\alpha}\_{i} \right) + (\mathbf{1} - o \boldsymbol{\imath}\_{i}) \Big( \frac{1}{1 + \boldsymbol{\imath} \boldsymbol{\imath}\_{i}} \Big)^{1/a}, \text{if} \boldsymbol{\jmath}\_{i} = \mathbf{0} \\\\ \sum \ln \left( o \boldsymbol{\imath}\_{i} \right) + \ln \Gamma \Big( \frac{1}{a} + \boldsymbol{\jmath}\_{i} \Big) - \ln \Gamma \Big( \mathbf{1} + \boldsymbol{\jmath}\_{i} \big) - \ln \Gamma \Big( \frac{1}{a} \Big) . \\ &+ \frac{1}{a} \ln \left( \frac{1}{1 + a \boldsymbol{\lambda}\_{i}} \right) + \boldsymbol{\jmath}\_{i} \ln \left( 1 - \frac{\mathbf{1}}{\mathbf{1} + a \boldsymbol{\lambda}\_{i}} \right), \text{if} \boldsymbol{\jmath}\_{i} > \mathbf{0} \end{cases} \end{split} \tag{20}$$

Poisson regression:

When applying the Poisson regression, the general form of *λ*10*Poi* is given by *e* P*<sup>m</sup> j*¼1 *β*0þ*βjxj* . Therefore, we need to transform Eq. (3) into the following expression:

$$
\lambda\_{10\text{psi}} = \begin{cases}
\mathbf{0}, \text{if } F\_{R\text{Ac}} = \mathbf{0}, \text{ } V = \mathbf{0} \text{ or } T = \mathbf{0} \\
\exp\left(\mathbf{K}\_1 + \mathbf{C}\_F \times F\_{R\text{Ac}} + \mathbf{C}\_{\text{CM}} \times \mathbf{CM} + \mathbf{C}\_{\text{Profile}} \times I\_{\text{Profile}} + \mathbf{C}\_{\text{Align}} \times I\_{\text{Align}} + \mathbf{0}\right) \\
\mathbf{C}\_{\text{Wid}} \times \mathbf{Wid} + \mathbf{C}\_{\text{Leng}} \times \text{Length} + \mathbf{C}\_{\text{RSL}} \times \mathbf{RSL} + \mathbf{C}\_{\text{Rg}} \times F\_{\text{Rg}}), \text{if } F\_{\text{RAc}} \neq \mathbf{0}, \\
V \neq \mathbf{0}, \text{and } T \neq \mathbf{0} \end{cases} \tag{21}
$$

The results estimated through the Poisson regression approach are shown in **Table 4**. According to these results, being similar to the NLS case, one can notice that the road profile is not significant (∣*t* � *statistic*∣ ¼ 0*:*621< <1*:*96). On the other hand, with an exponential form, the impact of road accident factor *FRAcc* is weakened, namely the impact of *FRAcc* with an exponential form is not significant when using Poisson regression approach (∣*t* � *statistic*∣ ¼ 1*:*913<1*:*96). Furthermore, according to


**Table 4.** *Regression results of λ*10*Poi*. the sensitive degrees of these parameters with the exponential form (rank indicated in brackets), once again the LX-accident-prone region factor is the most sensitive contributor among these parameters.

NB regression:

When applying the NB regression, the general form of *λ*10*NB* is given by *e* P*<sup>m</sup> j*¼1 *β*0þ*βjxj*þ*ε* , and it still requires to be expressed by Eq. (21). The dispersion parameter *α* is estimated at 3.2394 in our study through the iterative estimation algorithm automatically. The estimated results of the NB regression are shown in **Table 5**. According to the results associated with the NB regression approach, it is

worth noticing that the road profile is still not significant

(∣*t* � *statistic*∣ ¼ 0*:*850 < < 1*:*96). One can also notice that the impact of *FRAcc* with an exponential form is not significant as well, when using the NB regression approach (∣*t* � *statistic*∣ ¼ 1*:*793 <1*:*96). Moreover, according to the sensitive degrees of these parameters with the exponential form (rank indicated in brackets), the LX-accidentprone region factor is still the most sensitive contributor among these parameters.

ZIP regression:

When applying the ZIP regression, the general form of *λ*10*ZIP* is given by *e* P*<sup>m</sup> j*¼1 *β*0þ*βjxj* , and it still requires to be expressed by Eq. (21). The estimated results of the ZIP regression are shown in **Table 6** and (for nonzero observations) and **Table 7** (for zero-inflation observations).

According to the results associated with the ZIP regression approach, it is worth noticing that, as for the nonzero related model, *FRAcc*, *IProfile*, *IAlign*, and *Leng* are not significant (<1*:*96). Moreover, according to the sensitive degrees of other significant parameters with the exponential form (rank indicated in brackets), the LX-accidentprone region factor is still the most sensitive contributor among these parameters. While as for the zero-inflation model, only the *Wid*, *RSL*, and *FReg* are significant (>1*:*96).

ZINB regression:

When applying the ZINB regression, the general form of *λ*10*ZINB* is given by

*e* P*<sup>m</sup> j*¼1 *β*0þ*βjxj*þ*ε* , and it still requires to be expressed by Eq. (21). The values of dispersion parameter *α* for nonzero observations and zero-inflation observations are estimated at


#### **Table 5.**

*Regression results of λ*10*NB.*

*Accident Prediction Modeling Approaches for European Railway Level Crossing Safety DOI: http://dx.doi.org/10.5772/intechopen.109865*


#### **Table 6.**

*Count model regression results of λ*10*ZIP*.


#### **Table 7.**

*Zero-inflation model regression results of λ*10*ZIP*.

3.8102 and 1.4069, respectively, in our study through the iterative estimation algorithm automatically. The estimated results of the ZINB regression are shown in **Table 8** (for nonzero observations) and **Table 9** (for zero-inflation observations). According to the results associated with the ZINB regression approach, it is worth noticing that, as for the nonzero related model, *CM*, *IAlign*, and *Wid* are significant (>1*:*96). One can also notice that according to the sensitive degrees of the three parameters (rank indicated in brackets), the LX width is the most sensitive contributor among them. While as for the zero-inflation model, only the *FRAcc* and *CM* are significant (>1*:*96).
