physical world; (2) the info generated by the taxis is sent to the cloud (cyber world) and mined, and then knowledge is acquired about the taxis' preferred directions and traffic patterns on the roads; (3) the knowledge in the cloud is sent to the users through the Internet; and (4) the recommendations for a specific user are improved using that user's driving behavior and preferred routes. In [24], a short-term traffic prediction model (combining fuzzy theory with a Markov process) is presented, which is part of a vehicular cyber-physical system; the prediction results are expressed in terms of traffic flow and speed. A proper discussion about the definition of a cyber-physical system, and its relationship with transportation, is in [25].

From a cyber-physical system point of view, in the procedure presented in this work, the cyber part corresponds to the elements in charge of acquiring and mining data to generate knowledge, together with the process that communicates that Intel to the users. The user (a biological entity) and the intelligent devices (e.g., the user's smartphone, the vehicle computer) reacting in response to the knowledge correspond to the physical part.

The aim of the present work is to introduce a method for analyzing speed data measured on streets where the traffic infrastructure is assumed to be the cause of low speeds. We then develop models and algorithms that, working with our data, make predictions possible. The procedure presented in this work is summarized in the following steps:

• Street segmentation is performed considering traffic control elements (speed bumps and traffic lights).

• Clustering speed data, validated with the silhouette metric.

• With the Chi-Square distance (χ<sup>2</sup>), the travel speed histograms of weekdays are compared and also the histograms of segments.

• Mahalanobis distance is used to detect outliers.

• Two techniques (polynomial and logistic regression) were used to develop the models that describe speed data. An algorithm for each modeling technique was developed to predict travel speed.

• Communicate the generated knowledge to the users.

This chapter is organized as follows: Section 1 Introduction; Section 2 Method, which includes theoretical frame (data, clusters, histograms, outliers) and procedure (street segmentation, clustering, comparative analysis of histograms, outlier detection, mathematical models, connecting Intel with users); Section 3 Results (with discussion); and Section 4 Conclusions (with future work).

**2. Method**

#### **2.1 Theoretical frame**

#### *2.1.1 Data*

The zone under study comprises two streets located in Lerma de Villada, Mexico: Av. Miguel Hidalgo and Av. Reolin Barejon. Data was obtained using the Google Maps Directions API. The time for a vehicle to traverse each segment was recorded every 15 min, following [6]. We found that this interval captures the relevant data while avoiding redundant information. In this way, the average travel speed on each segment was measured. Three weeks (*w*<sub>1</sub>, *w*<sub>2</sub>, and *w*<sub>3</sub>) of data were considered: *w*<sub>1</sub> from Dec 27, 2016 to Jan 03, 2017; *w*<sub>2</sub> from Jan 03, 2017 to Jan 10, 2017; and *w*<sub>3</sub> from Jan 20, 2017 to Jan 27, 2017. Data was acquired from 6 a.m. to 11:59 p.m. (an interval of 18 h per day) and only on weekdays, i.e., from Monday to Friday.

#### *Procedure to Prepare and Model Speed Data Considering the Traffic Infrastructure, as Part… DOI: http://dx.doi.org/10.5772/intechopen.88280*

#### *2.1.2 Clusters*

*Sustainability in Urban Planning and Design*

The *k*-means technique [26] was selected to cluster the speed data because it is easy to implement and is commonly used in distinct traffic problems [27–29]. Any of the 3 weeks could be clustered: since they are close in time, a similar travel speed is expected from one week to another, so we select *w*<sub>1</sub>. In simple terms, the *k*-means technique assigns each data point to the cluster with the nearest centroid, calculates the centroid of each cluster as the mean of the data in that cluster, and repeats both steps until convergence.

We apply the *k*-means technique with a number of clusters in the range 3–6; for each case we calculate the silhouette score [30], given in Eq. (1), where *i* is the data index, *a*(*i*) is the average distance from *i* to the other data in the same cluster, and *b*(*i*) is the minimum, over the other clusters, of the average distance from *i* to the data in that cluster. The silhouette score is in the range −1 to +1; a value close to +1 indicates that the speed data is well matched to the selected clusters, while a value close to −1 indicates the opposite situation:

$$\text{ss}(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{1}$$
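The clustering and validation steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes one-dimensional speed values, initializes the centroids at evenly spaced quantiles (an assumption made here for reproducibility), reports the mean of Eq. (1) over all data, and uses made-up speed values. The function names `kmeans_1d` and `silhouette` are hypothetical.

```python
from statistics import mean


def kmeans_1d(data, k, max_iter=100):
    """Plain k-means on 1-D values; returns (centroids, labels)."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumption: deterministic start at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to the nearest centroid.
        labels = [min(range(k), key=lambda c: abs(x - centroids[c])) for x in data]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, labels


def silhouette(data, labels):
    """Mean of ss(i) from Eq. (1); needs at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, (x, own) in enumerate(zip(data, labels)):
        same = [abs(x - y) for j, (y, lab) in enumerate(zip(data, labels))
                if lab == own and j != i]
        if not same:  # a singleton cluster is scored 0 by convention
            scores.append(0.0)
            continue
        a = mean(same)
        b = min(mean(abs(x - y) for y, lab in zip(data, labels) if lab == c)
                for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return mean(scores)


speeds = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0, 30.0, 31.0, 32.0]  # made-up values
scores = {k: silhouette(speeds, kmeans_1d(speeds, k)[1]) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
```

The number of clusters with the highest mean silhouette score is kept, which mirrors the selection over the range 3–6 described in the text.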

#### *2.1.3 Histograms*

By comparing the speed histograms of certain locations (spatial selection) and certain times (temporal selection), we expected to find spatial and temporal relationships, i.e., the weekdays on which the speed is similar (or dissimilar) and the segments on which the speed is similar (or dissimilar).

The metric employed to compare a pair of histograms is the Chi-Square (χ<sup>2</sup>) histogram distance [31], given in Eq. (2), where **P** and **Q** are the histograms to be compared and *P<sub>i</sub>* and *Q<sub>i</sub>* contain the speed frequency of bin *i* (*i* is the bin index; the selected bin width is 1):

$$\chi^2(\mathbf{P}, \mathbf{Q}) = \frac{1}{2} \sum\_{i} \frac{\left(P\_i - Q\_i\right)^2}{\left(P\_i + Q\_i\right)} \tag{2}$$

Because each squared difference is divided by the sum of the two counts, this metric reduces the weight of differences between bins with large counts; in many natural histograms, the difference between bins with high values is less important [31]. A result of 0 means that there is no difference between the compared histograms; the larger the result, the larger the difference in speed frequency.
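A minimal sketch of Eq. (2) is given below. It assumes that both histograms share the same bins, that a bin empty in both histograms contributes nothing to the sum, and that the bin width of 1 refers to 1 km/h; the function names and the `max_speed` parameter are hypothetical.

```python
def chi_square_distance(p, q):
    """Eq. (2): 0.5 * sum((P_i - Q_i)^2 / (P_i + Q_i)) over all bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi + qi > 0:  # assumption: a bin empty in both histograms adds 0
            total += (pi - qi) ** 2 / (pi + qi)
    return 0.5 * total


def speed_histogram(speeds, max_speed=40):
    """Counts per bin of width 1: bin i covers [i, i + 1)."""
    counts = [0] * max_speed
    for v in speeds:
        # Speeds at or above max_speed are placed in the last bin.
        counts[min(int(v), max_speed - 1)] += 1
    return counts
```

Identical histograms give a distance of 0, and the distance grows as the speed frequencies diverge.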

#### *2.1.4 Outliers*

We filtered the speed data using the Mahalanobis distance (MD) [32] to detect outliers, i.e., atypical speeds that do not belong to normal driving behavior, since we are not interested in including this data in the models. The MD is presented in Eq. (3), where *x*<sub>*i*</sub> is a vector containing the time and speed, *x̄* is a vector with the means, and *C*<sub>*x*</sub><sup>−1</sup> is the inverse of the covariance matrix:

$$\text{MD}\_{i} = \sqrt{(\mathbf{x}\_{i} - \overline{\mathbf{x}})^{T} \mathbf{C}\_{\mathbf{x}}^{-1} (\mathbf{x}\_{i} - \overline{\mathbf{x}})} \tag{3}$$
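Eq. (3) can be sketched for two-dimensional (time, speed) vectors as follows. This is an illustration under stated assumptions: the sample covariance (divisor *n* − 1) is used because the text does not specify the estimator, and the function name `mahalanobis_2d` is hypothetical.

```python
from math import sqrt
from statistics import mean


def mahalanobis_2d(points):
    """MD_i of Eq. (3) for a list of (time, speed) pairs."""
    n = len(points)
    mt = mean(t for t, _ in points)
    ms = mean(s for _, s in points)
    # Entries of the 2x2 covariance matrix C_x (assumption: divisor n - 1).
    ctt = sum((t - mt) ** 2 for t, _ in points) / (n - 1)
    css = sum((s - ms) ** 2 for _, s in points) / (n - 1)
    cts = sum((t - mt) * (s - ms) for t, s in points) / (n - 1)
    det = ctt * css - cts ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a symmetric 2x2 matrix.
    itt, iss, its = css / det, ctt / det, -cts / det
    distances = []
    for t, s in points:
        dt, ds = t - mt, s - ms
        distances.append(sqrt(itt * dt * dt + 2 * its * dt * ds + iss * ds * ds))
    return distances
```

Because the covariance matrix rescales both axes, time (in hours) and speed (in km/h) contribute on a comparable footing despite their different units.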

#### **2.2 Procedure**

#### *2.2.1 Street segmentation*

The avenues under study were divided into segments; each segment is denoted *s*<sub>*k*</sub>, with *k* as the segment index. Each segment is characterized by its number of speed bumps *c*<sub>1</sub>, its number of traffic lights *c*<sub>2</sub>, its landmarks *c*<sub>3</sub>, and its length *l*, which is set to approximately 500 m: *s*<sub>*k*</sub> = {*c*<sub>1</sub>, *c*<sub>2</sub>, *c*<sub>3</sub>, *l*}, as shown in **Table 1**.

| *s*<sub>*k*</sub> | *c*<sub>1</sub> | *c*<sub>2</sub> | *c*<sub>3</sub> | *l* (m) | GPS start coordinate | GPS end coordinate |
|---|---|---|---|---|---|---|
| *s*<sub>0</sub> | 2 | 0 | None | 501 | 19.284512, −99.500927 | 19.285725, −99.505498 |
| *s*<sub>1</sub> | 2 | 0 | School, museum, gas station, government offices | 500 | 19.285725, −99.505498 | 19.286330, −99.510221 |
| *s*<sub>2</sub> | 0 | 1 | Banks, center square, school, fast-food restaurants | 500 | 19.286330, −99.510221 | 19.286711, −99.514964 |
| *s*<sub>3</sub> | 3 | 2 | Cultural center, hospital, school offices, kindergarten | 501 | 19.286711, −99.514964 | 19.286477, −99.519630 |
| *s*<sub>4</sub> | 3 | 0 | Telecom company offices, shopping mall | 499 | 19.286477, −99.519630 | 19.285784, −99.514944 |
| *s*<sub>5</sub> | 2 | 1 | Hospital, government offices, cultural forum | 500 | 19.285784, −99.514944 | 19.284943, −99.510282 |
| *s*<sub>6</sub> | 4 | 1 | School, supermarket, hospital | 500 | 19.284943, −99.510282 | 19.284500, −99.505561 |
| *s*<sub>7</sub> | 2 | 0 | None | 481 | 19.284500, −99.505561 | 19.284403, −99.500993 |

**Table 1.** *Segments' characteristics.*

#### *2.2.2 Clustering*

The silhouette score is highest with three clusters, *ss* = 0.7360. For four, five, and six clusters, we obtained *ss* = 0.7331, *ss* = 0.7194, and *ss* = 0.7105, respectively. As we were interested in communicating the speed category in a simple way, three options (slow, medium, and normal) seem adequate. Google Maps (traffic option) follows a similar approach, where the speed is represented with four options, from fast to slow.

The resulting average speed range (in km/h) of each cluster (or category) is category 1 (5.4112–18.1455), category 2 (18.1455–23.4234), and category 3 (23.4234–36.0750). For *w*<sub>2</sub> and *w*<sub>3</sub>, values smaller than 5.4112 fall into category 1, and those larger than 36.0750 fall into category 3.

The percentage of each segment's speed data (from *w*<sub>1</sub>) in each cluster is shown in **Table 2**. For every segment, a single cluster contains a high percentage of the data (at least 79.61%), which validates our clustering results.

| Segment | Cluster 1 (%) | Cluster 2 (%) | Cluster 3 (%) |
|---|---|---|---|
| *s*<sub>0</sub> | 0.55 | 9.9 | 89.53 |
| *s*<sub>1</sub> | 11.84 | 80.71 | 7.43 |
| *s*<sub>2</sub> | 88.98 | 11.01 | 0 |
| *s*<sub>3</sub> | 93.66 | 6.33 | 0 |
| *s*<sub>4</sub> | 14.04 | 85.95 | 0 |
| *s*<sub>5</sub> | 79.61 | 20.38 | 0 |
| *s*<sub>6</sub> | 83.47 | 16.52 | 0 |
| *s*<sub>7</sub> | 4.13 | 3.85 | 92.01 |

**Table 2.** *Percentage of speed data in a cluster.*

#### *2.2.3 Comparative analysis of histograms*

First, we consider all segments as a single road and compare, in pairs, the histograms of the speed frequency (from 6 a.m. to 11:59 p.m.) of the weekdays in *w*<sub>1</sub>, using the Chi-Square metric presented in Eq. (2). The results are shown in **Table 3**, starting with the lowest χ<sup>2</sup> value, i.e., with the most similar pair of weekday histograms, with D1 = Monday, D2 = Tuesday, and so on.

Second, the speed data of each individual segment, over the 5 weekdays of *w*<sub>1</sub>, was used to build one histogram of the speed frequency per segment. These histograms were compared in pairs with the χ<sup>2</sup> distance. **Table 4** shows the results, starting with the lowest χ<sup>2</sup>. We found that if the compared segments share similar traffic elements, the speed frequency is also similar, and therefore a low χ<sup>2</sup> is obtained.

| D2–D3 | D4–D5 | D3–D4 | D2–D4 | D1–D3 | D1–D2 | D1–D5 | D3–D5 | D2–D5 | D1–D4 |
|---|---|---|---|---|---|---|---|---|---|
| 7.77 | 10.758 | 10.936 | 11.139 | 14.097 | 15.168 | 16.347 | 16.827 | 17.609 | 20.653 |

**Table 3.** *Chi-Square distance between histograms with weekdays' data.*

| *s*<sub>2</sub>–*s*<sub>5</sub> | *s*<sub>3</sub>–*s*<sub>6</sub> | *s*<sub>5</sub>–*s*<sub>6</sub> | *s*<sub>1</sub>–*s*<sub>4</sub> | *s*<sub>2</sub>–*s*<sub>6</sub> | *s*<sub>0</sub>–*s*<sub>7</sub> | *s*<sub>2</sub>–*s*<sub>3</sub> |
|---|---|---|---|---|---|---|
| 20.37 | 34.17 | 39.01 | 45.57 | 46.05 | 73.75 | 88.591 |
| ***s*<sub>3</sub>–*s*<sub>5</sub>** | ***s*<sub>4</sub>–*s*<sub>5</sub>** | ***s*<sub>4</sub>–*s*<sub>6</sub>** | ***s*<sub>1</sub>–*s*<sub>5</sub>** | ***s*<sub>2</sub>–*s*<sub>4</sub>** | ***s*<sub>3</sub>–*s*<sub>4</sub>** | ***s*<sub>1</sub>–*s*<sub>6</sub>** |
| 98.59 | 167.65 | 185.37 | 198.5 | 212.47 | 220.04 | 226.01 |
| ***s*<sub>1</sub>–*s*<sub>2</sub>** | ***s*<sub>0</sub>–*s*<sub>1</sub>** | ***s*<sub>1</sub>–*s*<sub>3</sub>** | ***s*<sub>1</sub>–*s*<sub>7</sub>** | ***s*<sub>0</sub>–*s*<sub>4</sub>** | ***s*<sub>4</sub>–*s*<sub>7</sub>** | ***s*<sub>6</sub>–*s*<sub>7</sub>** |
| 238.39 | 269.15 | 271.36 | 303.87 | 321.89 | 337.16 | 337.43 |
| ***s*<sub>2</sub>–*s*<sub>7</sub>** | ***s*<sub>5</sub>–*s*<sub>7</sub>** | ***s*<sub>0</sub>–*s*<sub>5</sub>** | ***s*<sub>0</sub>–*s*<sub>6</sub>** | ***s*<sub>0</sub>–*s*<sub>2</sub>** | ***s*<sub>3</sub>–*s*<sub>7</sub>** | ***s*<sub>0</sub>–*s*<sub>3</sub>** |
| 339.58 | 340.24 | 342.78 | 344.59 | 346.80 | 350.76 | 356.21 |

**Table 4.** *Chi-Square distance between histograms with segments data.*


**Figure 1.** *Dissimilar histograms (s<sub>0</sub> and s<sub>3</sub>).*

**Figure 2.** *Similar histograms (s<sub>2</sub> and s<sub>5</sub>).*

**Figure 1** shows the most dissimilar histograms, *s*<sub>0</sub> and *s*<sub>3</sub>. **Table 1** shows that *s*<sub>0</sub> has two speed bumps and no traffic lights, while *s*<sub>3</sub> has three speed bumps and two traffic lights; because of the traffic lights on *s*<sub>3</sub>, we expect a lower speed in this segment, and this expectation is corroborated by **Figure 1**.

**Figure 2** shows the most similar histograms, *s*<sub>2</sub> and *s*<sub>5</sub>. Both segments have the same number of traffic lights; however, there are two speed bumps in *s*<sub>5</sub> and none in *s*<sub>2</sub>, so a slightly higher speed is expected in *s*<sub>2</sub> (see **Figure 2**).


**Tables 3** and **4** show that comparing the histograms of individual days (over all segments) yields lower χ<sup>2</sup> values (from 7.77 to 20.653) than comparing the histograms of individual segments (over all days), where the values range from 20.37 to 356.21. It therefore appears that the travel speed is weakly influenced by the day of the week, since the traffic control elements of the whole road are the same from day to day. However, the segment seems to influence the travel speed strongly, since the traffic control elements that characterize each segment modify the speed at which it is possible to travel.

To corroborate this statement, we use the speed frequency of *w*<sub>2</sub>. **Figure 3** shows the histograms of the speed frequency of each day (over all segments), where the similarity of the histograms can be observed. **Figure 4** shows the histograms of the speed frequency of each segment (over all days), where their dissimilarities can be observed.

#### *2.2.4 Outlier detection*

As an example, the speed data of *s*<sub>0</sub> in *w*<sub>1</sub> is presented in **Figure 5**. We calculate the MD of this data, and the probability density of the MD is presented in **Figure 6**; it has mean = 1.2331 and standard deviation SD = 0.6894. A point in **Figure 6** with MD > (2 × SD + mean) = 2.6119 corresponds to a red point in **Figure 5** and is considered an atypical point. The limit (2 × SD + mean) was established through trial and error.

The speed data from *w*<sub>1</sub> and *w*<sub>2</sub>, for all segments, is filtered in the same way as in the example. The data used in the polynomial regression satisfy MD ≤ (2 × SD + mean), and the data used in the logistic regression satisfy MD ≤ (3 × SD + mean).
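The filtering rule can be sketched as below, taking a list of MD values as input. This is only an illustration: the text does not say whether the SD is the population or the sample standard deviation, so the population form is assumed here, and the names `md_limit` and `keep_typical` are hypothetical.

```python
from statistics import mean, pstdev


def md_limit(mean_md, sd_md, k=2.0):
    """Upper limit k * SD + mean used to flag atypical points."""
    return k * sd_md + mean_md


def keep_typical(md, k=2.0):
    """True for points with MD <= k * SD + mean (kept for modeling)."""
    # Assumption: population standard deviation of the MD values.
    limit = md_limit(mean(md), pstdev(md), k)
    return [d <= limit for d in md]
```

With the values reported for *s*<sub>0</sub> and *w*<sub>1</sub>, `md_limit(1.2331, 0.6894)` reproduces the limit 2.6119; passing `k=3.0` gives the looser filter used for the logistic regression.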

**Figure 3.** *Speed frequency of days.*


**Figure 4.** *Speed frequency of segments.*

**Figure 5.** *Time vs. speed: Data of s<sub>0</sub> and w<sub>1</sub>.*

#### *2.2.5 Mathematical models*

**Polynomial**: The data of each segment, with time as the independent variable and travel speed as the dependent variable, is modeled with a fifth-degree polynomial, which allows up to four changes in the speed trend (the number commonly required by the observations). The coefficients are calculated with the least-squares regression technique.


**Figure 6.** *Mahalanobis distance vs. probability density.*

The following terminology is used to describe the model: the data size of all segments is *N* = ∑<sub>*k*=0</sub><sup>7</sup> *N*<sub>*k*</sub>, with *N*<sub>*k*</sub> being the data size of segment *k*. The *i*-th observed speed is denoted by *y*(*i*), and the corresponding time by *t*(*i*). The speed model of segment *k* and week *q* is denoted by *M*<sub>*k*</sub><sup>*q*</sup>(*i*), with *k* = {0, 1, 2, 3, 4, 5, 6, 7} and *q* = {1, 2}. The model is presented in Eq. (4), where the coefficients *φ*<sub>1</sub>…*φ*<sub>6</sub> were calculated with the speed data of the corresponding week (*q*) and segment (*k*):

$$M\_k^q(i) = \varphi\_1 + \varphi\_2 t(i) + \varphi\_3 t(i)^2 + \varphi\_4 t(i)^3 + \varphi\_5 t(i)^4 + \varphi\_6 t(i)^5 \tag{4}$$
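A least-squares fit of Eq. (4) can be sketched with the normal equations, as shown below. This is a simplified illustration and not the code used in this work: the function names are hypothetical, at least degree + 1 distinct time values are assumed, and for a fifth-degree fit it is advisable to rescale time (for example to the interval [0, 1]) because the normal equations become poorly conditioned when *t* is expressed in hours.

```python
def polyfit(ts, ys, degree=5):
    """Least-squares coefficients [phi_1, ..., phi_(degree+1)], lowest power first."""
    m = degree + 1
    # Normal equations (A^T A) phi = A^T y for the Vandermonde matrix A.
    ata = [[sum(t ** (r + c) for t in ts) for c in range(m)] for r in range(m)]
    aty = [sum(y * t ** r for t, y in zip(ts, ys)) for r in range(m)]
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    phi = [0.0] * m
    for r in range(m - 1, -1, -1):
        known = sum(aug[r][c] * phi[c] for c in range(r + 1, m))
        phi[r] = (aug[r][m] - known) / aug[r][r]
    return phi


def polyval(phi, t):
    """Evaluate the polynomial of Eq. (4) at time t with Horner's rule."""
    result = 0.0
    for coef in reversed(phi):
        result = result * t + coef
    return result
```

One model per segment and week would be obtained by calling `polyfit` on the filtered (time, speed) data of that segment and week.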

**Multinomial logistic**: The numbers of speed bumps and traffic lights (see **Table 1**) are used to explain the speed. With multinomial logistic regression [33], we obtained the logistic model presented in Eq. (5), with *ψ* = {*a*, *b*}:

$$E\_{\psi}^{q}(i) = \psi\_1 + \psi\_2 v\_1(i) + \psi\_3 v\_2(i) + \psi\_4 v\_3(i) + \psi\_5 v\_4(i) + \psi\_6 v\_5(i) \tag{5}$$

The coefficients are denoted by *ψ*<sub>1</sub>…*ψ*<sub>6</sub>, and *q* = {1, 2} refers again to the data from *w*<sub>1</sub> and *w*<sub>2</sub>, respectively. The explanatory variables are *v*<sub>1</sub> = day weight, *v*<sub>2</sub> = number of speed bumps, *v*<sub>3</sub> = number of traffic lights, *v*<sub>4</sub> = segment weight, and *v*<sub>5</sub> = time. The weight of a specific day is calculated as the average speed of that day (measured from 6 a.m. to 11:59 p.m.) divided by the sum of the average speeds of all weekdays. A segment's weight is calculated as the segment's average speed (during weekdays) divided by the sum of the average speeds of all segments.
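The day and segment weights follow the same normalization, which can be sketched as below. The function name `weights` and the example averages are hypothetical and are not values from this study.

```python
def weights(average_speeds):
    """Divide each average speed by the sum of all averages."""
    total = sum(average_speeds.values())
    return {key: value / total for key, value in average_speeds.items()}


# Made-up daily averages in km/h, for illustration only.
day_weights = weights({"Mon": 20.0, "Tue": 30.0})
```

By construction the weights of all days (or of all segments) sum to 1, so *v*<sub>1</sub> and *v*<sub>4</sub> express relative rather than absolute speed levels.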

In Eq. (5), *E*<sub>*a*</sub><sup>*q*</sup> gives the logarithm of the relative risk of being in cluster 1 vs. cluster 3 (the reference), and *E*<sub>*b*</sub><sup>*q*</sup> gives the same quantity for cluster 2 vs. cluster 3. The conversion to probability is given in Eqs. (6)–(8), where *R*<sub>*j*</sub><sup>*q*</sup> is the probability of belonging to category *j*, with *j* = {1, 2, 3}:

$$R\_1^{q}(i) = e^{E\_a^{q}(i)} / \left(1 + e^{E\_a^{q}(i)} + e^{E\_b^{q}(i)}\right) \tag{6}$$

$$R\_2^{q}(i) = e^{E\_b^{q}(i)} / \left(1 + e^{E\_a^{q}(i)} + e^{E\_b^{q}(i)}\right) \tag{7}$$

$$R\_3^{q}(i) = 1 - \left(R\_1^{q}(i) + R\_2^{q}(i)\right) \tag{8}$$
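Eqs. (5)–(8) can be sketched as follows. The coefficient values used in any call are placeholders supplied by the caller, not the fitted values of this work, and the function names are hypothetical.

```python
from math import exp


def linear_predictor(psi, v):
    """Eq. (5): psi_1 + psi_2 * v_1 + ... + psi_6 * v_5."""
    if len(psi) != len(v) + 1:
        raise ValueError("need one more coefficient than explanatory variables")
    return psi[0] + sum(p * x for p, x in zip(psi[1:], v))


def category_probabilities(e_a, e_b):
    """Eqs. (6)-(8): probabilities of categories 1, 2, and 3 (the reference)."""
    denominator = 1.0 + exp(e_a) + exp(e_b)
    r1 = exp(e_a) / denominator
    r2 = exp(e_b) / denominator
    r3 = 1.0 - (r1 + r2)
    return r1, r2, r3
```

When both linear predictors are 0, the three categories are equally likely; a large positive *E*<sub>*a*</sub><sup>*q*</sup> pushes the probability toward category 1.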


#### *2.2.6 Connecting Intel with users*

With the developed procedure, knowledge is acquired about the speed at which a vehicle is expected to travel on the segments under study. The architecture design (and the implementation) needed to connect the Intel with the users is out of the scope of this work (it is planned as future work); nevertheless, we present the basic idea in this section.

MAE, and its standard deviation (SD), with the data of *w*<sup>1</sup> and *w*2, and the respec-

*Procedure to Prepare and Model Speed Data Considering the Traffic Infrastructure, as Part…*

An algorithm (Appendix A, Algorithm 1) is designed to predict the speed of *w*<sup>3</sup>

**Figure 7** shows, as example, the observed speed data (in black circles) of *w*<sup>3</sup> and

from *w*<sup>3</sup> before the current time. The error between the observed (from *w*3) and predicted (with Algorithm 1) travel speeds is calculated with Eq. (9). The MAE, SD, and hits (percentage of data categorized correctly) for *w*3, using Algorithm 1, are

in green dots), and the estimated speed with Algorithm 1 (in red plus signs).

Algorithm 2 (see Appendix B) is used to predict the speed category of the observed data from *w*3. *H*1ð Þ*i* and *H*2ð Þ*i* are two data sets obtained from *w*<sup>1</sup> and *w*2, respectively. These sets save the associated category of the average speed in a time interval from *t*(*i*)*-0.5* to *t*(*i*) *+ 0.5* (0.5 h = 30 min) and centered on *t*(*i*), of the day and segment under evaluation. *H*3ð Þ*i* is the category speed of *w*<sup>3</sup> (which is only available for previous data, i.e., prior to (*i*), with *i…N* being the data index. The

category is stored in *Sq*ðÞ¼ *i x*, where subindex *q =* {*1,2*} refers to the week. A *threshold* value, selected through trial and error, is used to discard the result in *Sq*ð Þ*i* if *Pq*ð Þ*i* < *threshold*. Algorithm 2 predicts the speed category for *w*3, which is stored in *S*3ð Þ*i* . Choosing *threshold = 0.9* gives 90.09% of correct evaluations. This percentage is the summation of cases, where *S3*(*i*) was categorized correctly divided by the total

Afterward, we attempted to predict the speed category of the observed speed in *w*<sup>3</sup> under the assumption that set *H*2ð Þ*i* is composed only with the average speed of each segment, and not including *H*1. The optimum result was found if *threshold =* �*0.85*, with 85.62% of correct predictions. If the *threshold* value is reduced, the positive prediction decreases (because the model fails to predict accurately with that *threshold* value). Similarly, if the *threshold* is increased, it becomes more

**Segment MAE (km/h) SD (km/h) Hits (%)** *s*<sup>0</sup> 0.7757 0.7142 92.4 *s*<sup>1</sup> 0.9353 1.0061 85.2 *s*<sup>2</sup> 0.8641 0.7537 89.5 *s*<sup>3</sup> 0.7749 0.7658 94.8 *s*<sup>4</sup> 1.0053 1.0903 85.2 *s*<sup>5</sup> 0.8051 0.7942 93.5 *s*<sup>6</sup> 0.6968 0.6639 96.5 *s*<sup>7</sup> 0.7994 0.9391 93.5

<sup>1</sup> ð Þ*<sup>i</sup> ; <sup>R</sup><sup>q</sup>*

<sup>2</sup>ð Þ*<sup>i</sup> ; <sup>R</sup><sup>q</sup>* <sup>3</sup>ð Þ*<sup>i</sup>* � � <sup>¼</sup> *Rq*

j j *y i*ðÞ� ^*y i*ð Þ (9)

*<sup>k</sup>*) and historical data, i.e., the data available

0, in blue dots) and *w*<sup>2</sup> (model *M*<sup>2</sup>

0,

*<sup>x</sup>*ð Þ*i* and the

*MAE* <sup>¼</sup> <sup>1</sup> *n* X*n i*¼1

*<sup>k</sup>* and *M*<sup>2</sup>

tive modeled equations:

shown in **Table 6**.

data N.

**Table 6.**

**117**

*Algorithm 1 prediction results: MAE, SD, and hits.*

using the modeled equations (*M*<sup>1</sup>

*DOI: http://dx.doi.org/10.5772/intechopen.88280*

segment *s*0, the modeled data with *w*<sup>1</sup> (model *M*<sup>1</sup>

probability most likely to occur is *Pq*ðÞ¼ *<sup>i</sup>* max *Rq*

**3.2 Multinomial logistic regression model and Algorithm 2**

The algorithms developed (see Appendix A and Appendix B) were programmed on a standard computer. Following the procedure presented, the data acquired from the zone under study are modeled, and the models are used in the algorithms to generate knowledge. The link between this knowledge and the users could be established through a cell phone app (via the Internet). When a driver is near a street segment, the cell phone (with GPS) detects the current location and retrieves information for the driver, such as the number of speed bumps and traffic lights and the expected travel speed calculated with the proposed algorithms. This information is presented in a way that does not distract the driver, who can then choose the most convenient route. A more challenging design is to let the cell phone communicate with the vehicle (assuming the vehicle has an intelligent system that can control some functions) so that, for example, the vehicle decelerates automatically when approaching a speed bump if the driver does not react adequately.

The program running on a computer, which is in charge of acquiring and mining data to generate knowledge and of communicating with the responsive elements, forms the "cyber" part of the system. The elements that react intelligently to the Intel delivered, such as the driver, the cell phone, and the vehicle, form the "physical" part of the system. Together, the cyber and physical parts form a cyber-physical system.

#### **3. Results**

#### **3.1 Polynomial regression model and Algorithm 1**

| Segment | $M_k^1$ MAE (km/h) | $M_k^1$ SD (km/h) | $M_k^2$ MAE (km/h) | $M_k^2$ SD (km/h) |
|---|---|---|---|---|
| $s_0$ | 0.8269 | 0.6802 | 0.7895 | 0.6321 |
| $s_1$ | 1.0101 | 0.9630 | 1.1939 | 0.9916 |
| $s_2$ | 0.9523 | 0.7622 | 1.1754 | 1.0670 |
| $s_3$ | 0.6198 | 0.4628 | 0.6701 | 0.5917 |
| $s_4$ | 0.8435 | 0.6882 | 0.9765 | 0.7416 |
| $s_5$ | 0.8971 | 0.6826 | 0.9234 | 0.7408 |
| $s_6$ | 0.8438 | 0.7297 | 0.7737 | 0.6680 |
| $s_7$ | 0.9376 | 1.0762 | 0.7894 | 0.6653 |

**Table 5.** *MAE and SD.*

The error between the data modeled with Eq. (4) and the observed data was calculated with the mean absolute error (MAE) (see Eq. (9)) [34], where $n = N_k$, and $y(i)$ and $\hat{y}(i)$ are the observed and modeled data, respectively. **Table 5** shows the MAE and its standard deviation (SD) for the data of $w_1$ and $w_2$ and their respective modeled equations. The MAE is defined as:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y(i) - \hat{y}(i) \right| \tag{9}$$

An algorithm (Appendix A, Algorithm 1) was designed to predict the speed of $w_3$ using the modeled equations ($M_k^1$ and $M_k^2$) and historical data, i.e., the data available from $w_3$ before the current time. The error between the observed (from $w_3$) and predicted (with Algorithm 1) travel speeds is calculated with Eq. (9). The MAE, SD, and hits (percentage of data categorized correctly) for $w_3$, using Algorithm 1, are shown in **Table 6**.
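The error measures reported for Algorithm 1 can be illustrated with a short sketch. It is not the authors' implementation: the speed values and the category boundaries in `edges` are hypothetical, and the SD is assumed here to be the population standard deviation of the absolute errors, which the text does not state explicitly.

```python
import numpy as np


def mae_and_sd(observed, predicted):
    """Return the mean absolute error (Eq. 9) and the SD of the absolute errors."""
    abs_err = np.abs(np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float))
    # Assumption: population SD (ddof=0) of the absolute errors.
    return abs_err.mean(), abs_err.std()


def hit_rate(observed, predicted, edges):
    """Percentage of samples whose predicted speed falls in the observed speed's category."""
    obs_cat = np.digitize(observed, edges)
    pred_cat = np.digitize(predicted, edges)
    return 100.0 * np.mean(obs_cat == pred_cat)


# Hypothetical travel speeds (km/h), for illustration only.
observed = [18.0, 22.5, 31.0, 27.4]
predicted = [19.0, 21.5, 29.0, 27.4]
# Assumed category boundaries: below 20, 20 to 30, and 30 or above.
edges = [20.0, 30.0]

mae, sd = mae_and_sd(observed, predicted)
hits = hit_rate(observed, predicted, edges)
print(f"MAE = {mae:.4f} km/h, SD = {sd:.4f} km/h, hits = {hits:.1f}%")
```

With these made-up values, three of the four predictions fall in the same category as the observation, so a prediction can count as a hit even when its absolute error is not zero.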

**Figure 7** shows, as an example, the observed speed data of $w_3$ for segment $s_0$ (black circles), the data modeled with $w_1$ (model $M_0^1$, blue dots) and $w_2$ (model $M_0^2$, green dots), and the speed estimated with Algorithm 1 (red plus signs).

#### **3.2 Multinomial logistic regression model and Algorithm 2**

Algorithm 2 (see Appendix B) is used to predict the speed category of the observed data from $w_3$. $H_1(i)$ and $H_2(i)$ are two data sets obtained from $w_1$ and $w_2$, respectively. These sets store the category associated with the average speed in a time interval from $t(i) - 0.5$ to $t(i) + 0.5$ (0.5 h = 30 min), centered on $t(i)$, for the day and segment under evaluation. $H_3(i)$ is the speed category of $w_3$ (which is only available for previous data, i.e., prior to $i$), with $i = 1, \ldots, N$ being the data index. The highest category probability is $P_q(i) = \max\left(R_1^q(i), R_2^q(i), R_3^q(i)\right) = R_x^q(i)$, and the corresponding category is stored in $S_q(i) = x$, where the subindex $q = \{1, 2\}$ refers to the week. A *threshold* value, selected through trial and error, is used to discard the result in $S_q(i)$ if $P_q(i) < \mathit{threshold}$. Algorithm 2 predicts the speed category for $w_3$, which is stored in $S_3(i)$. Choosing *threshold* = 0.9 gives 90.09% of correct evaluations. This percentage is the number of cases in which $S_3(i)$ was categorized correctly, divided by the total number of data points $N$.
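The selection step described above (take the most probable category, then discard it when its probability is below the threshold) can be sketched as follows. This covers only that step, not the full Algorithm 2 of Appendix B. The function name `select_categories`, the probability values, and the use of 0 to mark a discarded result are assumptions made for illustration.

```python
import numpy as np


def select_categories(probs, threshold):
    """Pick the most probable category per sample and discard low-confidence picks.

    probs: array of shape (N, 3); row i holds R_1(i), R_2(i), R_3(i) for one week q.
    Returns (S, P): S(i) is the 1-based category x, or 0 if P(i) < threshold
    (0 is a hypothetical marker for "discarded"); P(i) is the maximum probability.
    """
    probs = np.asarray(probs, dtype=float)
    P = probs.max(axis=1)
    S = probs.argmax(axis=1) + 1
    S[P < threshold] = 0
    return S, P


# Hypothetical category probabilities for three samples.
probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.04, 0.04, 0.92],
]
S, P = select_categories(probs, threshold=0.9)
print(S.tolist())
```

In this example the second sample is discarded because its highest probability (0.40) is below the threshold, while the other two keep categories 1 and 3.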

Afterward, we attempted to predict the speed category of the observed speed in $w_3$ under the assumption that the set $H_2(i)$ is composed only of the average speed of each segment, without including $H_1$. The best result was found with *threshold* = 0.85, which gave 85.62% of correct predictions. If the *threshold* value is reduced, the percentage of correct predictions decreases (because the model fails to predict accurately with that *threshold* value). Similarly, if the *threshold* is increased, it becomes more



#### *2.2.6 Connecting Intel with users*

With the developed procedure, knowledge is acquired about the expected travel speed on the segments under study. The architecture design (and its implementation) to connect the Intel with the users is outside the scope of this work (it is planned as future work); nevertheless, this section presents the basic idea.


speed behavior is related to the traffic elements involved. It was observed that the speed histograms of two segments have a low Chi-Square distance if the segments share approximately the same number of speed bumps, traffic lights, and landmarks, regardless of the day of the week. A high Chi-Square distance implies the opposite situation, i.e., segments with different numbers of traffic elements. The fourth step, *outlier detection*, removes atypical speed behavior, e.g., a vehicle circulating slower or faster than usual. In step five, *mathematical models*, the models explain the speed. From steps 2 and 3, it is already known that speed on each segment behaves according to the traffic elements involved; hence, the speed data of each segment are modeled independently with a polynomial model, with time as the independent variable. The multinomial logistic model uses as independent variables the number of speed bumps, the number of traffic lights, the time, and two weights. The weights are calculated from the average of the measured travel speeds over segments and days. Finally, in step 6, *connecting Intel with users*, the drivers are informed about the travel speed expected on the surrounding segments, helping them to continuously adjust their route.

#### **4. Conclusions**

The procedure presented in this chapter proposes street segmentation; on each segment, there are traffic elements that we infer may be related to the observed speed frequency. By comparing speed histograms, we found that the speed frequency of all segments is similar among weekdays, and that the speed frequency of a specific segment is similar regardless of the day. Considering the speed frequency of all weekdays and of individual segments, segments with different traffic elements (speed bumps, traffic lights, and landmarks) yield dissimilar travel speeds. From this observation, two techniques were considered for modeling speed: (1) polynomial regression, where the data of each segment are modeled independently, using time as the independent variable, and (2) logistic regression, with several independent variables: the number of speed bumps and traffic lights, the time, and two weights (from the observed speeds on street segments and weekdays). The models were implemented in algorithms, which use the modeled and historical data. With the polynomial model and Algorithm 1, the travel speed was categorized correctly in 85.2 to 96.5% of the cases, depending on the segment. The multinomial logistic model and Algorithm 2 correctly predict the speed category in 90.09% of the evaluated cases. With these results, we conclude that the proposed procedure is suitable to prepare and model speed data and then to predict the speed category at a low computational cost. The procedure is useful to establish the relationship between traffic infrastructure and travel speed.

#### **4.1 Future work**

We contemplate as future work the development of the architecture to communicate the expected travel speed (obtained with the proposed procedure) to the users, as well as the conversion of this knowledge into suggestions and decision-making.

#### **Appendix A**

In Algorithm 1, if $i \leq \mathit{deep}$ (line 3), the modeled speeds of $w_1$ and $w_2$ contribute equally (each multiplied by 0.5). The case $i \geq \mathit{deep} + 1$ (line 6) enables the

**Figure 7.** *Time vs. speed: data of* $s_0$ *and* $w_3$*.*

difficult to satisfy the condition $P_q(i) \geq \mathit{threshold}$, and then the positive prediction also drops because now the set $H_2(i)$ (with the limitation mentioned before) contributes more. **Table 7** shows the percentage of speed data categorized correctly with different threshold values.
