**1. Introduction**

Traffic conditions have a profound effect on the population's quality of life. According to the TomTom traffic index, in 2017 Mexico City had a travel delay of 66% compared with uncongested conditions, ranking it first in the world. The time lost was 59 min per day, or 227 h per year, with delays of about 100% in the morning and evening peaks. Of the 23 million private cars in Mexico, 72% belong to metropolitan areas [1]. As a result, those areas are a suitable choice for analyzing traffic behavior. With a population of 20,116,842 in 2010 and 0.3 cars per inhabitant (about 6,035,052 cars), the Mexico City Valley is the most crowded area of the country. A larger number of operating vehicles in a city reduces the average traveling speed and increases pollution [2, 3] and the number of car accidents [4, 5]. The zone under study in this work is located between Mexico City and Toluca, a region that is part of the Mexico City megalopolis, which makes the area a suitable candidate for analyzing traffic conditions. In this research we developed a procedure to analyze speed tendencies (by comparing histograms), to prepare speed data (by setting clusters and removing anomalies), and to model speed data for use in an application example: speed prediction. The procedure answers the following question: what is the pathway to generate new information when speed data is available?

*Procedure to Prepare and Model Speed Data Considering the Traffic Infrastructure, as Part…*

*DOI: http://dx.doi.org/10.5772/intechopen.88280*

Recently, there has been a great effort in studying and analyzing traffic data from different world locations. Travel speed is one way to measure traffic conditions, as is travel time. In [6], the travel time distribution for different kinds of roads is estimated for Beijing. The data were analyzed in 15 min intervals, and it was concluded that the best-fitting distribution depends on the congestion level and that the average travel time of all road segments (for all days) can be estimated with acceptable precision using the normal distribution (compared with the log-normal, gamma, and Weibull distributions). In [7], travel time prediction is pursued. The variables considered were flow, concentration, and higher-order auto-regression, and the conclusion was that local linear regression is preferable to global modeling. A characterization of the daily temporal variation of congestion is presented in [8], where a fitted model and live data are combined in a ten-parameter exponential smoothing equation. With the purpose of analyzing historical traffic data, a query processing method with timeline information is proposed in [9], along with an analysis of the congestion dependency along roads. The work presented in [10] estimates the average link speed from vehicles equipped with GPS and establishes the number of equipped vehicles required for that estimate.

Using traffic data to make predictions is a current challenge, one that services such as Google Maps traffic and Waze already address. The purpose in [11] is to use information from Bing Maps to analyze, visualize, and predict traffic jams in Chicago. In that work, a tool was developed to extract the roads' traffic intensity from a GIS map service, where colors represent flow intensity: red as congested, green as not congested, and yellow in between. In addition, a prediction model to correct flow intensities with logistic regression was proposed, where the independent variables were day, hour, street number, and number of pixels (red, yellow, and green). In [12], the properties of a community-driven mapping service (Waze) are characterized. Additionally, the authors discuss the use of traffic data to identify traffic accidents and potholes. In [13], a four-phase traffic approach is proposed: (1) data collection and representation, (2) traffic prediction, (3) vehicle selection for re-routing, and (4) alternative route assignment. In our work, we focus our contribution on the first two phases.

The traffic infrastructure elements (such as traffic lights, speed bumps, and potholes) involved in driving situations influence drivers' behavior, which in turn affects speed and the number of accidents. The aim in [14] is to develop statistical models to predict accidents. These models correlate highway characteristics with traffic accidents. The variables considered were classified in groups: section identifiers, cross section related, location, traffic related (e.g., the percentage of trucks on a highway section), alignment, horizontal curvature, and accidents. The regression methods used were Poisson and negative binomial. The statistically significant variables were number of lanes, horizontal curvature, speed limit, tangent length, section length, average annual daily traffic, and peak hour. In addition, accidents are predicted with equations that consider roadway elements such as average daily traffic, commercial and residential units, intersections, speed limits, lane width, and number of lanes.

The work presented in [15] classifies traffic control elements (infrastructure) into three groups according to their effect on accidents. Group 1 contains the elements that reduce the number of accidents: speed limit signs, speed-reducing devices, signalized pedestrian crossings, urban play streets, pedestrian streets, traffic-calming areas, traffic signals at intersections, bus lines and bus stops, parking control, and access control. Group 2 has no statistical effect on accidents: road markings, one-way streets, reversible lanes, traffic control for pedestrians and cyclists, priority control, and yield signs at intersections. Group 3 increases accidents: right turn on red, pedestrian crossings without signs, blinking traffic lights, and increased speed limits. According to [16], traffic control elements intended to reduce speed or simplify the road users' tasks (e.g., traffic signs) tend to reduce accidents. An obvious consequence of the presence of speed-reducing devices (humps, rumble strips, narrow road width, bollards) is an increase in travel time [17] and a decrease in the average travel speed. One of the conclusions in [15] is that the traffic control elements that reduce accidents also reduce mobility.

Traffic elements such as signs and traffic lights are important in human driving decisions. The work presented in [18] aims to determine the relevance of static road elements in driving situations using Markov logic networks (MLNs). The information considered to determine the relevance of speed limits and supplementary signs was the position in relation to lanes, vehicle type, date, time, and weather. Then, with first-order logic rules, the relevance of each was inferred. To determine the relevance of traffic lights, the following variables were considered: navigation system, environment perception, spatial relations, and the traffic light state.

The speed changes in the presence of speed bumps were analyzed in [19]. The speed limit on the streets under study is 50 km/h. The speeds measured at the bump location are as follows: about 30% of the cases show an 85th percentile speed higher than the posted speed limit, 26% lie in the range 45–50 km/h, and the rest are under 45 km/h. The 85th percentile speed (measured 20–25 m after the bump location, at the crosswalk area) tends to increase in 50% of the tested sites, with a similar result for the 50th percentile case (45%). Nevertheless, for both cases the speed change was not significant, according to the statistical analysis. Another result was obtained by comparing the speed at the bumps and 100 m away: in most sites, the 85th percentile speed decreases in the range of 1–18% (with respect to the zone without bumps). The statistical analysis concludes for both percentiles that speed values do not change significantly.

The use of cyber-physical systems in traffic is a current topic in the literature. In [20], a simulated vehicular cyber-physical system (VCPS) is designed to deliver warnings to the driver and to avoid accidents. To this end, the predicted vehicle motion/location, the driver behavior, and the road geometry were considered. Then, the short-term motion of the objective vehicle and of the surrounding vehicles is predicted. With the objective vehicle location and the traveled distance among vehicles, the collision risk is estimated, and the driver is notified. In [21], a perceptual control architecture of cyber-physical systems (CPSs) is proposed, taking a traffic incident management system as an example. Its intelligent behavior is characterized by the physical-reflex space and the cyber-virtual space. In the physical-reflex space, the sensing-actuation of the objective scenario is constructed on four levels of traffic infrastructure. In the cyber-virtual space, the decisions (through a Bayesian reasoning network) are defined according to three levels: principles, interrelated factors, and situation assessment. In [22], the potential participation of smartphones (equipped with GPS) in building a traffic information system (to inform the entire transportation network) that is part of the cyber-physical infrastructure system is discussed. In [23], a cloud-based cyber-physical system is presented, with the aim of finding fast routes for the users. The system is presented in four steps: (1) the GPS units on taxis are used as mobile sensors to measure the traffic status in the physical world; (2) the information generated by the taxis is sent to the cloud (cyber world) and mined, and then knowledge is acquired about the taxis' preferred directions and the traffic patterns on the roads; (3) the knowledge in the cloud is sent to the users over the Internet; and (4) the recommendations for a specific user are improved using that user's driving behavior and preferred routes. In [24], a short-term traffic prediction model (combining fuzzy theory with a Markov process) is presented, which is part of a vehicular cyber-physical system; the prediction results are expressed in terms of traffic flow and speed. A proper discussion about the definition of a cyber-physical system, and its relationship with transportation, is in [25].

*Sustainability in Urban Planning and Design*

From a cyber-physical system point of view, in the procedure presented in this work, the cyber part corresponds to the elements in charge of acquiring and mining data to generate knowledge and to the process that communicates that intel (information) to the users. The physical part corresponds to the user (a biological entity) and the intelligent devices (e.g., the user's smartphone, the vehicle computer) that react in response to the knowledge.

The aim of the present work is to introduce a method for analyzing speed data measured on streets where the traffic infrastructure is assumed to be the cause of low speeds. Then, we develop models and algorithms that, working with our data, allow predictions to be made. The procedure presented in this work is summarized in the following steps:

This chapter is organized as follows: Section 1 Introduction; Section 2 Method, which includes the theoretical framework (data, clusters, histograms, outliers) and the procedure (street segmentation, clustering, comparative analysis of histograms, outlier detection, mathematical models, connecting intel with users); Section 3 Results (with discussion); and Section 4 Conclusions (with future work).

recorded every 15 min, after [6]. We found this time interval to be highly efficient for incorporating relevant data while ignoring redundant information. In this way, the average travel speed on each segment was measured. Three weeks ($w_1$, $w_2$, and $w_3$) of data were considered: $w_1$ from Dec 27, 2016 to Jan 03, 2017; $w_2$ from Jan 03, 2017 to Jan 10, 2017; and $w_3$ from Jan 20, 2017 to Jan 27, 2017. Data were acquired from 6 a.m. to 11:59 p.m. (an interval of 18 h per day) and only on weekdays, i.e., between Monday and Friday.

*2.1.2 Clusters*

The *k*-means technique [26] was selected to cluster the speed data because it is easy to implement and is commonly used in distinct traffic problems [27–29]. Any of the 3 weeks could be used for clustering: since they are close in time, a similar travel speed is expected from one week to another, so we select $w_1$. In simple terms, the *k*-means technique calculates the centroid of each cluster as the mean of the data in the corresponding cluster, and the centroids are recalculated until convergence.

We apply the *k*-means technique with a number of clusters in the range 3–6; for each case we calculate the silhouette score [30], given in Eq. (1), where $a(i)$ is the average distance from $i$ to the data in the same cluster, $b(i)$ is the minimum average distance from $i$ to the data of each other cluster, and $i$ is the data index. The silhouette score is in the range −1 to +1; a value close to 1 indicates that the speed data is well matched in the selected clusters, while a value close to −1 indicates the opposite situation:

$$ss(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1}$$

*2.1.3 Histograms*

By analyzing the speed frequency, that is, by comparing speed histograms of certain locations (spatial selection) and certain times (temporal selection), we expected to find spatial and temporal relationships: the weekdays on which the speed is similar (dissimilar) and the segments where the speed is similar (dissimilar).

The metric employed to compare a pair of histograms is the Chi-Square ($\chi^2$) histogram distance [31], given in Eq. (2), where $P$ and $Q$ are the histograms to be compared and $P_i$ and $Q_i$ contain the speed frequency of bin $i$ ($i$ is the bin index, the selected bin width is 1):

$$\chi^2(P, Q) = \frac{1}{2} \sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i} \tag{2}$$

This metric has the advantage of reducing the weight of differences between bins with large counts, since in many natural histograms the difference between bins with high values is less important [31]. If the metric returns 0, there is no difference between the compared histograms; as the value becomes larger, the difference in terms of the speed frequency also becomes larger.

*2.1.4 Outliers*

We filtered the speed data using the Mahalanobis distance (MD) [32] to detect outliers, i.e., atypical speeds not belonging to normal driving behavior, since we are not interested in including this data for modeling. The MD is presented in Eq. (3),
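As an illustration of the cluster selection described under *2.1.2 Clusters*, the following is a minimal sketch, not the implementation used in this work. It assumes one-dimensional speed values, a deterministic quantile-based initialization, and the mean of $ss(i)$ over all data as the score of each candidate number of clusters; none of these choices is stated in the text. The function names (`kmeans_1d`, `mean_silhouette`, `choose_k`) and the speed values are hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's k-means on 1-D data. Returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: deterministic start at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids, labels


def mean_silhouette(values, labels):
    """Average of ss(i) = (b(i) - a(i)) / max{a(i), b(i)} over all data."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(abs(v - u) for u in other) / len(other)
                for m, other in clusters.items() if m != lab)
        top = max(a, b)
        scores.append((b - a) / top if top > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(values, candidates=range(3, 7)):
    """Score each candidate k (3 to 6 by default) and return the best one."""
    scores = {}
    for k in candidates:
        _, labels = kmeans_1d(values, k)
        scores[k] = mean_silhouette(values, labels)
    return max(scores, key=scores.get), scores


# Hypothetical speeds (km/h) with three visibly separated groups.
speeds = [10, 11, 12, 13, 30, 31, 32, 33, 60, 61, 62, 63]
best_k, scores = choose_k(speeds)
print(best_k, scores)
```

For real data, the number of distinct speed values must be at least as large as the largest candidate number of clusters.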
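The Chi-Square histogram distance of Eq. (2) can be sketched in a few lines. This is a hedged example under stated assumptions: speeds are non-negative, bins have unit width starting at 0, the histograms hold raw counts (no normalization), and bin pairs that are empty in both histograms contribute nothing to the sum. The helper names `speed_histogram` and `chi_square_distance` and the sample speeds are hypothetical.

```python
from collections import Counter


def speed_histogram(speeds, max_speed):
    """Counts per unit-width bin [0,1), [1,2), ..., [max_speed, max_speed+1).

    Speeds outside that range are ignored in this sketch.
    """
    counts = Counter(int(s) for s in speeds)
    return [counts.get(b, 0) for b in range(max_speed + 1)]


def chi_square_distance(p, q):
    """0.5 * sum_i (P_i - Q_i)^2 / (P_i + Q_i), skipping bins empty in both."""
    if len(p) != len(q):
        raise ValueError("histograms must use the same bins")
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi)
                     for pi, qi in zip(p, q) if pi + qi > 0)


# Hypothetical speeds (km/h) for the same segment on two different weekdays.
day_a = speed_histogram([10.2, 10.7, 11.1, 30.5], max_speed=40)
day_b = speed_histogram([10.9, 30.1, 30.8, 31.0], max_speed=40)
distance = chi_square_distance(day_a, day_b)
print(distance)
```

A result of 0 means that the two histograms are identical, and the distance is symmetric in its two arguments, so each pair of days or segments only needs to be compared once.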
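Outlier filtering with the Mahalanobis distance can be sketched as follows. Since Eq. (3) is not reproduced here, the sketch uses the standard definition $\mathrm{MD}(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}$, with $\boldsymbol{\mu}$ the sample mean and $S$ the sample covariance. The two features (hour of day, speed), the cutoff of 2.0, and the names `mahalanobis_distances` and `remove_outliers` are all assumptions made for illustration, not values taken from this work.

```python
import math


def mahalanobis_distances(points):
    """MD of each 2-D point from the sample mean, using the sample covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances


def remove_outliers(points, threshold):
    """Keep only the points whose MD does not exceed the (hypothetical) cutoff."""
    distances = mahalanobis_distances(points)
    return [p for p, d in zip(points, distances) if d <= threshold]


# Hypothetical (hour of day, speed in km/h) records; (10, 9) is atypically slow.
records = [(6, 40), (7, 42), (8, 38), (9, 41), (11, 39),
           (12, 43), (13, 40), (14, 41), (10, 9)]
md = mahalanobis_distances(records)
kept = remove_outliers(records, threshold=2.0)
print(kept)
```

With few records the MD is bounded by $(n-1)/\sqrt{n}$, so a cutoff has to be chosen with the sample size in mind; extreme values also inflate the covariance itself, which can hide milder outliers.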
