**Demand Allocation in Water Distribution Network Modelling – A GIS-Based Approach Using Voronoi Diagrams with Constraints**

Nicolai Guth and Philipp Klingel

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50014

## **1. Introduction**

282 Application of Geographic Information Systems


In water distribution network modelling the topology of the network is commonly mapped as a (directed) graph consisting of nodes and links (Walski et al., 2001). Pipe sections with constant parameters are modelled as links. Intersections, water inlets, points of water withdrawal and locations of pipe parameter changes are modelled as nodes. Coordinates related to the nodes define the spatial location. To calculate the hydraulic system state (pressures and flows) the water demand of the consumers is assigned to the relevant nodes as the driving parameter (demand-driven simulation). Figure 1 schematically shows the abstraction of a water distribution network.

© 2012 Guth and Klingel, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

To reduce the model size typically not every known demand (e.g. demand at a house connection) is modelled as a node. Besides, location and demand of consumers are often not known, especially in developing countries.


To allocate demands to the nodes, Voronoi diagrams can be used. The Voronoi diagram is also known as a Thiessen polygon and is dual to the Delaunay triangulation (cf. fig. 2, left sketch). The Voronoi diagram is based on a set of nodes and defines one area for each node such that every point within the area is closer to the originating node of the area than to any other node. In the following, these areas are referred to as Voronoi regions. They represent the catchment areas of the nodes. The borders of the Voronoi regions consist of lines called bi-sectors. Thus, known demands (value and location) can be allocated to the closest nodes according to the Euclidean distance. This approach is also commonly applied to determine, for example, the rainfall catchment areas of the inlets of sewer networks.
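The nearest-node allocation that a Voronoi diagram formalises can be sketched in a few lines. The following minimal illustration (all names, coordinates and demand values are hypothetical) assigns each known demand point to the node with the smallest Euclidean distance, which is equivalent to locating the point's Voronoi region:

```python
from math import hypot

def allocate_to_nearest(nodes, demand_points):
    """Assign each demand point to the node with the smallest Euclidean
    distance -- i.e. to the node in whose Voronoi region it lies.
    Returns {node_id: summed demand}."""
    allocation = {nid: 0.0 for nid in nodes}
    for (x, y, demand) in demand_points:
        nearest = min(nodes, key=lambda nid: hypot(nodes[nid][0] - x,
                                                   nodes[nid][1] - y))
        allocation[nearest] += demand
    return allocation

nodes = {"n1": (0.0, 0.0), "n2": (10.0, 0.0)}
points = [(2.0, 1.0, 5.0), (8.0, -1.0, 3.0), (4.9, 0.0, 2.0)]
print(allocate_to_nearest(nodes, points))  # {'n1': 7.0, 'n2': 3.0}
```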

If demands and their locations are not known, they can be determined based on the size of the Voronoi regions, the population distribution and density, and a consumption per capita value. Alternatively, building areas can be allocated to the nearest nodes based on the Voronoi diagram if the corresponding data is available. In this case the actual demand is calculated by using a coefficient relating the population to a building area unit and a consumption per capita value. The sketch on the right in figure 2 schematically shows the allocation of demands (in this case building areas) to nodes using a Voronoi diagram.
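As a minimal numeric sketch of the first variant (function name and values are hypothetical), a node's demand follows from its Voronoi region size, the population density and a per-capita consumption value:

```python
def node_demand(region_area_m2, population_density_per_km2, lpcd):
    """Demand of a node derived from the size of its Voronoi region,
    the population density and a per-capita consumption value
    (litres per capita per day)."""
    population = region_area_m2 / 1e6 * population_density_per_km2
    return population * lpcd  # litres per day

# e.g. a 0.5 km2 region, 4000 inhabitants/km2, 120 l/(capita*day)
print(node_demand(500_000, 4000, 120))  # 240000.0 litres per day
```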

**Figure 2.** Illustration of a Voronoi diagram and a Delaunay triangulation, and demand allocation using a Voronoi diagram

Pipe network data, population distribution and building data are commonly stored and managed in geographic information systems (GIS). Thus, a GIS-application to allocate water demand values to the network model nodes is self-evident and commonly applied.

The calculation of common Voronoi diagrams is based on a set of nodes and the Euclidean distance between the nodes and the points of the areas. Various well known and often used algorithms to generate such Voronoi diagrams exist, for example the sweep-algorithm (Fortune, 1986; Preparata and Shamos, 1985) and the divide-and-conquer-approach (Shamos, 1975).


The approach to allocate demands using such common Voronoi diagrams follows the assumption that a consumer is connected to the closest existing pipe and to the closest existing node respectively. Boundaries and obstacles are not considered as constraints. In this context, areas with different population densities, for example, constitute boundaries, and railway tracks and lakes within the supply area, for example, constitute obstacles which cannot be crossed by distribution pipes. Thus, with common Voronoi diagrams, consumers, areas or building areas are possibly allocated to nodes which are closest according to the Euclidean distance but are not closest if boundaries or obstacles are considered. This constitutes a significant limitation of the application of Voronoi diagrams based on the Euclidean distance for the allocation of water demand.

If boundaries are convex, no obstacles are located in between two originating nodes of a bi-sector. The Voronoi diagram can still be calculated based on the Euclidean distance, and the convex boundaries can be considered easily by intersecting the Voronoi diagram with the convex boundary. Conversely, sections of non-convex boundaries or obstacles might be located in between two originating nodes of a bi-sector. In these cases the determination of the bi-sectors based on the Euclidean distance is no longer sufficient. The shortest way to bypass the non-convex section or obstacle respectively has to be considered instead.
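The intersection of a Voronoi region with a convex boundary can be realised, for example, with the standard Sutherland-Hodgman clipping algorithm. The following is a generic sketch of that algorithm, not the authors' implementation:

```python
def clip_to_convex(region, boundary):
    """Sutherland-Hodgman clipping of a Voronoi region (list of (x, y)
    vertices) against a convex boundary polygon given counter-clockwise."""
    def inside(p, a, b):
        # p lies on or to the left of the directed boundary edge a -> b
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0.0
    def intersection(p, q, a, b):
        # intersection of segment p-q with the infinite line through a-b
        x1, y1, x2, y2 = p[0], p[1], q[0], q[1]
        x3, y3, x4, y4 = a[0], a[1], b[0], b[1]
        den = (x1-x2)*(y3-y4) - (y1-y2)*(x3-x4)
        t = ((x1-x3)*(y3-y4) - (y1-y3)*(x3-x4)) / den
        return (x1 + t*(x2-x1), y1 + t*(y2-y1))
    output = list(region)
    for a, b in zip(boundary, boundary[1:] + boundary[:1]):
        source, output = output, []
        for p, q in zip(source, source[1:] + source[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersection(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersection(p, q, a, b))
    return output

# an oversized region clipped to the unit square
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
region = [(-1, -1), (2, -1), (2, 2), (-1, 2)]
print(clip_to_convex(region, square))
```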

Okabe et al. (2000) describe this problem as "Voronoi diagrams with obstacles" and introduce the "shortest path Voronoi diagram". Thereby, obstacles are supposed to be considered by calculating Voronoi diagrams based on the shortest path instead of the Euclidean distance. A satisfactory solution of this approach would result in an increased model accuracy and therefore higher reliability of the simulation calculations.

Figure 3 illustrates the difference between a Voronoi diagram, which is determined based on the Euclidean distance (left sketch) and based on the shortest path (right sketch). Not considering (convex sections of) the boundaries, for example, results in an allocation of areas to node n1 which can only be part of the Voronoi regions of the nodes n4 and n5 if the boundaries given in the example are respected. Not considering the shortest path to bypass non-convex sections of the boundaries leads, for example, to an allocation of an area to node n2 instead of node n1. Accordingly, an area is allocated to node n3 instead of node n2 if the shortest path to bypass the obstacle given in the example is ignored and the Voronoi diagram is determined based on the Euclidean distance.

Several authors have tackled this problem. The approach of Aronov (1987) considers non-convex boundaries but not obstacles within the Voronoi regions. The approach of Papadopoulou and Lee (1998) does consider obstacles as constraints, but only under specific conditions. None of the known approaches solves the problem satisfactorily for an application in the given context. The consideration of both non-convex boundaries and obstacles would be desirable.


**Figure 3.** Comparison of a Voronoi diagram based on the Euclidean distance and the shortest path (Klingel, 2010)

In this context, the additively weighted Voronoi diagram is of particular interest. Johnson and Mehl (1939) introduced additive weights for the originating nodes of Voronoi diagrams to model the growth of crystals, which grow at different speeds in different directions. Every node is weighted with a positive or negative constant (additive constant) which is subtracted from the distance function when calculating the Voronoi diagram. Thus, contrary to a common Voronoi diagram based on two nodes, the bi-sector is not a straight line in the geometric centre between the nodes if the two nodes are weighted with different constants. The bi-sector constitutes a hyperbola lying closer to the node with the smaller weight (Aurenhammer, 1991).
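The weighted distance function can be illustrated with a small numeric sketch (node positions and weights are hypothetical): a point of the bi-sector has the same additively weighted distance to both nodes, and the bi-sector lies closer to the node with the smaller weight:

```python
from math import hypot

def weighted_distance(p, node, weight):
    """Additively weighted distance: the Euclidean distance minus the
    node's additive constant (Johnson and Mehl, 1939)."""
    return hypot(p[0] - node[0], p[1] - node[1]) - weight

n1, w1 = (0.0, 0.0), 0.0
n2, w2 = (10.0, 0.0), 4.0
# On the axis, the bi-sector point satisfies d(p,n1) - 0 = d(p,n2) - 4,
# i.e. x = 3: closer to n1, the node with the smaller weight.
p = (3.0, 0.0)
print(weighted_distance(p, n1, w1), weighted_distance(p, n2, w2))  # 3.0 3.0
```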

This chapter presents a novel approach to calculate Voronoi diagrams based on the shortest path instead of the Euclidean distance, considering non-convex boundaries and obstacles, which Guth (2009) developed in his Bachelor's thesis. At critical points of boundaries and obstacles, so-called subordinated nodes are temporarily defined and weighted with an additive constant to consider the shortest path as distance function for the calculation of the Voronoi diagram. The approach is presented in section 2 of this chapter. The resultant algorithm is expounded in section 3. The testing of the algorithm and its implementation as a GIS-tool based on the software package ESRI ArcGIS (Ormsby et al., 2004) are demonstrated in sections 4 and 5.

## **2. Approach**


The basis of the novel approach described herein is a geometric and iterative method to calculate Voronoi diagrams. Thereby, the pairs of originating nodes of a bi-sector (the immediate neighbours) have to be determined first. Each node of a pair is defined to be the origin (centre) of concentric circles. Circles with the same radius are intersected. The intersection points constitute points of the bi-sector. Thus, the Voronoi diagram based on the Euclidean distance can be calculated.
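The intersection of two circles of equal radius, which yields the bi-sector points between a pair of neighbouring nodes, can be sketched with a generic circle-intersection routine (an illustrative sketch, not the authors' implementation):

```python
from math import hypot, sqrt

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles; for equal radii around a pair
    of neighbouring nodes these are points of the bi-sector."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = hypot(dx, dy)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []  # circles do not intersect
    a = (r1*r1 - r2*r2 + d*d) / (2*d)   # distance from c1 along the axis
    h = sqrt(max(r1*r1 - a*a, 0.0))     # offset perpendicular to the axis
    mx, my = c1[0] + a*dx/d, c1[1] + a*dy/d
    return [(mx + h*dy/d, my - h*dx/d), (mx - h*dy/d, my + h*dx/d)]

# equal radii around n1 = (0,0) and n2 = (4,0): both intersection
# points lie on the bi-sector x = 2
print(circle_intersections((0.0, 0.0), 3.0, (4.0, 0.0), 3.0))
```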

As discussed in section 1, to consider non-convex boundaries and obstacles, the determination of the Voronoi diagram has to be based on the shortest path. Thus, the shortest paths from every point of a bi-sector to each of the originating nodes of the bi-sector have the same length. This leads to a linear form of the bi-sector if the points of the bi-sector are 'visible' from both originating nodes (the length of the shortest path is equal to the length of the Euclidean distance) and to a hyperbolic form if a non-convex boundary section or an obstacle leads to a shortest path which is longer than the Euclidean distance (cf. fig. 3).

When considering boundaries and obstacles they are assumed to be represented by polygons with vertices, which is the case in the context of the development of a GIS solution. The critical points of bounding polygons are the extreme vertices of non-convex sections. Accordingly, the critical points of obstacles are the extreme vertices of convex sections. These critical points are identified and added to the set of given nodes (primary nodes) as subordinated nodes. A subordinated node is always related to the closest primary node. An additive constant which represents the shortest path length to that primary node is assigned to each subordinated node. In figure 4 an example of three primary nodes (n1, n2 and n3), a bounding polygon and three obstacles is given. Subordinated nodes are defined at all critical points of the boundary and the obstacles. For example, the three subordinated nodes n11, n12 and n13 are related to node n1. The additive constant of the subordinated node n12 is also shown exemplarily in figure 4.
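The definition of subordinated nodes can be sketched as follows. This is a simplified illustration (all names and coordinates hypothetical) that relates each critical vertex to its closest primary node and uses the direct distance as additive constant, i.e. it assumes the critical point is visible from its primary node:

```python
from math import hypot

def subordinate_nodes(primary_nodes, critical_points):
    """Relate each critical vertex of a boundary/obstacle polygon to its
    closest primary node and assign the shortest path length to that node
    as additive constant (simplified here to the direct distance)."""
    subordinated = []
    for cp in critical_points:
        nid = min(primary_nodes,
                  key=lambda n: hypot(primary_nodes[n][0] - cp[0],
                                      primary_nodes[n][1] - cp[1]))
        c = hypot(primary_nodes[nid][0] - cp[0], primary_nodes[nid][1] - cp[1])
        subordinated.append({"point": cp, "primary": nid,
                             "additive_constant": c})
    return subordinated

primary = {"n1": (0.0, 0.0), "n2": (10.0, 0.0)}
critical = [(3.0, 4.0), (9.0, 3.0)]
for s in subordinate_nodes(primary, critical):
    print(s)
```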

**Figure 4.** Illustration of subordinated nodes of a given configuration in a bounding polygon with three primary nodes and three obstacles

Even though the principal idea of using additively weighted nodes was introduced by Johnson and Mehl (1939), their approach to calculate the Voronoi diagram cannot be applied successfully to solve the given problem. Transferred to the given context, the approach only solves configurations where the additive constant of a subordinated node is shorter than the distance between the primary node and the related subordinated node (Okabe et al., 2000). In the given case the additive constant is always equal to the distance between primary and subordinated node. Thus, the development of a novel solution is needed.


Like the computation of Voronoi diagrams based on a set of primary nodes by intersecting circles, the determination based on primary and subordinated nodes can be regarded as the projection of the intersection of cones. Each node (located in the x-y-plane) represents the tip of a cone which opens upwards. All cones have the same cone angle. The cones at the subordinated nodes are moved vertically up (z-direction) by the value of their additive constant. At a certain height (z-value), the intersection of the cones can be represented by the intersection of circles with the radius of the cones at that specific height (z-value). Cones related to primary nodes have the same radii at the same height. Thus, the projection (to the x-y-plane) of the intersection of cones related to primary nodes results in a linear shaped section of the corresponding bi-sector. Accordingly, the intersection of the cones related to a primary and a subordinated node results in a hyperbolic shape due to the different radii at a certain height of the cones.

To simplify the calculation, the projected cones are treated (discretised) as concentric circles. The height above the plane is reflected by the circle's radius. Circles with the same radii are intersected to calculate bi-sector points. For the subordinated nodes' circles, the radii are decreased by the respective additive constant. Resultant radii below zero are not admissible.

For the calculation of the Voronoi diagram the radii of the circles are increased successively. For each radius (iteration), the circles which are related to the primary nodes or their subordinated nodes are intersected in several calculation steps. In one calculation step the circles of two nodes are intersected. If there is an intersection point which is visible from both circles' centres (originating nodes) and is not located within another circle, the point is stored as bi-sector point. Visibility may be limited by the bounding polygon or an obstacle. The difference between the radii of two consecutive iterations constitutes the discretisation of the iterative approach.
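The iteration described above can be sketched as follows. This simplified illustration omits the visibility tests against the bounding polygon and obstacles, and all node data are hypothetical:

```python
from math import hypot, sqrt

def bisector_points(nodes, weights, r_max, dr):
    """Iteratively grow circles around all nodes (radii of subordinated
    nodes reduced by their additive constant) and keep intersection
    points that are not inside any other circle. Visibility tests are
    omitted in this simplified sketch."""
    def intersections(c1, r1, c2, r2):
        dx, dy = c2[0] - c1[0], c2[1] - c1[1]
        d = hypot(dx, dy)
        if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
            return []
        a = (r1*r1 - r2*r2 + d*d) / (2*d)
        h = sqrt(max(r1*r1 - a*a, 0.0))
        mx, my = c1[0] + a*dx/d, c1[1] + a*dy/d
        return [(mx + h*dy/d, my - h*dx/d), (mx - h*dy/d, my + h*dx/d)]

    ids = list(nodes)
    points = []
    r = dr
    while r <= r_max:                              # one iteration per radius
        radii = {i: r - weights[i] for i in ids}   # shrink weighted circles
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = ids[i], ids[j]
                if radii[a] < 0 or radii[b] < 0:
                    continue                       # negative radii not admissible
                for p in intersections(nodes[a], radii[a], nodes[b], radii[b]):
                    # discard points lying inside a third circle
                    if all(hypot(p[0] - nodes[k][0], p[1] - nodes[k][1])
                           >= radii[k] - 1e-9
                           for k in ids if k not in (a, b)):
                        points.append(p)
        r += dr
    return points

nodes = {"n1": (0.0, 0.0), "n2": (4.0, 0.0)}
weights = {"n1": 0.0, "n2": 0.0}
pts = bisector_points(nodes, weights, r_max=5.0, dr=0.5)
# all bi-sector points of two equally weighted nodes lie on x = 2
print(all(abs(x - 2.0) < 1e-9 for x, _ in pts))  # True
```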

The result of the approach is a Voronoi diagram consisting of polygons with discretely determined vertices. As GIS do not store mathematical functions, this is not a drawback. The resulting polygons can be directly stored without any further processing.

Figure 5 shows a schematic illustration of the approach. The given configuration comprises two primary nodes (n1 and n2). A boundary polygon constitutes a critical point. The pertinent subordinated node n21 is related to the closest primary node, node n2. In the area visible from both primary nodes (right of the 'visibility boundary' of node n2) the bi-sector between the nodes is linear. The bi-sector vertices (black squares in the figure) are determined by intersecting concentric circles with the same radii. In the area not visible from node n2 (left of the 'visibility boundary' of node n2), the bi-sector has a hyperbolic shape. The vertices (black rhombi in the figure) are determined by intersecting the concentric circles of the primary node n1 and the subordinated node n21. The radii of the circles with the subordinated node n21 as origin are always decreased by the additive constant, which is the length of the shortest path between the nodes n2 and n21. In the figure, circles which are intersected are plotted in the same colour.

**Figure 5.** Schematic illustration of the approach

## **3. Algorithm**

288 Application of Geographic Information Systems

certain height of the cones.

the iterative approach.

Even though the principle idea of using additively weighted nodes was invented by Johnson and Mehl (1939), their approach to calculate the Voronoi diagram cannot be applied successfully to solve the given problem. Transferred to the given context, the approach does only solve configurations where the additive constant of a subordinated node is shorter than the distance between the primary node and the related subordinated node (Okabe et al., 2000). In the given case the additive constant is always equal to the distance between primary and subordinated node. Thus, the development of a novel solution is needed.

Like the computation of Voronoi diagrams based on a set of primary nodes by intersecting circles, the determination based on primary and subordinated nodes can be regarded as the projection of the intersection of cones. Each node (located in the x-y-plane) represents the tip of a cone which is opened upwards. All cones have the same cone angle. The cones at the subordinated nodes are moved vertically up (z-direction) by the value of their additive constant. At a certain height (z-value), the intersection of the cones can be represented by the intersection of circles with the radius of the cones at that specific height (z-value). Cones related to primary nodes have the same radii at the same height. Thus, the projection (to the x-y-plane) of the intersection of cones related to primary nodes results in a linear shaped section of the according bi-sector. Accordingly, the intersection of the cones related to a primary and a subordinated node results in a hyperbolic shape due to the different radii at a

To simplify the calculation, the projected cones are treated (discretised) as concentric circles. The height above the plane is reflected by the circle's radius. Circles with the same radii are intersected to calculate bi-sector points. For the subordinated nodes' circles, the radii are decreased by the respective additive constant. Resultant radii below zero are not admissible.

For the calculation of the Voronoi diagram, the radii of the circles are increased successively. For each radius (iteration), the circles which are related to the primary nodes or their subordinated nodes are intersected in several calculation steps. In one calculation step the circles of two nodes are intersected. If there is an intersection point which is visible from both circles' centres (originating nodes) and is not located within another circle, the point is stored as a bi-sector point. Visibility may be limited by the bounding polygon or an obstacle. The difference between the radii of two successive iterations constitutes the discretisation of the bi-sectors.

The result of the approach is a Voronoi diagram consisting of polygons with discretely determined vertices. As GIS do not store mathematical functions, this is not a drawback: the resulting polygons can be directly stored without any further processing.

Figure 5 shows a schematic illustration of the approach. The given configuration comprises two primary nodes (n1 and n2). A boundary polygon constitutes a critical point. The pertinent subordinated node n21 is related to the closest primary node, node n2. In the area visible from both primary nodes (right of the 'visibility boundary' of node n2) the bi-sector between the nodes is linear. The bi-sector vertices (black squares in the figure) are determined by intersecting concentric circles with the same radii. In the area not visible from node n2, the circles of the subordinated node n21, reduced by its additive constant, are intersected with the circles of node n1, so that this section of the bi-sector is hyperbolic.

### **3.1. Overview**

For the realisation of the approach described in the previous section, an algorithm was developed, tested (cf. section 4) and finally implemented in a GIS-tool (cf. section 5). An overview of the individual steps of the algorithm is given in figure 6.

**Figure 6.** Flowchart of the algorithm

First, the subordinated nodes of the given configuration are identified and weighted with an additive constant. Next, points of the bi-sectors are calculated by intersecting circles with corresponding radii and the nodes as origin. Finally, based on the calculated set of bi-sector points, the Voronoi diagram is determined. Thus, the algorithm can be divided into three principle parts: initialization, computation and creation of the resulting Voronoi diagram. Each of the following subsections describes one of the three parts.

### **3.2. Initialization**

Identifying the subordinated nodes constitutes the first step of the initialization. To this end, the bounding polygon and the polygons which describe the obstacles are analysed regarding the exterior angle at each vertex. For the bounding polygon this is facilitated by calculating the azimuths of the current vertex to the predecessor (t1 in equation 1) and the successor (t2 in equation 1) vertices. The difference between the azimuths is subtracted from the full circle. Thus, the outer angle β is calculated by equation 1:

$$
\beta = 2\pi - (t_1 - t_2)\tag{1}
$$



The result is normalised to β ∈ [0; 2π) by successive subtraction of 2π where required. Assuming that the polygon is defined clockwise, the vertex is marked as a subordinated node if it holds that β < π. The angles β of all vertices are added (∑β) and compared with the theoretical angular sum W. W is computed by equation 2, where n is the number of vertices:

$$W = (n + 2) \cdot \pi \tag{2}$$

If ∑β = W, the polygon is defined clockwise and the marks are correct. Otherwise, the order of the vertices is switched and the algorithm is executed again. For the obstacle polygons, the same algorithm is used but the marks are switched in the end. Thus, only the vertices which stick out of the obstacle (convex sections) are considered as subordinated nodes. The identified subordinated nodes are added to the list of originating nodes.
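A minimal sketch of this initialization step (illustrative code, assuming a y-up coordinate system and compass-style azimuths; the function names are not from the original tool):

```python
import math

def azimuth(p, q):
    """Compass azimuth of the direction p -> q (0 = +y axis, clockwise)."""
    return math.atan2(q[0] - p[0], q[1] - p[1]) % (2 * math.pi)

def exterior_angles(poly):
    """Outer angle beta = 2*pi - (t1 - t2) at each vertex (equation 1),
    normalised to [0, 2*pi)."""
    n = len(poly)
    betas = []
    for i in range(n):
        v, pred, succ = poly[i], poly[i - 1], poly[(i + 1) % n]
        t1, t2 = azimuth(v, pred), azimuth(v, succ)
        betas.append((2 * math.pi - (t1 - t2)) % (2 * math.pi))
    return betas

def subordinated_vertices(poly):
    """Indices of the non-convex vertices (beta < pi) of a bounding polygon.
    If the angular sum does not match W = (n + 2)*pi (equation 2), the
    vertex order is switched and the angles are recomputed."""
    betas = exterior_angles(poly)
    W = (len(poly) + 2) * math.pi
    if abs(sum(betas) - W) > 1e-9:
        poly = poly[::-1]  # polygon was defined counter-clockwise
        betas = exterior_angles(poly)
    return [i for i, b in enumerate(betas) if b < math.pi]
```

For an obstacle polygon the marks would be switched afterwards, as described above, so that only the convex (sticking-out) vertices remain.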

In the subsequent step, with the algorithm of Dijkstra (1959), the relation of subordinated nodes and primary nodes is determined and the additive constant for each subordinated node is calculated. The Dijkstra algorithm analyses the graph based on all subordinated and primary nodes, with links between pairs of nodes which are visible to each other considering the given boundaries and obstacles. From each subordinated node the possible paths to all primary nodes are calculated. Thus, the shortest path determines the primary node related to the subordinated node. Furthermore, the length of the shortest path constitutes the additive constant.
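The shortest-path search can be sketched as follows (illustrative code; the visibility graph is assumed to be given as an adjacency list, and its construction from the boundary and obstacle polygons is not shown):

```python
import heapq

def nearest_primary(sub, primaries, graph):
    """Dijkstra (1959) on the visibility graph: returns the primary node
    closest to the subordinated node `sub` together with the shortest-path
    length, which becomes the additive constant. `graph` maps each node to
    a list of (neighbour, distance) pairs and is assumed to contain only
    links between mutually visible nodes (hypothetical input structure)."""
    dist = {sub: 0.0}
    heap = [(0.0, sub)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u in primaries:
            return u, d  # the first primary node settled is the nearest one
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None, float("inf")
```

Because Dijkstra settles nodes in order of increasing distance, the search can stop at the first primary node reached.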

### **3.3. Computation of bi-sector points**



As mentioned before, the bi-sectors between the Voronoi regions are made up of the intersection points of circles. Node pair by node pair, the circles of all nodes are intersected with each other to determine the intersection points. The process starts with a randomly chosen pair of nodes.

For the circles to be intersected, the minimum and maximum radius and obligatory radii are determined first. The minimum radius is where both circles touch. To identify the maximum radius, the maximal possible distances from each originating node (centres of the circles) to a visible point within the boundary are calculated. The shorter distance is used as maximum radius. The obligatory radii for each pair of nodes are those where the intersection points lie on a segment of the boundary or obstacle polygon. When the circles of a primary and a subordinated node are intersected, the obligatory radii have to be determined iteratively.
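The minimum radius is not stated as a formula, but it follows from the tangency condition: with d(n1, n2) as the distance between the two originating nodes and a1, a2 as their additive constants (zero for primary nodes), the reduced circles touch when their radii sum to that distance:

```latex
(r_{\min} - a_1) + (r_{\min} - a_2) = d(n_1, n_2)
\quad\Rightarrow\quad
r_{\min} = \frac{d(n_1, n_2) + a_1 + a_2}{2}
```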

The iteration method is described below and illustrated in figure 7. B1 is the starting vertex of a segment, IP the unknown intersection point of circles with the centres (originating nodes) n1 and n2. The vector v points into the direction of the segment and is normalised to length 1. The unknown vectors u and w point from the intersection point to the respective centres.

**Figure 7.** Setup for the iterative calculation of intersection points on polygon segments

To compute the unknown vectors, the following constraints (equations 3 to 7) are used with ux, uy, wx, wy, vx, vy as x- and y-components of the respective vectors, n1x, n1y, n2x, n2y, B1x, B1y as x- and y-coordinates of the respective points and m as length multiplication factor of vector v, which expresses the distance from B1 to IP:

$$
u_x + n_{1x} - n_{2x} - w_x = 0 \tag{3}
$$


| Node | Voronoi region based on the Euclidean distance [length²] | Voronoi region based on the shortest path [length²] | Difference [length²] | Difference [%] |
|------|---------------------------------------------------------:|----------------------------------------------------:|---------------------:|---------------:|
| n1   | 2,067.60 | 1,889.24 | -178.36 | -8.6 |
| n2   | 2,210.80 | 2,213.44 | 2.63    | 0.1  |
| n3   | 2,120.66 | 2,296.30 | 175.65  | 8.3  |

**Table 1.** Comparison of the Voronoi region sizes of case 1

$$u_y + n_{1y} - n_{2y} - w_y = 0 \tag{4}$$

$$m \cdot v_x + w_x + B_{1x} - n_{1x} = 0 \tag{5}$$

$$m \cdot v_y + w_y + B_{1y} - n_{1y} = 0 \tag{6}$$

$$
\sqrt{w_x^2 + w_y^2} = \sqrt{u_x^2 + u_y^2} \tag{7}
$$

Equations 3 and 4 express a closed vector-path from IP via n1 and n2 back to IP. Equations 5 and 6 represent the closed path from B1 via IP and n1 back to B1. Equation 7 guarantees that the vectors u and w have an equal length. Taking the constraints into account, m can be computed using equation 8:

$$m = \frac{n_{2x}^2 - 2 \cdot n_{2x} \cdot B_{1x} + n_{2y}^2 - 2 \cdot n_{2y} \cdot B_{1y} + 2 \cdot n_{1x} \cdot B_{1x} - n_{1x}^2 + 2 \cdot n_{1y} \cdot B_{1y} - n_{1y}^2}{2 \cdot \left(n_{2x} \cdot v_x + n_{2y} \cdot v_y - n_{1x} \cdot v_x - n_{1y} \cdot v_y\right)}\tag{8}$$

Finally, IP can be calculated with equation 9:

$$
\underline{IP} = B_1 + m \cdot \underline{v} \tag{9}
$$

In the first iteration, the centres are moved in normal direction away from the segment by the value of the additive constant. The normal direction is used as temporary vector for u and w and IP is computed. In the next iterations, the centres are moved in direction of the last u and w respectively. After the changes of u and w fall below a given bound, the final radii are calculated by adding the distance from IP to the centre and the additive constant.
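The iteration can be sketched as follows (illustrative code, not the original implementation; `m_equidistant` evaluates equation 8, and the shifting of the centres along the segment normal and then along u and w follows the description above):

```python
import math

def unit(x, y):
    d = math.hypot(x, y)
    return (x / d, y / d)

def m_equidistant(n1, n2, B1, v):
    """Equation 8: m such that IP = B1 + m*v (equation 9) is equidistant
    from the (possibly shifted) centres n1 and n2."""
    num = (n2[0] ** 2 - 2 * n2[0] * B1[0] + n2[1] ** 2 - 2 * n2[1] * B1[1]
           + 2 * n1[0] * B1[0] - n1[0] ** 2 + 2 * n1[1] * B1[1] - n1[1] ** 2)
    den = 2.0 * ((n2[0] - n1[0]) * v[0] + (n2[1] - n1[1]) * v[1])
    return num / den

def boundary_bisector_point(n1, a1, n2, a2, B1, v, tol=1e-10, max_iter=100):
    """Iteratively locate IP = B1 + m*v on a polygon segment such that
    |IP - n1| + a1 == |IP - n2| + a2 (a1, a2: additive constants)."""
    v = unit(v[0], v[1])

    def shifted(c, a, direction):
        return (c[0] + a * direction[0], c[1] + a * direction[1])

    def away_from_segment(c):
        # segment normal, oriented from the segment line towards centre c
        nx, ny = -v[1], v[0]
        s = (c[0] - B1[0]) * nx + (c[1] - B1[1]) * ny
        return (nx, ny) if s >= 0 else (-nx, -ny)

    # first iteration: move the centres away from the segment along its normal
    d1, d2 = away_from_segment(n1), away_from_segment(n2)
    m = m_equidistant(shifted(n1, a1, d1), shifted(n2, a2, d2), B1, v)
    for _ in range(max_iter):
        ip = (B1[0] + m * v[0], B1[1] + m * v[1])
        # next iterations: move the centres in the direction of u and w
        if a1:
            d1 = unit(n1[0] - ip[0], n1[1] - ip[1])
        if a2:
            d2 = unit(n2[0] - ip[0], n2[1] - ip[1])
        m_new = m_equidistant(shifted(n1, a1, d1), shifted(n2, a2, d2), B1, v)
        if abs(m_new - m) < tol:
            m = m_new
            break
        m = m_new
    return (B1[0] + m * v[0], B1[1] + m * v[1])
```

Shifting a centre by its additive constant along the unit vector pointing away from IP increases its distance to IP by exactly that constant, so the unweighted equidistance of equation 8 approximates the additively weighted condition, and the fixed point of the iteration satisfies it.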

After the minimum, maximum and obligatory radii are identified, the intersection points (bi-sector points) related to the pair of nodes are calculated. Beginning with the minimum radius, the circles of the pair of nodes are intersected. For each intersection point it is checked whether the point is visible from both centres and whether no other centre is closer to the point. If one of the constraints is not fulfilled, the point is dismissed. Otherwise, the point is stored in a list, individually created for each pair of nodes. After the first intersections with the minimum radius, the radius is increased. The new radius is determined with a function which takes the preceding intersection results into account. For small bends of the resulting bi-sectors the increment is larger than for intersections which form a strong bend. Increments may also be negative if necessary. When the maximum radius is reached, the obligatory radii for the points on the boundary and the obstacles are processed and the next pair of nodes is chosen.

### **3.4. Determination of Voronoi diagram**

As the intersection points are stored in lists without order, the resulting polygons of the bisectors cannot be created directly. For each list a polyline is created first. This is done by picking the first point in the list and searching for the point with the greatest distance to it. This point is defined as the starting point of the polyline. Next, the closest point to the starting node is searched for, defined as the next point and marked. The process is repeated until no unmarked points are left on the list. To form the resulting Voronoi diagram, the individual polylines are merged by comparing the coordinates of the polylines' starting points and endpoints and assembling the nearest ones.
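The ordering step can be sketched as follows (illustrative code; `math.dist` is the Euclidean distance, and the quadratic nearest-neighbour search mirrors the description rather than an optimised implementation):

```python
import math

def chain_points(points):
    """Order an unordered list of bi-sector points into a polyline:
    pick the first point in the list, start at the point farthest from it,
    then repeatedly append the nearest not-yet-marked point."""
    if not points:
        return []
    rest = list(points)
    first = rest[0]
    start = max(rest, key=lambda p: math.dist(p, first))
    rest.remove(start)
    line = [start]
    while rest:
        nxt = min(rest, key=lambda p: math.dist(p, line[-1]))
        rest.remove(nxt)
        line.append(nxt)
    return line
```

Merging the resulting polylines by comparing their start and end points then yields the closed Voronoi region polygons.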

## **4. Testing**



The developed approach and algorithm respectively are tested in order to evaluate the functionality and performance. Several configurations of given sets of nodes and boundaries are analysed by comparing the Voronoi diagram based on the Euclidean distance and the Voronoi diagram calculated with the novel approach. The comparison is mainly based on the size of the Voronoi regions. Two of the tested cases are discussed in the following.

The first case is a synthetic example configuration solely created to test the functionality of the developed algorithm. It consists of only three originating nodes (n1, n2, n3), a non-convex boundary and three obstacles located within the boundary. Figure 8 shows the case and the calculated Voronoi diagrams based on the Euclidean distance and the shortest path. The individual sizes of the areas and the differences are shown in table 1. The differences do not sum up to zero as the approach is approximate and small holes between the individual Voronoi regions may occur. Furthermore, the values in the table are rounded to two decimal places.



The Voronoi region of node n1 decreases significantly in size (-8.6%) with the new approach compared to the Euclidean Voronoi diagram due to the large area on the left. Considering the bounding polygon, this area is not visible from the primary node n1. Thus, within the shortest path Voronoi diagram, the specific area is allocated to the Voronoi regions of the nodes n2 and n3 instead of n1. The Voronoi region of node n2 remains almost the same (0.1%): it gains area from the region of node n1, but nearly the same amount is allocated to the region of node n3 in the upper part of the image, where the bi-sector is bent to the right around the obstacle. Coming closer to the boundary, the same bi-sector bends a little to the left again. This is due to the change of the shortest path to bypass the left obstacle from the way around the obstacle on the left hand side to the way around on the right hand side (between the two obstacles). Within the Euclidean Voronoi region of node n3 no obstacles or non-convex sections of the bounding polygon are located which might have an influence on the bi-sectors if the Voronoi diagram is calculated based on the shortest path. Thus, it gains area of the Voronoi regions of both nodes n1 and n2 (8.3%).

**Figure 8.** Test case 1 with calculated Voronoi diagrams based on the Euclidean distance and the shortest path (novel approach)

| Node | Voronoi region based on the shortest path [length²] | Voronoi region based on the Euclidean distance [length²] | Difference [length²] | Difference [%] |
|------|----------------------------------------------------:|---------------------------------------------------------:|---------------------:|---------------:|
| n1   | 33,820.06 | 38,722.92 | -4,902.86 | -12.70 |
| n2   | 28,519.46 | 25,179.86 | 3,339.60  | 13.30  |
| n3   | 20,510.38 | 20,517.97 | -7.59     | 0.00   |
| n4   | 14,185.69 | 13,991.04 | 194.65    | 1.40   |
| n5   | 19,486.64 | 20,731.79 | -1,245.15 | -6.00  |
| n6   | 36,329.44 | 35,075.23 | 1,254.21  | 3.60   |
| n7   | 29,521.52 | 33,492.01 | -3,970.49 | -11.90 |
| n8   | 15,188.57 | 14,350.52 | 838.05    | 5.80   |
| n9   | 3,656.39  | 3,656.57  | -0.18     | 0.00   |
| n10  | 2,809.11  | 2,809.19  | -0.08     | 0.00   |
| n11  | 23,676.38 | 25,289.44 | -1,613.06 | -6.40  |
| n12  | 16,706.28 | 16,185.26 | 521.03    | 3.20   |
| n13  | 16,916.03 | 17,686.10 | -770.07   | -4.40  |
| n14  | 21,487.65 | 20,796.47 | 691.18    | 3.30   |
| n15  | 7,085.54  | 7,085.79  | -0.25     | 0.00   |
| n16  | 6,603.06  | 6,603.53  | -0.46     | 0.00   |
| n17  | 3,779.35  | 3,831.44  | -52.09    | -1.40  |
| n18  | 3,156.40  | 2,834.48  | 321.92    | 11.40  |
| n19  | 11,720.82 | 11,494.35 | 226.47    | 2.00   |
| n20  | 23,474.75 | 22,949.54 | 525.21    | 2.30   |

**Table 2.** Comparison of the Voronoi region sizes of case 2

The second case is a detail of an existing city in Algeria. It consists of 20 originating nodes (n1, n2, …, n20), a non-convex boundary and several obstacles located within the boundary. Figure 9 shows the case and the calculated Voronoi diagrams based on the Euclidean distance and the shortest path. The individual sizes of the Voronoi regions and the differences are shown in table 2. In addition to the explanation given already for table 1, the area differences do not sum up to zero as only a small detail of the data set is shown.

Case 2 clearly shows that the Voronoi diagram based on the Euclidean distance assigns areas to nodes which are in fact closer to other nodes once boundaries and obstacles are respected. This is the case, for example, for the nodes n7, n11 and n12. To node n11 the area to the right is assigned, which can only be reached by passing through the Voronoi region of node n7. To the two nodes close to the obstacle in the centre, n7 and n12, areas on the opposite side of the obstacle are allocated. Contrarily, all Voronoi regions of the shortest path Voronoi diagram have a direct connection to the related originating node. As ascertained in table 2, the differences in the area sizes are up to 13.3% gain and -12.7% loss respectively.




**Figure 9.** Test case 2 with calculated Voronoi diagrams based on the Euclidean distance and the shortest path (novel approach)

## **5. GIS-tool**

Due to the spatial reference of the water distribution network data and of the basic data for demand determination, such as population density, population distribution and building areas, a GIS-application to determine and allocate water demand values to the network model nodes is advantageous. Following is a description of a GIS-tool implemented as an independent library of functions and integrated into ESRI ArcGIS (version 9).

ArcGIS, a product of the company ESRI, consists of several interlinked GIS products. The desktop application *ArcMap* supports visualising, analysing and editing of spatial data and linked attributes. The desktop application *ArcCatalog* enables data management. The data is commonly stored in relational databases (*geodatabases*), but using single files (e.g. shape files) is possible too. Geometric objects (*features*) are modelled as vector data by defining shape and location of the object. Three principal *feature types* exist: *point features* (punctual objects), *line features* (linear objects with start point, end point and vertices) and *polygon features* (areas defined by a polygon where the start and end points have the same location). Attributes (without spatial reference) can be linked to *features*. *Features* of the same type and with identical characteristics are stored in object classes (*feature classes*). All ArcGIS products are based on the library *ArcObjects*, developed as a Microsoft Component Object Model (COM). *ArcObjects* offers numerous functionalities for the extension of ArcGIS products or the integration of applications into ArcGIS products. For further details on ArcGIS, the reader is referred to the technical literature, e.g. Ormsby et al. (2004). The application for demand allocation described in this section is based on the ArcGIS data concept and the *ArcObjects* functionalities. The application was implemented as a dynamically linked library (DLL) which is integrated into *ArcMap* as a toolbar with its own user interface.

The demand allocation consists of two principal steps: (1) the calculation of the catchment areas of the nodes (Voronoi regions related to the originating nodes) and (2) the determination of the water demand within each catchment area and its allocation to the corresponding node.

For the calculation of the catchment areas (first step), input parameters are the demand nodes of the network model graph, the boundaries of the populated areas and the obstacles within the populated areas. The nodes have to be provided as *point features* in a *point feature class*, the boundaries and obstacles as *polygon features* in *polygon feature classes*. With the approach described in the previous sections, the Voronoi diagram is calculated based on the nodes and stored in another *polygon feature class*. The references between Voronoi regions (catchment areas) and originating nodes are saved as an attribute of the nodes.

For the calculation of the demand per catchment area (second step), three options are developed and implemented considering the commonly available data basis: (A) values and locations of the demands are given as *point features*, (B) areas with the numbers of the population living there are given as *polygon features* or (C) building areas are given as *polygon features* and areas with the population are also given as *polygon features*. Moreover, options B and C need a constant demand per capita value as input parameter.

## **5. GIS-tool**

296 Application of Geographic Information Systems

**Figure 9.** Test case 2 with calculated Voronoi diagrams based on the Euclidean distance and the

shortest path (novel approach)

Due to the spatial reference of the water distribution network data and the basic data for demand determination such like population density, population distribution, building areas etc. a GIS-application to determine and allocate water demand values to the network model nodes is advantageous. Following is a description of a GIS-tool implemented as an independent library of functions and integrated into ESRI ArcGIS (version 9).

ArcGIS, a product of the company ESRI, consists of several interlinked GIS products. The desktop application *ArcMap* supports visualising, analysing and editing of spatial data and linked attributes. The desktop application *ArcCatalog* enables data management. The data is commonly stored in relational databases (*geodatabases*), but using single files (e.g. shape files) is possible too. Geometric objects (*features*) are modelled as vector data by defining the shape and location of the object. Three principal *feature types* exist: *point features* (punctual objects), *line features* (linear objects with start point, end point and vertices) and *polygon features* (areas defined by a polygon where the start and end points have the same location). Attributes (without spatial reference) can be linked to *features*. *Features* of the same type and with identical characteristics are stored in object classes (*feature classes*). All ArcGIS products are based on the library *ArcObjects*, developed as a Microsoft Component Object Model (COM). *ArcObjects* offers numerous functionalities for the extension of ArcGIS products or the integration of applications into ArcGIS products. For further details on ArcGIS, refer to the technical literature, e.g. Ormsby et al. (2004). The application for demand allocation described in this section is based on the ArcGIS data concept and the *ArcObjects* functionalities. It was implemented as a dynamically linked library (DLL) which is integrated into *ArcMap* as a toolbar with its own user interface.

The demand allocation consists of two principal steps: (1) the calculation of the catchment areas of the nodes (Voronoi regions related to the originating nodes) and (2) the determination of the water demand within each catchment area and its allocation to the corresponding node.

For the calculation of the catchment areas (first step), the input parameters are the demand nodes of the network model graph, the boundaries of the populated areas and the obstacles within the populated areas. The nodes have to be provided as *point features* in a *point feature class*, the boundaries and obstacles as *polygon features* in *polygon feature classes*. With the approach described in the previous sections, the Voronoi diagram is calculated based on the nodes and stored in another *polygon feature class*. The references between Voronoi regions (catchment areas) and originating nodes are saved as an attribute of the nodes.

For the calculation of the demand per catchment area (second step), three options were developed and implemented, considering the commonly available data sources: (A) values and locations of the demands are given as *point features*, (B) areas with the numbers of the population living there are given as *polygon features*, or (C) building areas are given as *polygon features* and areas with the population are also given as *polygon features*. Moreover, options B and C need a constant per capita demand value as an input parameter.

Option A) The demand values [volume/time] are accumulated per catchment area by intersecting the catchment areas (*polygon features*) and the demands (*point features*) according to the spatial references. Finally, the accumulated demands per catchment area are allocated to the originating nodes.
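The chapter performs this spatial join with ArcObjects; purely as an illustration, the accumulation in option A can be sketched in plain Python with a ray-casting point-in-polygon test. All node identifiers, coordinates and demand values below are invented:

```python
def point_in_polygon(pt, polygon):
    """Ray casting: count edge crossings of a ray running in +x from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def accumulate_demands(catchments, demand_points):
    """Sum point demands per catchment.
    catchments: {node_id: polygon}, demand_points: list of ((x, y), demand)."""
    totals = {node_id: 0.0 for node_id in catchments}
    for pt, demand in demand_points:
        for node_id, polygon in catchments.items():
            if point_in_polygon(pt, polygon):
                totals[node_id] += demand
                break                                   # Voronoi regions do not overlap
    return totals


catchments = {"n1": [(0, 0), (2, 0), (2, 2), (0, 2)],
              "n2": [(2, 0), (4, 0), (4, 2), (2, 2)]}
demand_points = [((1.0, 1.0), 5.0), ((3.0, 1.0), 2.0), ((1.5, 0.5), 1.0)]
totals = accumulate_demands(catchments, demand_points)  # {'n1': 6.0, 'n2': 2.0}
```

The `break` reflects the fact that Voronoi regions partition the plane, so each demand point belongs to exactly one catchment.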


Option B) The population per catchment area [inhabitants] is determined by intersecting the areas with constant values of the population (*polygon features*) and the catchment areas (*polygon features*) according to the spatial reference. To calculate the demand per node [volume/time], the population values [inhabitants] of the catchment area are multiplied by the demand per capita [volume/inhabitant/time].

Option C) The population per building area [inhabitants] is determined by intersecting the areas with constant values of the population (*polygon features*) and the building areas (*polygon features*) according to the spatial reference. The building areas, and thus the population, are accumulated per catchment area by intersecting the building areas (*polygon features*) and the catchment areas (*polygon features*) according to the spatial references. Finally, the demand per node [volume/time] is calculated by multiplying the population values of the catchment areas [inhabitants] by the demand per capita [volume/inhabitant/time].
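Options B and C both rest on polygon intersection followed by area-weighted accumulation. The chapter carries this out with ArcObjects; purely as an illustration, the core step can be sketched in plain Python with a Sutherland–Hodgman clip (which assumes a convex catchment polygon — a real GIS intersection handles arbitrary shapes). All coordinates and values are invented:

```python
def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by a convex, counter-clockwise `clipper`."""
    def inside(p, a, b):
        # p lies on or to the left of the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersection(p1, p2, a, b):
        # intersection of segment p1-p2 with the infinite line through a and b
        d1 = (p2[0] - p1[0], p2[1] - p1[1])
        d2 = (b[0] - a[0], b[1] - a[1])
        denom = d1[0] * d2[1] - d1[1] * d2[0]
        t = ((a[0] - p1[0]) * d2[1] - (a[1] - p1[1]) * d2[0]) / denom
        return (p1[0] + t * d1[0], p1[1] + t * d1[1])

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p1, p2 = polygon[j], polygon[(j + 1) % len(polygon)]
            if inside(p2, a, b):
                if not inside(p1, a, b):
                    output.append(intersection(p1, p2, a, b))
                output.append(p2)
            elif inside(p1, a, b):
                output.append(intersection(p1, p2, a, b))
        if not output:
            return []
    return output


def area(polygon):
    """Polygon area via the shoelace formula (orientation-independent)."""
    n = len(polygon)
    s = sum(polygon[i][0] * polygon[(i + 1) % n][1]
            - polygon[(i + 1) % n][0] * polygon[i][1] for i in range(n))
    return abs(s) / 2.0


catchment = [(0, 0), (4, 0), (4, 4), (0, 4)]     # one node's Voronoi region
pop_polygon = [(2, 2), (6, 2), (6, 6), (2, 6)]   # constant-population area, 400 inhabitants
overlap = clip(pop_polygon, catchment)
share = area(overlap) / area(pop_polygon)        # 4/16 = 0.25
population_in_catchment = 400 * share            # 100.0 inhabitants
node_demand = population_in_catchment * 0.15     # per-capita demand, approx. 15 m3/d
```

The population is apportioned by area share, which implicitly assumes a uniform population distribution within each population polygon — exactly the assumption options B and C make.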

Figure 10 shows an exemplary detail of the visualised result of a demand allocation with the GIS-tool. In the example, the nodes of the network model (black dots), a building area data set (dark green and orange areas) and a population density data set (transparent blue and transparent green areas) are given (as a *point feature class* and *polygon feature classes*). Moreover, a satellite picture of the setting is used as a base layer. Based on the set of nodes and the bounding population density polygons, the Voronoi diagram was calculated applying the novel approach (*polygon feature class*). The Voronoi regions (*polygon features*) constitute the catchment areas of the nodes (red lines). Considering the given data, the demand is determined based on the population density and the building distribution (option C).

For each population density area (*polygon feature*), the number of inhabitants is calculated by multiplying the respective density and area. Next, the determined population is related to the building areas. To this end, the total building area per population density area is determined by intersecting the two *polygon feature classes* according to their spatial reference. With the number of inhabitants and the total area of buildings per population density area, the inhabitants per building area unit are calculated for each population density area. After accumulating the building areas per node according to the node catchment areas, the population per node can be calculated. Finally, with the specific per capita demand, the demand per node is determined. Non-domestic consumers with given demands (orange areas in the figure), such as industries or administrations, are excluded from the process described and allocated to the nodes directly.

**Figure 10.** Exemplary solution of demand allocation with the developed GIS-tool (option C)
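The chain of multiplications in option C can be traced with a small numeric example (all values are hypothetical and serve only to illustrate the arithmetic, not data from the figure):

```python
# Hypothetical inputs for one population density area and one node catchment.
density = 120.0                  # inhabitants per hectare
polygon_area = 5.0               # hectares of the density polygon
building_area_total = 2.0        # hectares of buildings inside that polygon
per_capita = 0.12                # demand, m3 per inhabitant per day

inhabitants = density * polygon_area                    # 600 inhabitants
per_building_unit = inhabitants / building_area_total   # 300 inhabitants per ha of buildings

# Building areas that fall into one node's catchment (from the intersection step):
buildings_in_catchment = [0.3, 0.5]                     # hectares
node_population = per_building_unit * sum(buildings_in_catchment)  # approx. 240
node_demand = node_population * per_capita              # approx. 28.8 m3/d
```

Relating the population to building area rather than spreading it evenly over the density polygon is what gives option C its higher accuracy.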

## **6. Conclusion**


The need for a more precise demand allocation in order to increase the accuracy of water distribution network models led to the development of a novel approach to calculate Voronoi diagrams with constraints. The constraints considered are convex and non-convex boundary and obstacle polygons. Thus, the approach basically solves the problem of generating Voronoi diagrams in arbitrary environments.

By means of a geometric approach, namely successive circle intersection with discrete radii, the bisectors which form the Voronoi diagram are determined. Thus, the result is not a mathematically exact solution but an iterative approximation. The accuracy can be controlled by the distances between the individual radii. For GIS software, the discretisation does not present a drawback, as only polygons with discrete vertices can be stored.
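The principle can be illustrated for the unconstrained Euclidean case: two circles of equal radius around two sites intersect at points equidistant from both sites, i.e. on their bisector, and sweeping the radius in discrete steps samples that bisector. This is only a sketch of the geometric idea — the actual approach grows the circles along shortest paths respecting boundary and obstacle polygons:

```python
import math


def circle_intersections(c1, c2, r):
    """Intersection points of two circles of equal radius r (Euclidean case)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    if d == 0 or d > 2 * r:
        return []                              # coincident centres or no overlap
    mx, my = (c1[0] + c2[0]) / 2, (c1[1] + c2[1]) / 2
    h = math.sqrt(r * r - (d / 2) ** 2)        # half length of the common chord
    ux, uy = -dy / d, dx / d                   # unit normal to the centre line
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]


def approximate_bisector(site_a, site_b, r_max, dr):
    """Sample the bisector of two sites by sweeping discrete radii."""
    points = []
    r = math.dist(site_a, site_b) / 2          # smallest radius: circles just touch
    while r <= r_max:
        points.extend(circle_intersections(site_a, site_b, r))
        r += dr
    return points


samples = approximate_bisector((0.0, 0.0), (4.0, 0.0), r_max=5.0, dr=0.5)
# Every sampled point lies on the perpendicular bisector x = 2.
```

Shrinking `dr` densifies the sampled polyline, which is the accuracy control mentioned above.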

At the moment, the algorithm cannot compete with other algorithms for determining Voronoi diagrams based on the Euclidean distance (sweep algorithm, divide-and-conquer approach) regarding computation speed. However, the increased accuracy of the results in an application for demand allocation in water distribution network modelling (and possibly in other applications) outweighs the speed issue. Furthermore, with an enhanced radius controlling mechanism and a reduction of the circle combinations for which intersections are computed, there is still potential for improvement. Moreover, the overall speed could be increased by using highly parallel computation, since each circle combination can be processed independently. This could be facilitated by means of GPU or multi-CPU usage.

Another method to compute the shortest path Voronoi diagram following the described approach is the analytic representation of the bisectors. A bisector between just two nodes can be defined as a mathematical function. The major challenge is to identify those sections of the functions which make up the bisectors of the total Voronoi diagram based on more than two nodes by intersecting the functions. Due to the mathematical properties, the function intersections cannot be computed directly but only iteratively. Thus, in a mathematical sense, this approach is not exact either. A benefit of the analytically described bisectors is that only two points per function have to be calculated. To compute a Voronoi diagram polygon, the relevant vertices can simply be calculated with a freely chosen density using the determined functions. Thus, this computation has no impact on the time needed for the actual determination of the mathematical description of the Voronoi diagram. An analytic solution is currently being developed by the authors.

Furthermore, a GIS-tool to allocate the water demand was developed and presented in section 5. The GIS-tool consists of two principal calculation steps. First, the catchment areas of the nodes are computed using the developed approach to calculate Voronoi diagrams. The second step consists of three options to determine the demand. The most precise result is achieved by applying option A due to the superior input data. In this case, values and locations of demands are known and directly allocated to the nodes according to the Voronoi diagram. If such input data is not available, the distribution of water demand has to be determined first. Option B offers an approach to determine the water demand based on the population distribution, which is considered to be constant for defined areas. Option C follows the same approach but provides a higher accuracy by relating the population (and thus the demand) to the building density instead of assuming a constant population distribution for specific areas. The developed GIS-tool has already been successfully applied in several network modelling projects of the Institute for Water and River Basin Management of the Karlsruhe Institute of Technology and has proved to be a useful solution for demand allocation.

Finally, it is worth mentioning that the application of the presented approach is not limited to demand allocation in water distribution network modelling. A broad field of applications is conceivable, such as the determination of rainfall catchment areas, the modelling of growth processes or stochastic analysis.

## **Author details**

Nicolai Guth and Philipp Klingel

*Karlsruhe Institute of Technology (KIT), Institute for Water and River Basin Management, Germany*

## **7. References**

Aronov, B. (1987). On the geodesic Voronoi diagram of point sites in a simple polygon. *Proceedings of the third annual symposium on computational geometry, SCG '87*, pp. 39-49, Association for Computing Machinery, ISBN 0-89791-231-4, Waterloo, Ontario, Canada, 1987

Aurenhammer, F. (1991). Voronoi diagrams – A Survey of a Fundamental Geometric Data Structure. *ACM Computing Surveys*, Vol.23, No.3, (September 1991), pp. 345-405, ISSN 0360-0300

Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. *Numerische Mathematik*, Vol.1, No.1, (1959), pp. 269-271, ISSN 0029-599X

Fortune, S. (1986). A Sweepline Algorithm for Voronoi Diagrams. *Proceedings of the second annual symposium on computational geometry, SCG '86*, pp. 313-322, Association for Computing Machinery, ISBN 0-89791-194-6, Yorktown Heights, New York, USA, 1986

Guth, N. (2009). *Ermittlung der Einzugsflächen von Knoten mittels Voronoi-Diagrammen unter Berücksichtigung von Nebenbedingungen*, Bachelor thesis (unpublished), Hochschule Karlsruhe Technik und Wirtschaft, Karlsruhe, Germany

Johnson, W. A. & Mehl, R. F. (1939). Reaction kinetics in processes of nucleation growth. *Transactions of American Institute of Mining and Metallurgical and Petroleum Engineers*, Vol.135, (1939), pp. 416-458

Klingel, P. (2010). *Von intermittierender zu kontinuierlicher Wasserverteilung in Entwicklungsländern*, PhD thesis, Karlsruhe Institute of Technology, Department of Civil Engineering, Geo and Environmental Sciences, Karlsruhe, Germany, Retrieved from <http://digbib.ubka.uni-karlsruhe.de/volltexte/1000019357>

Okabe, A.; Boots, B.; Sugihara, K. & Nok Chiu, S. (2000). *Spatial Tessellations: Concepts and Applications of Voronoi Diagrams* (2. edition), John Wiley & Sons, Ltd, ISBN 0-471-98635-6, Chichester, England

Ormsby, T.; Napoleon, E.; Burke, R.; Groess, C. & Feaster, L. (2004). *Getting to know ArcGIS Desktop: Basics of ArcView, ArcEditor and ArcInfo* (2. edition), ESRI Press, ISBN 978-1589482104, Redlands, USA

Papadopoulou, E. & Lee, D. T. (1998). A new approach for the geodesic Voronoi diagram of points and simple polygon and other restricted polygonal domains. *Algorithmica*, Vol.20, Nr.4, (1998), pp. 319-352, ISSN 0178-4617

Preparata, F. P. & Shamos, M. I. (1985). *Computational geometry: an introduction* (1. edition), Springer, ISBN 0-387-96131-3, New York, USA

Shamos, M. I. (1975). Geometric complexity. *Proceedings of seventh annual ACM symposium on Theory of computing, STOC '75*, pp. 224-233, Association of Computing Machinery, Albuquerque, New Mexico, USA, 1975

Walski, T. M.; Chase, D. V.; Savic, D. A.; Grayman, W.; Beckwith, S. & Koelle, E. (2003). *Advanced Water Distribution Modeling and Management*. 1, Haestad Press, ISBN 0-9714141-2-2, Waterbury, USA

**Chapter 16** 

© 2012 Patel and Waters, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



## **Using Geographic Information Systems for Health Research**

Alka Patel and Nigel Waters

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47941

## **1. Introduction**


The connection between public health and geography can be traced back to Hippocrates (c. 400 BC) who deduced that spatially varying factors such as climate, elevation, environmental toxins, ethnicity and race contributed to the spatial patterns of illness (Parchman *et al.*, 2002). The observations of Hippocrates still hold true today and these relationships between geography and disease have allowed geospatial methods to become valuable within the field of public health. Maps have long been a useful tool for visualizing patterns in health care. One of the earliest and most commonly cited examples is from the mid 1800s when Dr. John Snow deduced the source of a cholera outbreak in London based on a simple visualization of the incidents of cholera in relation to water pumps (Johnson, 2007). Although visualizing data geographically is still very valuable for uncovering patterns and associations over space, geospatial analysis has become more sophisticated over time.

One tool that can be used to apply advanced geospatial methods to health care problems is a geographic information system (GIS). The power of GIS lies in its ability to analyze, store and display large amounts of spatially referenced data. In a field where manual data analysis can become overwhelming, GIS is a valuable tool. The medical research applications of GIS are numerous and include finding disease clusters and their possible causes (Murray *et al.*, 2009; Srivastava *et al.*, 2009), improving deployment for emergency services (Ong *et al.*, 2009; Peleg & Pliskin, 2004) and determining if an area is being served adequately by health services (Cinnamon *et al.*, 2009; Schuurman *et al.*, 2008). There have been several reviews and textbooks published in the past decade that focus on the application of GIS to different areas of health research (Cromley & McLafferty, 2002; Kurland & Gorr, 2009; McLafferty, 2003; Parchman *et al.*, 2002; Rushton, 2003).


While the use of GIS is gaining favor with health researchers, barriers remain in the uptake of more advanced geospatial methods by health care decision-makers. A recent qualitative study conducted in the UK found that although health care decision-makers see the value of using GIS in the decision making process, many (especially those in the community setting) still view GIS primarily as a visualization tool (Joyce, 2009). This chapter aims to go beyond a simple discussion of GIS and its standard capabilities (e.g. mapping, buffering, clipping, intersecting, etc.) in order to focus more broadly on geospatial methods (including spatial statistics) that can be used to describe and study the distribution of disease and health care delivery. It adds to current literature by taking an approach that compares and contrasts different geospatial methods with the goal of informing health care decision-makers about the strengths and weaknesses of each.


level (Apparicio *et al.*, 2008). This method of representing population links census data tables to geographic boundary files using unique area level identifiers. The aggregation of populations into larger areal units can affect the results obtained from a spatial analysis. For example, when studying geographic access to urban amenities, Canadian studies have found that when accessibility was measured for smaller spatial units, the measures were less subject to aggregation error than when larger spatial units were used to represent


This chapter is written for an audience that has a basic knowledge of statistics and GIS. It does not give a comprehensive overview of all geospatial methods that are available for health research but instead highlights ten different geospatial methods that can be used to address problems posed when studying the distribution of disease and health service delivery. Five topic areas will be covered: representing populations, identifying disease clusters, identifying associations in a spatial context, identifying areas with access to health care and identifying the best location for a new health service. Although each method is discussed briefly, references will be provided where health researchers and decision-makers can find details. Each of these five sections will be further subdivided into two sections. 'Representing populations' will discuss the topics of *Geographically linked census data* and *Dasymetric modeling*. 'Identifying disease clusters' will discuss the topics of *Measures of spatial autocorrelation* and *Kernel density estimation*. 'Identifying associations in a spatial context' will cover *Spatial regression* and *Geographically weighted regression*. 'Identifying areas with access to a health care service' will discuss the topics of *Road network analysis* and *Interpolation*. Finally, 'Identifying the best location for a new service' will discuss the topics of *Coverage problems* and *Distance problems*.

## **2. Representing populations**

Studying relationships between disease or health service delivery and environmental or population characteristics requires researchers to represent populations in terms of their location in space. Two geospatial methods that can be used to represent population groups will be discussed in this section: geographically linked census data and dasymetric modeling.

## **2.1. Geographically linked census data**

Studies in health research may sometimes be focused on the general population that is at risk for a disease or that has access to a specific health service in an area. This group can be represented by population counts in the form of census data. Census data can be represented at many different levels. For example, in Canada census data may be represented at the census block level, the dissemination area level, and at the census tract level (Apparicio *et al.*, 2008). This method of representing population links census data tables to geographic boundary files using unique area level identifiers. The aggregation of populations into larger areal units can affect the results obtained from a spatial analysis. For example, when studying geographic access to urban amenities, Canadian studies have found that when accessibility was measured for smaller spatial units, the measures were less subject to aggregation error than when larger spatial units were used to represent populations (Apparicio *et al.*, 2008; Hewko *et al.*, 2002).
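The attribute join described above can be sketched in a few lines: a census count table is linked to boundary polygons on a shared area identifier, and each polygon is reduced to its centroid, a common point representation of an areal unit. This is a minimal illustration; the field names (`DAUID`, `population`) and the sample geometry are hypothetical, not taken from any particular census product.

```python
# Sketch: join census counts to boundary polygons by a shared area ID,
# then represent each area by its polygon centroid. Field names are
# hypothetical placeholders.

def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    a = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return (cx / (6.0 * a), cy / (6.0 * a))

def link_census(boundaries, census_rows, id_field="DAUID"):
    """Join census rows to boundary polygons on a unique area identifier."""
    counts = {row[id_field]: row["population"] for row in census_rows}
    return [
        {"id": b[id_field],
         "population": counts.get(b[id_field], 0),
         "centroid": polygon_centroid(b["ring"])}
        for b in boundaries
    ]

boundaries = [{"DAUID": "35100001", "ring": [(0, 0), (4, 0), (4, 2), (0, 2)]}]
census_rows = [{"DAUID": "35100001", "population": 512}]
print(link_census(boundaries, census_rows))
# centroid of the 4 x 2 rectangle is (2.0, 1.0)
```

In practice the join is done in GIS or database software; the point of the sketch is only that the link rests on a unique area-level identifier shared by both datasets.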

There are limitations to geospatial analysis when populations are represented in this way. As discussed above, aggregation errors differ depending on the spatial unit used. One study showed that analyses of abundant amenities (e.g. playgrounds) should use smaller areal units that better represent where the population is distributed (Hewko *et al.*, 2002). Another limitation occurs because census data are standardized based on population numbers. This causes rural areas to be represented by larger areal units that do not precisely represent where the population is located. Finally, the modifiable areal unit problem (MAUP) should be considered when using different census boundaries to represent populations. The MAUP occurs because the results of any geographic analysis can be sensitive to the areal units that are used for the study (Fotheringham & Wong, 1991; Openshaw, 1996). For example, results can differ depending on the level of census data used in analysis. Notable strengths of using census data to represent populations are: 1) in most developed countries these data are largely publicly available and 2) these data can be grouped by age categories and linked with other census variables of interest to study associations (with, for example, income, employment, education, ethnicity and other socio-economic indicators).

## **2.2. Dasymetric modeling**




While the use of GIS is gaining favor with health researchers, barriers remain in the uptake of more advanced geospatial methods by health care decision-makers. A recent qualitative study conducted in the UK found that although health care decision-makers see the value of using GIS in the decision making process, many (especially those in the community setting) still view GIS primarily as a visualization tool (Joyce, 2009). This chapter aims to go beyond a simple discussion of GIS and its standard capabilities (e.g. mapping, buffering, clipping, intersecting, etc.) in order to focus more broadly on geospatial methods (including spatial statistics) that can be used to describe and study the distribution of disease and health care delivery. It adds to the current literature by taking an approach that compares and contrasts different geospatial methods with the goal of informing health care decision-makers about the strengths and weaknesses of each.


Census data are often used by researchers to represent population numbers evenly within a specified area. For example, for analytical purposes the population within a census tract is often represented by a single point at the geographic centre of the polygon, known as the centroid of the area. A method used to represent population distribution more accurately is dasymetric modeling. Dasymetric modeling uses additional data sources to estimate the distribution of aggregated population data within a specified unit of analysis (Briggs *et al.*, 2007a; Mennis, 2009; Poulsen & Kennedy, 2004). This technique essentially uses a combination of variables to represent where the population is truly located within an area. Dasymetric models have used light emission and remotely sensed land cover data to represent populations at the small area level in Europe (Briggs *et al.*, 2007b). One study has shown how population estimates in residential areas can be represented more accurately using grid-based land cover data in combination with a mailing database to locate residential populations within a census area (Langford *et al.*, 2008). Dasymetric modeling produces population numbers that are identical to those within a census area; it is the spatial distribution of the population that is modified within these spatial units. A major limitation of this method is that it is more time intensive and requires additional data sources when compared to simply using the raw census data. The strength of this method is that it more accurately represents population distribution in large areas (e.g. rural census areas). One study found lower access to primary health care services using dasymetric modeling when compared to using point census data to represent populations in rural areas (Langford & Higgs, 2006). This suggests that census methods of representing population data could be overestimating access in rural areas.
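A minimal sketch of the core dasymetric step may help: an area's census total is redistributed across its grid cells in proportion to an ancillary weight (for example, a residential land-cover score per cell), with uninhabited cells given zero weight. The weights and cell classes below are invented for illustration; note that the area's census total is preserved exactly.

```python
# Minimal dasymetric sketch: redistribute one census area's total across
# its grid cells in proportion to an ancillary layer (here, hypothetical
# residential land-cover weights). Cells classed as uninhabited get
# weight 0, and the area's census total is exactly preserved.

def dasymetric_allocate(area_population, cell_weights):
    total = sum(cell_weights)
    if total == 0:
        raise ValueError("no inhabitable cells in this area")
    return [area_population * w / total for w in cell_weights]

# 4 cells: water (0), dense residential (3), sparse residential (1), park (0)
weights = [0.0, 3.0, 1.0, 0.0]
alloc = dasymetric_allocate(1000, weights)
print(alloc)       # [0.0, 750.0, 250.0, 0.0]
print(sum(alloc))  # 1000.0: the census count is redistributed, not changed
```

Real dasymetric models differ mainly in how the weights are derived (land cover classes, light emissions, address databases), but the reallocation principle is the same.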

Using Geographic Information Systems for Health Research 307


Both methods for representing population distribution have been shown to have strengths and limitations. In the absence of patient data or when the general population is of interest, it is possible to use census data at varying levels of aggregation. Using this method, populations can be represented as a centroid within the areal unit or as being evenly distributed throughout the area. Dasymetric mapping allows for a more accurate representation of the population distribution within areal units but at the cost of increased time and data collection effort. However, this method can be valuable when population data are represented for rural areas. Both methods suffer from the problem of the MAUP and this should be considered when applying any level of population data to analysis. Figure 1 highlights the differences in representing populations using two different representations of census data and dasymetric modeling.

**Figure 1.** Comparison of methods to represent populations (from left to right) by the centroid, evenly distributed points, and dasymetric modeling methods (Wielki, 2009)

## **3. Identifying disease clusters**

The identification of clusters of disease can help to target health care delivery and identify where resource allocation efforts should be focused. Point pattern analysis is a broad category of geospatial methods that can be used to identify disease clusters. In this section we will discuss two specific methods: measures of spatial autocorrelation and kernel density estimates.

## **3.1. Measures of spatial autocorrelation**

Spatial autocorrelation can generally be thought of as the association of a variable with itself over space. Statistical measures of spatial autocorrelation can be used to determine if the clustering of variables with similar attributes occurs over space. Moran's I is a global measure for evaluating spatial autocorrelation. It is measured using the location of a given set of points and their associated attributes over an entire study area. The values of Moran's I range from -1 to +1 (dispersed to clustered, respectively, with values close to 0 representing no spatial autocorrelation as might be observed in a random pattern) (Rogerson, 2010). One study conducted to evaluate survival from cardiac arrest in a US county used Moran's I to determine that there was no clustering of survival rates for cardiac arrest in the study area (Warden *et al.*, 2007). The Moran's I statistic can also be used to define the extent of spatial bands used for other spatial statistics, as exemplified in a study that determined hot spots using the Getis-Ord Gi\* statistic (McEntee & Ogneva-Himmelberger, 2008). The value of the Moran's I measure is that it allows for an overall assessment of whether spatial autocorrelation exists over a study area. The limitation of the Moran's I measure is that it does not take into account variation at the local level and reports only one measure for the entire study area.
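The global statistic described above can be computed directly from its standard formula, I = (n / ΣᵢΣⱼwᵢⱼ) · ΣᵢΣⱼ wᵢⱼzᵢzⱼ / Σᵢzᵢ², where the z values are deviations from the mean. The sketch below uses a toy binary contiguity matrix; it illustrates the formula only and omits the permutation-based significance testing used in practice.

```python
def morans_i(values, weights):
    """Global Moran's I: `weights` is an n x n spatial weight matrix
    (here, symmetric binary contiguity) and `values` is the attribute
    measured at each of the n locations."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / w_sum) * (num / den)

# Four areas on a line, rook contiguity (neighbours share an edge).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
clustered = [10, 9, 1, 2]  # similar values adjacent -> positive I
dispersed = [10, 1, 9, 2]  # high/low alternating -> I near -1
print(round(morans_i(clustered, w), 3), round(morans_i(dispersed, w), 3))
# → 0.323 -0.99
```

Dedicated tools (e.g. spatial statistics packages) add the inference step: the observed I is compared against values from random permutations of the data over the same geography.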

In order to deal with the limitation of the global measure of Moran's I, a local measure of spatial association was introduced by Anselin (Anselin, 1995). This local indicator of spatial association (LISA) breaks down the global indicator to allow for an understanding of how each observation contributes to the overall measure. There are three main purposes that the LISA statistic can serve: 1) to indicate local areas of spatial non-stationarity, 2) to evaluate how different locations within an area are contributing to the overall global spatial autocorrelation statistic and 3) to evaluate if spatial outliers exist (Anselin, 1995). A positive LISA value for an area indicates that the feature is surrounded by features with similar values (a cluster). A negative LISA value indicates that the feature is surrounded by features with dissimilar values (an outlier) (Anselin, 1995). LISA has been used to uncover clusters of overweight and obesity in Canada (Pouliou & Elliott, 2009) and clusters of chronic ischemic heart disease in relation to aerosol air pollution in the Eastern United States (Hu & Rao, 2009). As with most geospatial methods, the above methods of cluster analysis are limited by the quality and scale of the data that are available (Shi, 2009).
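The local breakdown can be sketched in the same style: each location's LISA value is its own deviation from the mean multiplied by the weighted sum of its neighbours' deviations, scaled by the variance. The toy data below place one low value among high neighbours, so that area receives the most negative (outlier) score. Illustrative only; significance testing is again omitted.

```python
def local_morans_i(values, weights):
    """Local Moran (LISA) for each location: positive -> surrounded by
    similar values (part of a cluster); negative -> a spatial outlier
    surrounded by dissimilar values (Anselin, 1995)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n
    return [z[i] / m2 * sum(weights[i][j] * z[j] for j in range(n))
            for i in range(n)]

# Four areas on a line, rook contiguity; area 1 is a low value among highs.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
values = [10, 1, 9, 8]
lisa = local_morans_i(values, w)
print([round(v, 2) for v in lisa])
# → [-1.44, -2.4, -0.8, 0.16]  (area 1 flagged as the strongest outlier)
```

Note how the single global figure would hide this: the local values identify *which* area departs from its neighbourhood.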

## **3.2. Kernel density estimation**


Quadrat analysis is a method that evaluates the points within polygon areas to assess whether a pattern is random, dispersed or clustered in space. Standard quadrat analysis is limited by edge effects, where the density of points between adjoining areas can change abruptly from one value to another. Kernel density estimates address the limitations of simple quadrat analysis through the use of a kernel function that smooths the values of interest over space by forming a count of events per unit area within a moving quadrat or 'window'. This produces a more spatially 'smooth' estimate of variation in spatial autocorrelation than can be obtained by using a fixed grid of quadrats (Gatrell *et al.*, 1996). A detailed description of the kernel function can be found in several resources (de Smith *et al.*, 2007a; Gatrell *et al.*, 1996).

One important consideration when using kernel density estimation is that the degree of smoothing produced is dependent on the size of the bandwidth. It is often useful to examine several density plots of the same samples, all smoothed by different amounts, in order to gain greater insight into the data (Farwell, 1999). Guidelines for choosing bandwidths vary from simple visual inspection to specific parameters such as specified resolution limits or nearest neighbor distances (Levine, 2007). One study by Lerner *et al.* used kernel density estimation in ArcGIS with default bandwidth settings to identify out-of-hospital cardiac arrests in Rochester, NY. By integrating census data with the data from a cluster analysis using kernel density estimation, they were able to discover that out-of-hospital cardiac arrest clusters were associated with communities with lower income, a higher number of African-American residents and more residents without a high school diploma than the population of the city in general (Lerner *et al.*, 2005). The identification of clusters of events could lead to an increased understanding of where health care resources should be allocated. One of the major limitations of kernel density estimation is that the choice of the size of the 'moving window' (the bandwidth of the kernel function) can alter the results of the cluster analysis (de Smith *et al.*, 2007a). This method is superior to a simple quadrat analysis because it takes into account edge effects and creates a more realistic representation of spatial variation through a smoothing process (Gatrell *et al.*, 1996).
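A bare-bones version of the estimator may make the bandwidth's role concrete: the sketch below evaluates a Gaussian kernel density surface at a point and shows that narrowing the bandwidth sharpens the peak over a cluster of events. The event coordinates are invented; real analyses would use the kernel density tools in GIS or statistics packages rather than this toy.

```python
import math

def kde2d(events, x, y, bandwidth):
    """Gaussian kernel density estimate at (x, y) from event locations.
    The bandwidth sets the size of the 'moving window': a small bandwidth
    yields sharp local peaks, a large one a smoother surface."""
    h2 = bandwidth ** 2
    total = sum(math.exp(-((x - ex) ** 2 + (y - ey) ** 2) / (2 * h2))
                for ex, ey in events)
    return total / (2 * math.pi * h2 * len(events))

# A tight cluster of three events plus one isolated event.
events = [(0, 0), (0.2, 0.1), (0.1, -0.1), (5, 5)]
for h in (0.25, 2.0):
    # density evaluated at the centre of the cluster
    print(f"h={h}: density over cluster = {kde2d(events, 0.1, 0.0, h):.3f}")
```

Running this with the two bandwidths shows a much higher peak for h = 0.25 than for h = 2.0, which is exactly the sensitivity the paragraph above warns about.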


**Figure 2.** Using kernel density estimation to identify hot spots (Hansen, 2005)

The two methods discussed to identify disease clusters are similar in that they both allow for a statistical measure of spatial association within areas. Using the global and local measures of spatial association can give a quantitative measure of whether global or local spatial autocorrelation (reflecting clustering) exists in a dataset while the kernel density estimates are a way of using data locally within windows to create a grid surface of where clustering is occurring. The spatial measures of association that show whether spatial autocorrelation exists in a dataset can indicate what type of further analysis should be undertaken. The presence of spatial autocorrelation suggests that spatial regression techniques should be used in place of simple linear regression. While the LISA method can be used to measure spatial association within a defined area, the method of kernel density estimation allows for minimizing edge effects between areal units with sharp contrasts. One recent study has shown how LISA and kernel density estimation can be used together to uncover hot spots of diarrhea in Thailand (Chaikaew *et al.*, 2009). Figure 2 shows how kernel density estimation has been used to identify areas of high spousal violence incidents in Calgary, Canada (Hansen, 2005).

## **4. Identifying associations in a spatial context**

The purpose of many studies of the distribution of health services is to determine if inequity exists. These studies strive to determine if those who are considered disadvantaged are served as equally by health services as those who are not (Meade & Emch, 2010). Studies may also seek to understand relationships between disease prevalence and various environmental variables. Spatial regression and geographically weighted regression are two spatial statistical methods that can be used to study these types of associations between dependent and independent variables.

## **4.1. Spatial regression**


When exploring variables over space it is often found that spatial autocorrelation exists. Spatial dependence violates the assumption of independence needed for standard linear regression (Legendre, 1993). Spatial regression is a valuable method in situations where spatial autocorrelation exists because it considers the spatial dependence between variables by giving weights to them based on distance. The weighting matrix used in the spatial regression process accounts for the spatial dependency in the dependent variable (Fotheringham *et al.*, 2005b). To determine accurately the weights used within this matrix, it is necessary first to determine the range (distance) of the spatial dependence on the dependent variable using a semivariogram, a graph that describes how the data are related (correlated) with distance. Beyond the range defined by the semivariogram there is no longer spatial dependence (Fotheringham *et al.*, 2005b). This incorporation of spatial dependence is important because spatial autocorrelation can reduce the efficiency of the ordinary least squares estimator used in linear regression. Since most features over space have an element of spatial dependence associated with them, spatial regression may be a better choice than simple linear regression when modeling features over space, although spatial dependence can be removed using a variety of methods, including detrending procedures where spatial trends are evident.
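The semivariogram mentioned above can be sketched as follows: for each distance bin, it reports half the mean squared difference between all pairs of observations separated by that lag. The transect data below are invented; with a smooth spatial trend, semivariance rises with distance, and the lag where it levels off would be read off as the range.

```python
import math

def empirical_semivariogram(points, values, bin_edges):
    """Empirical semivariogram: for each distance bin, half the mean squared
    difference between all pairs of observations whose separation falls in
    that bin. The lag where gamma levels off (the 'range') bounds the spatial
    dependence used when building a spatial-regression weight matrix."""
    sums = [0.0] * (len(bin_edges) - 1)
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for b in range(len(bin_edges) - 1):
                if bin_edges[b] <= d < bin_edges[b + 1]:
                    sums[b] += (values[i] - values[j]) ** 2
                    counts[b] += 1
                    break
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

# Six stations along a transect with a smooth trend in the attribute:
# nearby values are more alike, so semivariance grows with lag distance.
points = [(i, 0) for i in range(6)]
gamma = empirical_semivariogram(points, [0, 1, 2, 3, 4, 5], [0, 1.5, 3.5, 5.5])
print(gamma)  # monotonically increasing with distance
```

In a full workflow a model (spherical, exponential, etc.) is fitted to these empirical points and its range parameter is then used to cut off the spatial weights.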



The benefit of using the spatial regression technique is its ability to create a model that can be used to make generalized inferences in an area while accounting for the effects of spatial dependence. A study by Zenk *et al*. examined the association between neighborhood racial composition, neighborhood poverty and spatial accessibility to supermarkets using spatial regression and found that racial segregation places African-Americans in more disadvantaged neighborhoods and at further distances from supermarkets (Zenk *et al.*, 2005). Although spatial regression can account for spatial dependence in a dataset, a recent review states that there is always the possibility of ecological bias when area level data are used to represent individual characteristics within studies (Wakefield, 2007).

## **4.2. Geographically weighted regression**

Geographically weighted regression (GWR) is a local regression model that creates unique local coefficients in order to describe the variation in the dependent variable. This creates local models that are well suited to exploring the variation in the data graphically (Jetz *et al.*, 2005). Instead of incorporating the spatial dependence into the model as is done through the process of spatial regression, GWR "measures the relationships inherent in the model at each point" (Fotheringham *et al.*, 2005a). The weighting matrix used in GWR depends on the location of individual points and thus must be recalculated at each point. This is in contrast to the spatial regression method, where a single contiguity weighting matrix is used for the entire study area (based on the extent of spatial dependence in the dependent variable). In GWR, kernels can be used to incorporate the local density of the data into the weighting process (Fotheringham *et al.*, 2005a). GWR also has the benefit of being able to evaluate changes in the strength of relationships over different scales, which is difficult to do using spatial regression (Jetz *et al.*, 2005). Previous publications have summarized GWR and local analysis in further detail (Brunsdon *et al.*, 1996; Fotheringham *et al.*, 1998).
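To make the idea of location-specific coefficients concrete, the sketch below fits a distance-weighted regression at a single point using a Gaussian kernel. It is a minimal illustration with one predictor and hypothetical data, not the full GWR procedure; dedicated GWR software also selects the bandwidth and reports diagnostics.

```python
import math

def gwr_point(data, x0, y0, bandwidth):
    """Fit a distance-weighted simple regression z = a + b*x at the
    location (x0, y0). Gaussian kernel weights decay with distance
    from (x0, y0), so the coefficients are local: the weights, and
    hence the fit, are recomputed at every regression point.

    data is a list of (px, py, x, z) tuples: point coordinates, one
    explanatory value x and a response value z.
    """
    w = [math.exp(-0.5 * (((px - x0) ** 2 + (py - y0) ** 2)
                          / bandwidth ** 2))
         for px, py, _, _ in data]
    sw = sum(w)
    mx = sum(wi * x for wi, (_, _, x, _) in zip(w, data)) / sw
    mz = sum(wi * z for wi, (_, _, _, z) in zip(w, data)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (_, _, x, _) in zip(w, data))
    sxz = sum(wi * (x - mx) * (z - mz) for wi, (_, _, x, z) in zip(w, data))
    slope = sxz / sxx
    return mz - slope * mx, slope  # local intercept and slope

# Hypothetical data: the x-z relationship is steeper in the eastern
# cluster (around px = 10) than in the western one (around px = 0)
data = [(0, 0, 1, 1), (0, 1, 2, 2), (1, 0, 3, 3), (1, 1, 4, 4),
        (10, 0, 1, 3), (10, 1, 2, 6), (11, 0, 3, 9), (11, 1, 4, 12)]
west = gwr_point(data, 0.5, 0.5, bandwidth=1.0)
east = gwr_point(data, 10.5, 0.5, bandwidth=1.0)
```

Because the kernel down-weights distant observations, the two fits recover different local slopes (roughly 1 in the west and 3 in the east), which is exactly the spatial non-stationarity that a single global coefficient would hide.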

GWR has been used in a recent study to evaluate the relationship between cold surges and cardiovascular disease (CVD) mortality. This study found that the CVD mortality rates increased significantly after cold surges and that the tolerance of different populations to these weather changes differed over space (Yang *et al.*, 2009). Another study compared the findings from linear regression and GWR when studying access to parks and physical activity sites and found that while OLS regression found weak relationships between park density and physical activity site density, GWR indicated disparities in accessibility that varied over space (Maroko *et al.*, 2009). This study shows the importance of using the appropriate method of analysis for the data that is under study. Yet another study used GWR as a tool to indicate the need for spatial coefficients when linking health and environmental data in geographical analysis (Young *et al.*, 2009).

The main difference between spatial regression and GWR is that one is a global model and one is a local model. Overall, GWR should be used when there is spatial non-stationarity or when relationships between variables differ over space (Fotheringham *et al.*, 2005a). A simple map of the residuals of a model indicates whether a spatial regression method should be employed over a traditional linear regression method; when the residuals are clustered into high and low values over space it is likely necessary to use a spatial regression method. The residuals of a spatial regression or GWR can also be mapped to assess the areas where the models are under or over estimating (Fotheringham *et al.*, 2005b).
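One common way to formalize the "clustered residuals" check described above is Moran's I. The sketch below computes the statistic for hypothetical residuals under an assumed binary contiguity matrix; in practice the statistic would be accompanied by a significance test.

```python
def morans_i(values, neighbours):
    """Moran's I under binary contiguity weights. A clearly positive
    value means high (or low) residuals cluster next to each other,
    which suggests fitting a spatial regression model instead of an
    ordinary least squares model.

    neighbours maps each area index to the indices adjacent to it.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(dev[i] * dev[j] for i in neighbours for j in neighbours[i])
    den = sum(d * d for d in dev)
    w_sum = sum(len(js) for js in neighbours.values())
    return (n / w_sum) * (num / den)

# Hypothetical OLS residuals for four areas along a corridor: high
# residuals sit next to high ones, low next to low -> I is positive
residuals = [2.0, 2.1, -2.0, -2.1]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(round(morans_i(residuals, adjacency), 2))  # 0.33
```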

## **5. Identifying areas with access to a health care service**

The concept of access to health care is a central focus of many health research studies. Spatial accessibility is one aspect of access to health care that can be measured using geospatial methods. Geographic access to a service can be represented as straight line distance, network distance and travel time based on travel along a road network (Apparicio *et al.*, 2008); travel effort can also be represented as a continuous travel time surface over a specified area. The two methods that we will focus our discussion on in this section are road network analysis and interpolation.

## **5.1. Road network analysis**


Network analysis has been considered by many researchers to be a more accurate method for evaluating accessibility in terms of travel time and distance when the focus is on travel via roads (Brabyn & Skelly, 2002; Christie & Fone, 2003). These researchers argue that although straight line Euclidean distance and the evaluation of catchments/service areas using Thiessen polygons are commonly used in accessibility analysis, these methods only allow for a crude measure of accessibility (Brabyn & Skelly, 2002). Network analysis allows for a more realistic representation of travel times along a road network by accounting for the different travel speeds along the various road links. Network analysis has been used in health care research to study access to video lottery terminals (Robitaille & Herjean, 2008), access to parks (Comber *et al.*, 2008) and access to palliative care (Cinnamon *et al.*, 2008). The measures of access obtained through network analysis can then be used in combination with census data to assess the population numbers with access to a facility within a specified distance or travel time (Christie & Fone, 2003; Nallamothu *et al.*, 2006). The travel times and distances can also be used to evaluate associations between measures of access and various health indicator (socio-economic) variables (Hare & Barcus, 2007). One UK study found that travel time estimates based on network analysis may be preferred to reported travel times for modeling purposes because the reported times can reflect unusual circumstances and reporter error (Haynes *et al.*, 2006). The strength of road network analysis is that it allows for the estimation of travel times between different point locations along a road network. It does not allow for continuous travel time surfaces to be mapped. In order to do this we can use a method referred to as interpolation.
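At its core, road network analysis rests on shortest-path computation over links weighted by travel time. The sketch below uses Dijkstra's algorithm on a small hypothetical network; GIS network analysis tools add real road geometry, turn restrictions and per-class speed attributes on top of this basic idea.

```python
import heapq

def travel_times(graph, origin):
    """Dijkstra's algorithm: minutes from origin to every reachable node.

    graph maps node -> list of (neighbour, minutes), where minutes is
    link length divided by the travel speed assigned to that road class.
    """
    times = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > times.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, minutes in graph[node]:
            new_t = t + minutes
            if new_t < times.get(neighbour, float("inf")):
                times[neighbour] = new_t
                heapq.heappush(queue, (new_t, neighbour))
    return times

# Hypothetical network: link weights are travel minutes
roads = {
    "home":     [("junction", 4.0), ("clinic", 15.0)],
    "junction": [("home", 4.0), ("hospital", 6.0)],
    "clinic":   [("home", 15.0)],
    "hospital": [("junction", 6.0)],
}
print(travel_times(roads, "home")["hospital"])  # 10.0, via the junction
```

The resulting origin-to-facility minutes are exactly the kind of access measure that can then be joined to census or patient records.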

#### **5.2. Interpolation**

The method of interpolation is based on Tobler's first law of geography: things that are near are more similar than things that are further apart (Tobler, 2004). The process of interpolation uses mathematical modeling to 'fill in the gaps' between point data values in order to create a continuous surface of values. There are a variety of resources available to researchers that discuss in detail the theoretical basis of the different interpolation methods (Lam, 1983; Mitas & Mitasova, 1999; Waters, 1997a, 1997b). The advantage of interpolation over simple network analysis is that it can allow for a comparison between different modes of transportation to health care facilities. To do this it would be necessary to have point data representing travel times via different modes for this type of analysis. For example, using interpolation, a study conducted by Lerner *et al.* considered how GIS could be used as a tool to help make transport decisions for trauma patients (Lerner *et al.*, 1999). The goal of that study was to create a map that would show where air or ground ambulance should be preferred to transport patients to a trauma center from a given patient starting location. One Canadian study used this method to compare where ground, helicopter or fixed-wing transport was faster when transporting patients to a cardiac catheterization facility (Patel *et al.*, 2007). Figure 3 highlights how interpolation can be used to model travel times via different modes of emergency transport over an area.
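A simple instance of this 'filling in the gaps' is inverse distance weighting (IDW), which estimates each grid cell as a distance-weighted average of nearby sample values. The sketch below uses hypothetical drive-time samples; other interpolators (e.g. kriging, splines) differ mainly in how the weights are derived.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse distance weighting: estimate a value at (x, y) from
    scattered sample points, weighting nearer points more heavily
    (Tobler's first law)."""
    num = den = 0.0
    for (sx, sy), value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value          # exact interpolator at sample points
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical drive times (minutes) observed at four points
samples = [((0, 0), 10.0), ((10, 0), 30.0),
           ((0, 10), 20.0), ((10, 10), 40.0)]

# Fill a coarse grid between the points to form a continuous surface;
# the centre cell (5, 5) is equidistant from all samples, so it takes
# their simple mean, 25.0
surface = [[idw(samples, x, y) for x in range(0, 11, 5)]
           for y in range(0, 11, 5)]
```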

Using Geographic Information Systems for Health Research 313


**Figure 3.** Using interpolation to model emergency transport (Patel *et al.*, 2007)

There are strengths and limitations to using interpolation as part of a geospatial analysis. One limitation of using this method is that an accurate interpolation requires many points that are evenly distributed over the study area. Another limitation is that depending on the type of interpolation used (e.g. global or local, exact or inexact) the results could vary (Lam, 1983; Mitas & Mitasova, 1999). In spite of these limitations, interpolation is a powerful method to compare access via different modes of travel. This is achieved by evaluating those areas that are served faster by different modes of travel using grid cell comparisons. Both network analysis and interpolation can be used to estimate travel time values that can be used as measures of geographic access. These measures can then be used to evaluate the associations between spatial access to services and other variables (e.g. socio-economic variables). These access measures can also be linked with census or patient data to study the populations with access to certain services (Nallamothu *et al.*, 2006; Schuurman *et al.*, 2008). It is important to note that because the method of interpolation focuses on creating continuous surfaces from point data, it can also be used to model the concentration of environmental variables from a point source or the spread of an infectious agent over boundaries (Kistemann *et al.*, 2002).
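The grid cell comparison described above can be sketched very simply: given two interpolated travel-time surfaces, label each cell with whichever mode is estimated to be faster. The surfaces here are hypothetical.

```python
def faster_mode(ground_surface, air_surface):
    """Compare two travel-time grids cell by cell and label each
    cell with whichever transport mode is estimated to be faster."""
    return [["air" if air < ground else "ground"
             for ground, air in zip(g_row, a_row)]
            for g_row, a_row in zip(ground_surface, air_surface)]

# Hypothetical interpolated travel times (minutes) on a 2 x 3 grid:
# ground times grow with distance, air times are roughly flat
ground = [[12.0, 25.0, 44.0],
          [18.0, 33.0, 51.0]]
air    = [[30.0, 31.0, 32.0],
          [30.0, 31.0, 32.0]]
print(faster_mode(ground, air))
```

The output grid marks air transport as preferable only in the remote cells, mirroring the kind of map produced in the transport-mode studies cited above.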

## **6. Identifying the best location for a new service**

The spatial analysis of health services is based on the principle that populations and their need for healthcare vary across space. People are not located randomly about the earth and it is often observed that different areas are populated by groups with differing characteristics (e.g. socio-economic status, race, age, income, education, etc.); these varying characteristics of a group often influence their need for health services (McLafferty, 2003). Recent studies have used GIS and the principles of location analysis to evaluate optimal locations for pre-hospital helicopter emergency medical services (Schuurman *et al.*, 2009) and to evaluate how patient access can change depending on where cardiac services are located (Pereira *et al.*, 2007). The operations research literature is a strong source of methods papers and includes literature on how models can be used to maximize service area (Mahmud & Indriasari, 2009), how models can simulate different starting point locations for ambulances (Ingolfsson *et al.*, 2003) and how stochastic approaches can be superior to deterministic approaches when planning health service delivery (Harper *et al.*, 2005). It is beyond the scope of this chapter to discuss the mathematical models that are used for location analysis; this information is available elsewhere (de Smith *et al.*, 2007b; ReVelle & Eiselt, 2005). This section will discuss two different approaches for identifying the best location for a new service: coverage and distance problems.

#### **6.1. Coverage problems**


There are two types of coverage problems that are commonly referred to in the operations research literature: the location set covering problem (LSCP) and the maximal covering location problem (MCLP). The goal of LSCP is to "find the minimum number of facilities and their locations such that all neighborhoods are covered within the maximal distance or time standard" (Church & Murray, 2009a). In this model, each demand point is covered at least once. The goal of MCLP is to "find a prespecified number of facilities such that coverage within a maximal service distance or time is maximized" (Church & Murray, 2009a). Given these constraints this model seeks to provide the best possible coverage with the limited available resources and does not guarantee service. Geospatial tools available within GIS software can be used to assemble the needed data to address both types of coverage problems. The three essential data components that need to be represented in a GIS in order to solve this problem are the demand points that are to be served by the facilities, the set of possible facility locations and the service area capabilities of these facilities (Church & Murray, 2009a). Ambulance and fire department response are two examples of services that require complete coverage of an area and thus a LSCP approach to coverage. Other services may strive for as much coverage as possible based on the resources available. One example is the placement of public health care facilities. By creating catchment areas within GIS, the best possible locations for new facilities can be evaluated (based on the most area and population covered) (Murad, 2008). The strength of using these types of coverage models is that they provide a spatial-standard based framework for siting new health service facilities that considers population needs and available resources. While these frameworks can be used to study the best locations based on the coverage of a facility over a given area, in other instances a distance based approach may be required.
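As a rough illustration of the MCLP, the sketch below applies a simple greedy heuristic: repeatedly open the candidate site that adds the most not-yet-covered demand. The sites, neighbourhoods and populations are hypothetical, and operational solvers typically use exact integer programming formulations rather than this heuristic.

```python
def greedy_mclp(coverage, demand, p):
    """Greedy heuristic for the maximal covering location problem:
    repeatedly open the candidate site that adds the most not-yet-
    covered demand, until p facilities are placed.

    coverage maps site -> set of demand points within the service
    standard; demand maps demand point -> population.
    """
    chosen, covered = [], set()
    for _ in range(p):
        site = max(coverage, key=lambda s: sum(
            demand[d] for d in coverage[s] - covered))
        chosen.append(site)
        covered |= coverage[site]
    return chosen, sum(demand[d] for d in covered)

# Hypothetical candidate clinics and the neighbourhoods each would
# cover within a 15-minute service standard
coverage = {"A": {"n1", "n2"}, "B": {"n2", "n3", "n4"}, "C": {"n4", "n5"}}
demand = {"n1": 500, "n2": 900, "n3": 400, "n4": 700, "n5": 300}
print(greedy_mclp(coverage, demand, p=2))  # (['B', 'A'], 2500)
```

With only two facilities available, neighbourhood n5 is left uncovered, illustrating how the MCLP trades guaranteed service for the best coverage the budget allows.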


## **6.2. Distance problems**

There are two common approaches to finding an optimal location based on distance for a new facility: p-median and p-centre approaches. The p-median approach focuses on minimizing the average distance/cost to or from patients, while the p-centre approach focuses on minimizing the maximum distance/cost to or from patients. The limitation of solving these problems discretely is that when the number of facilities to be placed increases, the computational intensity of the calculations quickly increases (Church & Murray, 2009b). In order to deal with this limitation, heuristic algorithms are employed that use systematic procedures to trade off the quality of the solution with processing speed (de Smith *et al.*, 2007b). These algorithms strive to find an optimal solution based on specified search strategies (Church & Murray, 2009b). While these methods are founded theoretically in the mathematical modeling and operations research literature, there are a number of components to these strategies that can be evaluated using GIS (Church, 2002). For example, previous studies have used geospatial methods to evaluate the suitability of selected sites as possible locations for a new facility (Cinnamon et al., 2009; Schuurman et al., 2008) and to measure the distance between populations and proposed services (Pereira et al., 2007).
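The p-median idea can be sketched with a myopic greedy heuristic of the kind alluded to above; real applications use more sophisticated search strategies (e.g. interchange heuristics). All sites, distances and demands here are hypothetical.

```python
def greedy_p_median(dist, demand, p):
    """Greedy heuristic for the p-median problem: at each step open
    the candidate site that most reduces the total demand-weighted
    distance from every demand point to its nearest open facility.

    dist[site][point] is the travel cost from a candidate site to a
    demand point; demand[point] is the population weight.
    """
    open_sites = []

    def total_cost(sites):
        return sum(w * min(dist[s][d] for s in sites)
                   for d, w in demand.items())

    candidates = set(dist)
    for _ in range(p):
        best = min(candidates - set(open_sites),
                   key=lambda s: total_cost(open_sites + [s]))
        open_sites.append(best)
    return open_sites, total_cost(open_sites)

# Hypothetical travel minutes from three candidate sites to four
# demand points, with equal population weights
dist = {"A": {"d1": 1, "d2": 9, "d3": 8, "d4": 7},
        "B": {"d1": 8, "d2": 3, "d3": 4, "d4": 9},
        "C": {"d1": 7, "d2": 8, "d3": 9, "d4": 2}}
demand = {"d1": 100, "d2": 100, "d3": 100, "d4": 100}
print(greedy_p_median(dist, demand, p=2))  # (['B', 'A'], 1500)
```

Swapping the objective to minimize the *maximum* demand-point distance instead of the weighted sum turns the same sketch into a p-centre heuristic.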

Both the coverage approach and the distance based approach to the placement of new facilities are important. They provide a framework for optimizing the solution to the best location for a new facility. The coverage model focuses on ensuring the population is best served in terms of the placement of a new facility. The location-allocation approach is more focused on the tradeoffs in terms of minimizing the average or maximum distance that patients have to travel in order to access a service. The incorporation of patient service utilization data for both coverage and distance models will add value to the optimal solutions; these data appear to be one component that is lacking in current research that uses GIS and location-allocation approaches to recommend the placement of new services.

## **7. Conclusions**


This chapter has shown that the geospatial methods that can be used with GIS extend beyond the simple visualization of patterns. There are powerful geospatial methods available to address problems that are commonly posed by epidemiologists and health services researchers. This chapter goes beyond a simple discussion of the basic methods of data creation and mapping with GIS, to introduce researchers who have a basic understanding of GIS and statistical methods to more robust methods of spatial data analysis. It is hoped that a broader understanding of the tools available for spatial analysis will ensure that the most appropriate method is used to address a spatial problem, while the strengths and limitations of the geospatial methods employed in health research are also considered. While sophisticated tools have been introduced to deal with spatial data issues such as spatial autocorrelation, the mapping of spatial data remains a useful way of conveying the results of any spatial analysis. In the future, health researchers should aim to work in collaborative teams with geographers and operations research specialists to ensure that the most appropriate geospatial methods are being used to study the distribution of disease and health service delivery.

## **Author details**

Alka Patel *University of Calgary, Canada* 

Nigel Waters *George Mason University, USA* 

## **Acknowledgement**

We would like to thank Jeff Wielki and Chantal Hansen for giving permission to include Figure 1 and Figure 2, respectively.

## **8. References**


Anselin, L. (1995). Local indicators of spatial association - LISA. *Geographical Analysis, 27*(2), 93-115.

Apparicio, P, Abdelmajid, M, Riva, M*, et al.* (2008). Comparing alternative approaches to measuring the geographical accessibility of urban health services: Distance types and aggregation-error issues. *International Journal of Health Geographics,* 7:7.

Brabyn, L, & Skelly, C. (2002). Modelling population access to New Zealand public hospitals. *International Journal of Health Geographics*, 1:3.

Briggs, D, Fecht, D, & De Hoogh, K. (2007a). Census data issues for epidemiology and health risk assessment: Experiences from the small area health statistics unit. *Journal of the Royal Statistical Society: Series A (Statistics in Society), 170*(2), 355-378.

Briggs, DJ, Gulliver, J, Fecht, D*, et al.* (2007b). Dasymetric modelling of small-area population distribution using land cover and light emissions data. *Remote Sensing of Environment, 108*(4), 451-466.

Using Geographic Information Systems for Health Research 317

Fotheringham, AS, & Wong, DWS. (1991). The modifiable areal unit problem in multivariate

Gatrell, AC, Bailey, TC, Diggle, PJ*, et al.* (1996). Spatial point pattern analysis and its application in geographical epidemiology. *Transactions of the Institute of British* 

Hansen, C. (2005). *A GIS analysis of reported spousal violence in Calgary, AB.* University of

Hare, TS, & Barcus, HR. (2007). Geographical accessibility and Kentucky's heart-related

Harper, PR, Shahani, AK, Gallagher, JE*, et al.* (2005). Planning health services with explicit geographical considerations: A stochastic location-allocation approach. *Omega-*

Haynes, R, Jones, AP, Sauerzapf, V*, et al.* (2006). Validation of travel times to hospital

Hewko, J, Smoyer-Tomic, KE, & Hodgson, MJ. (2002). Measuring neighbourhood spatial accessibility to urban amenities: Does aggregation error matter? *Environment and* 

Hu, ZY, & Rao, KR. (2009). Particulate air pollution and chronic ischemic heart disease in the eastern United States: A county level ecological study using satellite aerosol data.

Ingolfsson, A, Erkut, E, & Budge, S. (2003). Simulation of single start station for Edmonton

Jetz, W, Rahbek, C, & Lichstein, JW. (2005). Local and global approaches to spatial data

Johnson, S. (2007). *The ghost map: The story of London's most terrifying epidemic--and how it changed science, cities, and the modern world*. London: Riverhead Trade, Penguin Group. Joyce, K. (2009). "To me it's just another tool to help understand the evidence": Public health decision-makers' perceptions of the value of geographical information systems (GIS).

Kistemann, T, Dangendorf, F, & Schweikart, J. (2002). New perspectives on the use of geographical information systems (GIS) in environmental health sciences. *International* 

Kurland, KS, & Gorr, WL. (2009). *GIS tutorial for health* (3rd ed.). Redlands, CA: ESRI Press. Lam, NSN. (1983). Spatial interpolation methods - a review. *American Cartographer, 10*(2),

Langford, M, & Higgs, G. (2006). Measuring potential access to primary healthcare services: The influence of alternative spatial representations of population. *Professional* 

Langford, M, Higgs, G, Radcliffe, J*, et al.* (2008). Urban population distribution models and service accessibility estimation. *Computers Environment and Urban Systems, 32*(1), 66-80. Legendre, P. (1993). Spatial autocorrelation - trouble or new paradigm. *Ecology, 74*(6), 1659-

statistical-analysis. *Environment and Planning A, 23*(7), 1025-1044.

hospital services. *Applied Geography, 27*(3-4), 181-205.

*International Journal of Management Science, 33*(2), 141-152.

estimated by GIS. *International Journal of Health Geographics,* 5:40.

EMS. *The Journal of the Operational Research Society, 54*(7), 736.

*Journal of Hygiene and Environmental Health, 205*(3), 169-181.

analysis in ecology. *Global Ecology and Biogeography, 14*(1), 97-98.

*Geographers, 21*(1), 256-274.

*Planning A, 34*(7), 1185-1206.

*Health & Place, 15*(3), 831-840.

*Geographer, 58*(3), 294-306.

129-149.

1673.

*Environmental Health, 8*.

Calgary, Calgary.


Fotheringham, AS, & Wong, DWS. (1991). The modifiable areal unit problem in multivariate statistical-analysis. *Environment and Planning A, 23*(7), 1025-1044.

316 Application of Geographic Information Systems

*Environment, 108*(4), 451-466.

*Operations Research, 29*(6), 541-562.

*Landscape and Urban Planning, 86*(1), 103-114.

*tools* (2 ed., pp. 71-180). Leicester, UK: Matador.

*tools* (2 ed., pp. 71-180). Leicester, UK: Matador.

162-183). London, UK: SAGE Publications.

*Environment and Planning A, 30*(11), 1905-1927.

*Place, 15*(3), 792-800.

SAGE Publications.

281-298.

Briggs, DJ, Gulliver, J, Fecht, D*, et al.* (2007b). Dasymetric modelling of small-area population distribution using land cover and light emissions data. *Remote Sensing of* 

Brunsdon, C, Fotheringham, AS, & Charlton, ME. (1996). Geographically weighted regression: A method for exploring spatial nonstationarity. *Geographical Analysis, 28*(4),

Chaikaew, N, Tripathi, NK, & Souris, M. (2009). Exploring spatial patterns and hotspots of diarrhea in Chiang Mai, Thailand. *International Journal of Health Geographics,* 8:1. Christie, S, & Fone, D. (2003). Equity of access to tertiary hospitals in Wales: A travel time

Church, RL. (2002). Geographical information systems and location science. *Computers &* 

Church, RL, & Murray, AT. (2009a). Chapter 9: Coverage. In *Business site selection, location* 

Church, RL, & Murray, AT. (2009b). Chapter 11: Location-allocation. In *Business site selection,* 

Cinnamon, J, Schuurman, N, & Crooks, VA. (2008). A method to determine spatial access to specialized palliative care services using GIS. *BMC Health Services Research,* 8:140. Cinnamon, J, Schuurman, N, & Crooks, VA. (2009). Assessing the suitability of host communities for secondary palliative care hubs: A location analysis model. *Health &* 

Comber, A, Brunsdon, C, & Green, E. (2008). Using a GIS-based network analysis to determine urban greenspace accessibility for different ethnic and religious groups.

de Smith, M, Goodchild, M, & Longley, P. (2007b). Chapter 7: Network and location analysis. In *Geospatial analysis: A comprehensive guide to principles, techniques and software* 

Farwell, D. (1999). Specifying the bandwidth function for the kernel density estimator., from

Fotheringham, AS, Brunsdon, C, & Charlton, ME. (2005a). Chapter 5: Local analysis. In *Quantitative geography: Perspectives on spatial data analysis* (pp. 93-129). London, UK:

Fotheringham, AS, Brunsdon, C, & Charlton, ME. (2005b). Chapter 7: Spatial regression and geostatistical models. In *Quantitative geography: Perspectives on spatial data analysis* (pp.

Fotheringham, AS, Charlton, ME, & Brunsdon, C. (1998). Geographically weighted regression: A natural evolution of the expansion method for spatial data analysis.

http://www.mrc-bsu.cam.ac.uk/bugs/documentation/coda03/node44.html

Cromley, EK, & McLafferty, SL. (2002). *GIS and public health*. USA: The Guilford Press. de Smith, M, Goodchild, M, & Longley, P. (2007a). Chapter 4: Building blocks of spatial analysis. In *Geospatial analysis: A comprehensive guide to principles, techniques and software* 

*location analysis, and GIS* (pp. 259-280). Hoboken, NJ: John Wiley & Sons, Inc.

*analysis, and GIS* (pp. 209-233). Hoboken, NJ: John Wiley & Sons, Inc.

analysis. *Journal of Public Health Medicine, 25*(4), 344-350.


Lerner, EB, Billittier, AJ, Sikora, J*, et al.* (1999). Use of a geographic information system to determine appropriate means of trauma patient transport. *Academic Emergency Medicine, 6*(11), 1127-1133.

Using Geographic Information Systems for Health Research 319

Patel, AB, Waters, NM, & Ghali, WA. (2007). Determining geographic areas and populations with timely access to cardiac catheterization facilities for acute myocardial infarction

Peleg, K, & Pliskin, JS. (2004). A geographic information system simulation model of EMS: Reducing ambulance response time. *American Journal of Emergency Medicine, 22*(3), 164-

Pereira, A, Niggebrugge, A, Powels, J*, et al.* (2007). Potential generation of geographical inequities by the introduction of primary percutaneous coronary intervention for the management of st segment elevation myocardial infarction. *International Journal of* 

Pouliou, T, & Elliott, SJ. (2009). An exploratory spatial analysis of overweight and obesity in

Poulsen, E, & Kennedy, LW. (2004). Using dasymetric mapping for spatially aggregated

ReVelle, CS, & Eiselt, HA. (2005). Location analysis: A synthesis and survey - invited review.

Robitaille, E, & Herjean, P. (2008). An analysis of the accessibility of video lottery terminals:

Schuurman, N, Bell, N, Hameed, MS*, et al.* (2008). A model for identifying and ranking need for trauma service in nonmetropolitan regions based on injury risk and access to

Schuurman, N, Bell, NJ, L'Heureux, R*, et al.* (2009). Modelling optimal location for prehospital helicopter emergency medical services. *BMC Emergency Medicine, 9*, 6. Shi, X. (2009). A geocomputational process for characterizing the spatial pattern of lung cancer incidence in New Hampshire. *Annals of the Association of American Geographers,* 

Srivastava, A, Nagpal, BN, Joshi, PL*, et al.* (2009). Identification of malaria hot spots for focused intervention in tribal state of India: A GIS based approach. *International Journal* 

Tobler, W. (2004). On the first law of geography: A reply. *Annals of the Association of* 

Wakefield, J. (2007). Disease mapping and spatial regression with count data. *Biostatistics,* 

Warden, CR, Daya, M, & LeGrady, LA. (2007). Using geographic information systems to

evaluate cardiac arrest survival. *Prehospital Emergency Care, 11*(1), 19-24.

Rogerson, PA. (2010). *Statistical methods for geography* (Third ed.). Los Angeles, CA: Sage. Rushton, G. (2003). Public health, GIS, and spatial analytic tools. *Annual Review of Public* 

care in Alberta, Canada. *International Journal of Health Geographics,* 6:47.

170.

*Health Geographics*, 6:43.

*Health, 24*, 43-56.

*99*(3), 521-533.

*8*(2), 158-183.

*of Health Geographics,* 8:30.

*American Geographers, 94*(2), 304-310.

Waters, NM. (1997a). Unit 40 - spatial interpolation I. from

Waters, NM. (1997b). Unit 41 - spatial interpolation II. from

http://www.geog.ubc.ca/courses/klink/gis.notes/ncgia/u40.html

http://www.geog.ubc.ca/courses/klink/gis.notes/ncgia/u41.html

Canada. *Preventive Medicine, 48*(4), 362-367.

crime data. *Journal of Quantitative Criminology, 20*(3), 243-262.

The case of Montreal. *International Journal of Health Geographics,* 7:2.

services. *Journal of Trauma-Injury Infection & Critical Care, 65*(1), 54-62.

*European Journal of Operational Research, 165*(1), 1-19.


Patel, AB, Waters, NM, & Ghali, WA. (2007). Determining geographic areas and populations with timely access to cardiac catheterization facilities for acute myocardial infarction care in Alberta, Canada. *International Journal of Health Geographics,* 6:47.

318 Application of Geographic Information Systems

*6*(11), 1127-1133.

*Place, 14*(4), 817-828.

*Compass, 3*(2), 727-745.

*Health, 2*(2), 151-160.

*113*(9), 1189-1195.

Wiley & Sons, Inc.

*of Tuberculosis & Lung Disease, 13*(6), 767-774.

84.

Lerner, EB, Billittier, AJ, Sikora, J*, et al.* (1999). Use of a geographic information system to determine appropriate means of trauma patient transport. *Academic Emergency Medicine,* 

Lerner, EB, Fairbanks, RJ, & Shah, MN. (2005). Identification of out-of-hospital cardiac arrest clusters using a geographic information system. *Academic Emergency Medicine, 12*(1), 81-

Levine, N. (2007). CrimeStat: A spatial statistics program for the analysis of crime incident

Mahmud, A, & Indriasari, V. (2009). Facility location models development to maximize total

Maroko, AR, Maantay, JA, Sohler, NL*, et al.* (2009). The complexities of measuring access to parks and physical activity sites in New York City: A quantitative and qualitative

McEntee, JC, & Ogneva-Himmelberger, Y. (2008). Diesel particulate matter, lung cancer, and asthma incidences along major traffic corridors in MA, USA: A GIS analysis. *Health &* 

Meade, MS, & Emch, M. (2010). *Medical geography* (Third ed.). New York: The Guilford Press. Mennis, J. (2009). Dasymetric mapping for estimating population in small areas. *Geography* 

Mitas, L, & Mitasova, H. (1999). Chapter 34: Spatial interpolation. In P. Longley, Goodchild, M. F., Maguire, D. J. and Rhind, D. W. (Ed.), *Geographical information systems: Principles, techniques, applications and management* (Second ed., pp. Pp. 481-492): Wiley, New York. Murad, AA. (2008). Defining health catchment areas in Jeddah city, Saudi Arabia: An example demonstrating the utility of geographical information systems. *Geospatial* 

Murray, EJ, Marais, BJ, Mans, G*, et al.* (2009). A multidisciplinary method to map potential tuberculosis transmission 'hot spots' in high-burden communities. *International Journal* 

Nallamothu, BK, Bates, ER, Wang, YF*, et al.* (2006). Driving times and distances to hospitals with percutaneous coronary intervention in the United States - implications for prehospital triage of patients with ST-elevation myocardial infarction. *Circulation,* 

Ong, ME, Ng, FS, Overton, J*, et al.* (2009). Geographic-time distribution of ambulance calls in Singapore: Utility of geographic information system in ambulance deployment (CARE

Oppenshaw, S. (1996). Chapter 4: Developing GIS-relevent zone-based spatial analysis methods. In Longley & Batty (Eds.), *Spatial analysis: Modelling in a GIS environment*: John

Parchman, ML, Ferrer, RL, & Blanchard, KS. (2002). Geography and geographic information

3). *Annals of the Academy of Medicine, Singapore, 38*(3), 184-191.

systems in family medicine research. *Family Medicine, 34*(2), 132-137.

service area. *Theoretical and Empirical Researches in Urban Management*, 87.

McLafferty, SL. (2003). GIS and health care. *Annual Review of Public Health, 24*, 25-42.

locations. Washington, DC: National Institute of Justice.

approach. *International Journal of Health Geographics,* 8:34.


Wielki, J. (2009). The development of barriadas & access to medical services in Lima, Peru. Retrieved November, 2009, from

**Chapter 17**

© 2012 Chin et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



## **A Primer on Recent Advancement on Freight Transportation**

Shih-Miao Chin, Francisco M. Oliveira-Neto, Ho-Ling Hwang, Diane Davidson, Lee D. Han and Bruce Peterson

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47955

## **1. Introduction**


The efficient and reliable flow of urban goods and services is essential for the economic well-being of the United States (U.S.), particularly for the majority of Americans who live in urbanized areas. According to recently published estimates (FAF3, 2007), the U.S. freight system moved about 19 billion tons of goods valued at \$17 trillion in 2007, resulting in a total of 5.7 trillion ton-miles of movement during that year. The performance of the freight system affects national productivity, the costs of goods and services, and the global competitiveness of American industries. While demand for freight transportation has been rising steadily and shows little sign of abating, the capacity of the freight system has increased only modestly. Without an efficient freight system, other critical systems, such as energy supply, can be seriously impacted.
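
As a quick sanity check, the quoted national totals imply an average length of haul and an average value per ton, derived directly from the figures above:

```python
# Back-of-envelope check on the FAF3 figures quoted above.
tons = 19e9          # tons moved in 2007
ton_miles = 5.7e12   # ton-miles in 2007
value = 17e12        # shipment value in dollars

avg_haul_miles = ton_miles / tons   # average length of haul per ton
value_per_ton = value / tons        # average value per ton

print(f"average haul: {avg_haul_miles:.0f} miles per ton")   # average haul: 300 miles per ton
print(f"average value: ${value_per_ton:,.0f} per ton")        # average value: $895 per ton
```

These are system-wide averages only; actual haul lengths and values vary enormously by mode and commodity, which is precisely why the disaggregate databases and models described later in the chapter are needed.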

In order to prepare for future transportation challenges, transportation agencies at the federal, state, and local levels are heavily engaged in comprehensive transportation planning efforts. Local, regional, and national planners and policymakers are beginning to understand more deeply how freight transportation issues apply within urban areas in the United States and around the world. Furthermore, freight mobility is a leading factor in a private business's decision to locate in or near a particular city. The goods movement supply chain is very complex, however: it uses regional, state, and national infrastructure while adopting increasingly complicated intermodal and multimodal transport solutions. Therefore, both the public and private sectors conduct planning and analysis upon which to make critical strategic decisions. This requires that freight data, modelling techniques, and other computational and visualization tools be available to aid transportation stakeholders in their decision-making and analysis.


For over a decade, the Center for Transportation Analysis (CTA) of the Oak Ridge National Laboratory (ORNL) has provided assistance to federal transportation agencies in the development of comprehensive national and regional freight databases and network flow models. The objective of this chapter is to provide readers with an understanding of recent ORNL advancements in freight analysis and visualization.


For the freight transportation system to move commodities effectively and efficiently, it needs to overcome geographic barriers within specified time constraints. To this end, the freight system must be multidimensional and dynamic: it must accommodate different modes of transportation (truck, rail, water, air, and pipeline), involve both public and private sectors, and continue to evolve through time. To understand such a complex system, one needs not only reliable data but also sophisticated computational tools.

Over the years, practitioners within the transportation research community have collected and created a wealth of information stored in many database systems throughout their organizations. Some of these systems are well established and maintained, while others involve smaller data sets collected by individuals or by an office for a specific purpose. In either case the data are valuable by themselves, and could potentially offer broader insight if users were given the ability to integrate these individual data sources.

Because of the spatial and temporal components embedded in almost all forms of freight data, a Geographic Information System (GIS) is an ideal computational framework for freight transportation analysis and modelling. GIS is well suited to managing, planning, evaluating, and maintaining the transportation system: huge geographic databases can be created and maintained, and many forms of spatial data can be integrated into a GIS platform. It provides the means for researchers to evaluate the system and determine the impacts of capacity enhancements, operational improvements, and public transportation investments. The network representation of the freight system in GIS is invaluable for visualizing critical segments, links, or nodes, and it makes possible large network analyses that would be computationally intensive on other platforms. GIS was therefore the tool selected to integrate data within the transportation community.
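
As a minimal illustration of the kind of network analysis a GIS platform enables, the sketch below represents a toy freight network as a weighted graph and finds a minimum-impedance route with Dijkstra's algorithm. The terminals and travel times are invented; a real GIS network would carry far richer link attributes (capacity, mode, grade restrictions, and so on).

```python
import heapq

# A toy freight network: nodes are terminals, edge weights are travel
# times in hours. Names and numbers are illustrative only.
network = {
    "Chicago":   {"St. Louis": 5.0, "Detroit": 4.5},
    "St. Louis": {"Chicago": 5.0, "Memphis": 4.0},
    "Detroit":   {"Chicago": 4.5, "Memphis": 15.0},
    "Memphis":   {"St. Louis": 4.0, "Detroit": 15.0},
}

def shortest_path(graph, origin, destination):
    """Dijkstra's algorithm: minimum-impedance route through the network."""
    queue = [(0.0, origin, [origin])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

cost, path = shortest_path(network, "Detroit", "Memphis")
print(cost, path)
```

Here the routed path (Detroit via Chicago and St. Louis, 13.5 hours) beats the direct 15-hour link, the kind of result an analyst would visualize on the GIS network rather than read from a printout.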

With the aid of a set of analytical and interactive tools, a GIS framework enables freight transportation analysts to understand the complex and dynamic spatial interactions among commodities (ranging from raw materials to semi-finished and finished products), shippers (farmers, mining companies, factories, and manufacturers), carriers (trucking, rail, and marine companies, freight forwarders, and third-party logistics providers), consignees, and end consumers.

## **2. Aims of the chapter**

This chapter will provide the reader with an understanding of tools and models developed and used by the CTA to understand multimodal freight commodity flow. To this end, we begin with a description of the components of the transportation system: the modes of transportation, the commodities transported, the transportation industry sector, and the network sub-systems. We then address some of the modelling techniques and computational tools used by practitioners to estimate current and future freight demand and flows on the network system. Specifically, this chapter focuses on geographic databases and analytical tools that have been developed by the CTA to understand U.S. goods movement. After reading this chapter, the reader will have a general understanding of the current freight transportation system and of the analytical tools that have been developed to estimate freight demand and evaluate the system's performance.

## **3. Understanding the movement of goods**

To understand how the movement of goods is shaped, we need to define the main agents, or system components, affecting the demand for freight. The freight activity pattern that results from the interactions between these components can be captured through data surveys. In this section the freight transportation system is introduced, and the framework for freight data developed by the CTA group is presented.

## **3.1. Freight transportation system**


The term freight primarily refers to the long-haul component of the supply chain. Freight shipments themselves can move by truck, rail, ocean or air and can be characterized as intercity, port to transport terminal, terminal to terminal, interplant, plant to distribution center, and distribution center to distribution center. The freight transportation system can be seen as a system composed of three main components: decision makers or users with a demand for goods movements (producers, consumers, and transportation companies); the physical system supply (transportation infrastructure); and commodities (all different types of goods to be produced and transported).

The demand for the movement of freight involves multiple decision makers: *producers* of goods and services, who decide how much and how to produce, and where and at what price to sell; *consumers*, either intermediate (production companies) or final (households, businesses, public agencies, etc.), who decide what to consume and how much; and *transportation companies* (shippers and carriers), who decide how to provide transportation services. These decision makers are responsible for production, logistics, distribution, and marketing. The location of producers and consumers certainly affects the spatial distribution of transportation supply, and therefore the pattern of goods movement.

Commodities are the result of production and represent the entity to be transported between geographic locations. The range of products needed and produced is vast, and many final products (finished goods) can require a set of other products (semi-finished goods or raw materials) to be produced. Therefore, in assessing freight demand, an analyst must be aware of these inter-commodity production relationships. Furthermore, a great variety of vehicle types are necessary to match different commodity types and shipment sizes. Depending on the characteristics of a particular commodity (e.g. hazardous materials such as toxic and flammable substances; high value; perishable), its means of transportation may require specific treatment in terms of routing and vehicle specifications.

The physical system is an intermodal and multimodal system composed of transportation modes; sub-networks, where commodities move; and the interfaces between sub-networks, where commodities are transferred or handled between different modes. In a basic sense, freight transportation modes can be classified as truck, rail, water, air, multiple-modes, and pipeline. Each mode can include different types of fleet and equipment classifications from different transportation companies. The concept of mode in freight transportation is distinctly different from the case of passenger transportation. In freight transportation, mode encompasses physical (the sequence of transportation modes used for a consignment) and organizational (the sequence of entities responsible for transportation) aspects of movements.

A Primer on Recent Advancement on Freight Transportation 325


In abstract terms, sub-networks are composed of subsets of the single-mode network systems, i.e. highway, railroad, waterway, and airway; each of these subsets represents a different reality, whether it is a type of service available, a company ownership, or a different type of vehicle (Southworth and Peterson, 2000). The interfaces connecting two or more sub-networks can be either transfer terminals (e.g. seaports and airports, where commodities are transferred between different modes or different vehicles) or intermodal points (e.g. rail yards, where commodities transported on the same single-mode system are handled by different companies).
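As a rough illustration of this layered structure, the sketch below tags each link with its mode and represents an interface as a zero-length transfer link joining two logical endpoints at the same geographic location. All node names, modes, and mileages are made up for illustration; this is not the CTA network schema.

```python
# Layered multimodal network sketch (illustrative data only).
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    tail: str     # logical start node
    head: str     # logical end node
    mode: str     # "truck", "rail", "water", or "transfer"
    miles: float  # physical length; 0.0 for transfer links

network = [
    Link("A_hwy", "T1_hwy", "truck", 120.0),     # highway sub-network
    Link("T1_hwy", "T1_rail", "transfer", 0.0),  # truck-rail terminal: two logical
                                                 # endpoints at the same location
    Link("T1_rail", "T2_rail", "rail", 800.0),   # railroad sub-network
    Link("T2_rail", "T2_hwy", "transfer", 0.0),
    Link("T2_hwy", "B_hwy", "truck", 60.0),
]

def sub_network(links, mode):
    """Recover a single-mode sub-network by filtering on the mode tag."""
    return [l for l in links if l.mode == mode]

print(len(sub_network(network, "truck")))  # 2
```

Filtering on the mode tag recovers each single-mode sub-network, while the zero-length transfer links carry the cost of handling goods between modes.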

## **3.2. Framework for geographic freight data and analysis**

Transportation engineers and policy makers require data and analytical tools to fully understand the complex movement of goods on the freight transportation system. Although considerable mode- and shipper-specific data is collected for a variety of purposes, the sheer magnitude of the freight data system, as well as industrial confidentiality concerns, limits the freight data which is made available to the public. Typically, a considerable portion of the freight data is not disclosed by public agencies.

The Commodity Flow Survey (CFS) is probably the most comprehensive survey of freight activities in the U.S. The survey is conducted every five years as a part of the Economic Census by the U.S. Census Bureau, in partnership with the Bureau of Transportation Statistics (BTS) under the Research and Innovative Technology Administration (RITA) of the Department of Transportation (DOT). The last survey was conducted in 2007, and previous years of study include 1993, 1997, and 2002. For the 2007 CFS<sup>1</sup>, approximately 102,000 establishments were selected from a universe of about 760,000 "in-scope"<sup>2</sup> establishments.

The BTS consolidates the CFS sample, estimates mileage and ton-miles<sup>3</sup> of shipments, and publishes versions of the survey for public access. The publicly available CFS data are restricted due to confidentiality and reliability constraints. Reliability refers to the level of belief in the estimates provided by the sampling method (e.g. estimates from a small sample with large variance are less reliable), whereas confidentiality corresponds to the impositions on the survey to avoid disclosure of particular industry activities. Consequently, many cells in the public database have no data. In addition, the data are aggregated to higher geographic levels. The most detailed data in this database are at the state level.

<sup>1</sup> 2007 Economic Census – Transportation, 2007 Commodity Flow Survey United States: 2007, EC07TCF-US, issued April 2010

<sup>2</sup> "In-scope" is a term used to refer to industries that are covered by the CFS sample. Conversely, an "out-scope" industry is not included in the CFS sample.

<sup>3</sup> Ton-miles is a measure of freight movement over the network. It is computed for each link (segment) of the network by simply multiplying the corresponding link flow by its length in miles.
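The ton-miles definition in footnote 3 can be written out in a few lines; the link flows and lengths below are made up for illustration.

```python
# Ton-miles: for each network link, multiply the assigned link flow (tons)
# by the link length (miles) and sum over all links. Data is illustrative.
links = [
    {"flow_tons": 1200.0, "miles": 35.0},
    {"flow_tons": 800.0, "miles": 210.0},
    {"flow_tons": 500.0, "miles": 90.0},
]

ton_miles = sum(l["flow_tons"] * l["miles"] for l in links)
print(ton_miles)  # 255000.0
```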


The CFS data is used as a base source for other freight data sources, either proprietary or in the public domain. TranSearch, one of the most well-known commercial freight databases, for example, is compiled from the CFS data, supplemented by private freight data, updated annually, and available commercially through IHS Global Insight. The TranSearch database estimates data geographically at the county level.

The Freight Analysis Framework (FAF) is a public database sponsored by the Federal Highway Administration (FHWA), which is composed of U.S. domestic and international (imports and exports) freight flows. FAF integrates CFS data (in-scope to the CFS) with data from a variety of supplemental sources for industry sectors not in the CFS (out-scope to the CFS) to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation. FAF also provides forecasts of freight flows up to 30 years into the future, as well as annual provisional updates for the current year and truck flow assignments for the base year and outlying future years. It is the third database of its kind, with FAF1 providing similar freight data products for calendar year 1997, and FAF2 for calendar year 2002. The CTA was heavily involved in the development and execution of the methodology for FAF2 and FAF3. With data from the 2007 CFS and additional sources, the CTA developed estimates of tonnage and value by commodity type, mode, origin, and destination for 2007, as well as ton-miles by mode.

The movements in FAF are characterized by freight volume (weight in thousand tons and dollar value), geographic dimension, transportation mode, and commodity type. In terms of the geographic dimension, the FAF data reports freight trade between 123 domestic zones and 8 external zones. The geographic level within the national boundaries is based on the CFS geographic strata, as shown in Figure 1, including 74 metropolitan areas, 33 remainder-of-state areas, and 16 regions identified as entire states. External zones, or foreign zones, are defined by 8 world regions: Canada, Mexico, and six other regions defined according to the United Nations geographic regions (Europe; Africa; Eastern Asia; South, Central and Western Asia; South-eastern Asia and Oceania; and the Rest of Americas). Hence a domestic flow is spatially characterized by its origin and destination zones; imports are reported by foreign origin, FAF domestic zone of entry, and FAF destination zone; and exports are reported by FAF domestic origin, FAF domestic zone of exit, and foreign destination.

In terms of commodity classification, FAF reports freight flows using the same 43 2-digit Standard Classification of Transported Goods (SCTG) classes, as reported by the CFS. Figures 2 and 3 illustrate the amount of U.S. foreign trades in 2007 estimated by FAF, by tonnage and value of goods, respectively.
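To make the structure of such flow data concrete, the sketch below tabulates a few FAF-style origin-destination records into total tonnage by mode, the kind of aggregation behind Figure 2. Field names and numbers are invented for illustration and do not reproduce the actual FAF schema or estimates.

```python
# FAF-style OD records (illustrative fields and values only).
records = [
    {"orig": "TN rem.", "dest": "Chicago", "mode": "truck",
     "sctg": "02", "ktons": 14.0, "m_dollars": 3.1},
    {"orig": "TN rem.", "dest": "Chicago", "mode": "rail",
     "sctg": "02", "ktons": 55.0, "m_dollars": 9.8},
    {"orig": "Canada", "dest": "Chicago", "mode": "truck",
     "sctg": "31", "ktons": 7.5, "m_dollars": 1.2},
]

# Total tonnage by mode -- the kind of tabulation behind Figure 2.
tons_by_mode = {}
for r in records:
    tons_by_mode[r["mode"]] = tons_by_mode.get(r["mode"], 0.0) + r["ktons"]

print(tons_by_mode)  # {'truck': 21.5, 'rail': 55.0}
```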

**Figure 1.** FAF Geographic Zones

**Figure 2.** U.S. Foreign Trades, tonnage in 2007

**Figure 3.** U.S. Foreign Trades, value in million dollars

For FAF3, the CTA added estimates of mileage and ton-miles for freight flows. These ton-miles estimates were derived using models of the network system (see Section 4.1) and freight demand models (see Section 4.2). The CTA team manages detailed geographical data and information on a large multimodal and intermodal freight network, including highways, railroads, waterways, airways, and pipelines, and their associated infrastructure (e.g., intermodal terminals, transfer points, seaports, and airports). Section 4.1 describes the construction of the geographic representation of the highway-waterway-railway network systems. Among other analyses, this geographic tool makes it possible to obtain likely routes for freight movements between geographic zones. The corresponding routing distances in miles are then used as estimates of mileages for freight movements.

Because the geographic detail of the network representation is higher than the geographic level of the FAF database, extrapolation of freight movements from a higher geographic level (state, metropolitan, and remainder-of-state level) to a lower geographic level (county level) is required in order to provide ton-mile estimates. This disaggregation of the FAF database is done using freight demand models, which associate freight activity with exogenous variables related to economic activity and network measurements (e.g. travel times, monetary costs, distances, etc.). A discussion of freight demand models is presented in Section 4.2, and some models used by the CTA are presented in Section 4.3. Section 4.4 presents how ton-miles were estimated for the 2007 FAF.
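A minimal sketch of one such disaggregation step is shown below, assuming a simple proportional allocation by an economic-activity variable. The employment figures and zone tonnage are invented, and FAF's own disaggregation relies on estimated freight demand models rather than raw shares.

```python
# Proportional disaggregation sketch: split a zone-level flow among counties
# in proportion to an economic-activity variable (illustrative data only).
zone_tons = 10000.0
county_employment = {"county_A": 50000, "county_B": 30000, "county_C": 20000}

total_employment = sum(county_employment.values())
county_tons = {
    county: zone_tons * emp / total_employment
    for county, emp in county_employment.items()
}

print(county_tons)  # {'county_A': 5000.0, 'county_B': 3000.0, 'county_C': 2000.0}
```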

## **4. Geographic and analytical tools for freight analysis**

## **4.1. CTA's intermodal and multimodal network**

The CTA team at ORNL maintains a computerized representation of the national intermodal and multimodal network system. This national network system was created by combining earlier digitized representations of the three single-mode network systems: highway, railway, and waterway (Southworth and Peterson, 2000). The following digital databases were used to construct the network:

- the ORNL National Highway Network and its extensions into the main highways of Canada and Mexico;
- the Federal Railroad Administration's (FRA) National Rail Network and its extensions into the rail lines of Canada and Mexico;
- the U.S. Army Corps of Engineers National Waterways Network;
- the ORNL constructed Trans-Oceanic Network;
- the ORNL constructed National Intermodal Terminal Database;
- a database of 5-Digit Zip-Code area locations.

The abstract representation of a network system is composed of a set of links and nodes, where links represent events (goods movements) and nodes represent connections as well as starting or ending points for events. A "line haul" link is defined as a unidirectional link with positive length formed by two endpoints. Another important abstract concept is the route, defined as a sequence of directed, connected links. The geographic scope of the analysis defines the detail of the network needed. At the national level, only main physical links and routes (e.g. interstates, arterials, railroad mainlines and branch lines, oceans and rivers, etc.) are included in the network representation.

The CTA network system was proposed with the main intention of estimating the routes, and therefore mileages, for the domestic and export shipments reported in the 1997 CFS. Therefore, in an effort to simulate all activities reported by the CFS, physical links are represented by logical links which in turn simulate different realities in each single-mode network. In the railway system, the same physical links are represented by different links to simulate not only the railroad owner but all the different railroad companies that have trackage rights over the link. Similarly, in the highway system, logical links were included to represent both for-hire and private services. The waterway system was separated into three different sub-networks, each one representing a different type of vessel and/or movement: inland and inter-coastal (largely barge traffic), Great Lakes, and trans-oceanic or "deep sea". This logical separation of the single-mode systems was important to model transfer costs between trucking services, railroad companies, and vessels.

Special links were designed to simulate the interfaces between each of the above logical sub-networks. These logical links are named "terminal links" and "interline links". The former simulate unloading/loading operations within terminals to handle goods between different vehicle types. The latter were specially designed to model the locations where goods movements are switched between two railroad companies. It is worth noting that terminals are represented by nodes in the network, and their corresponding transfer links are zero-length links connecting two logical endpoints at the same location. Interlines are also zero-length links connected at the same physical node.

Points of origination and termination of freight movements are represented by node centroids<sup>4</sup>. Such a representation is a way to simplify the network model, given that it is infeasible to include all actual locations where freight movements originate and terminate. In the current static CTA network, these nodes represent the county centroids.

The connection between generator points (centroids and terminals) and the network system is made by specific "access/egress links". Such links should therefore represent all movements connecting the real freight generators of a geographic area to the network system. For the CTA network, a computational routine has been deployed to create these links "on-the-fly", that is, every time there is a request to route a movement from an origin centroid to a destination centroid. Figure 4 illustrates how a shipment with mode sequence truck-rail-truck is routed onto the CTA network system. In simple words, a route is generated using a "shortest-path" routing algorithm that executes the following sequence of searches on the network: initiate the route by accessing the highway network (create access links), search for a connection via a truck-rail terminal to the rail sub-network, and return to the highway network via a second multimodal terminal transfer.

Although each individual layer (sub-network) of the network system can be stored and maintained in a commercial GIS, the combined multimodal network poses impediments for its use in GIS. Such impediments are related to the overlapping of geographic features (logical links of the same physical link), and the representation of many geographic features (i.e., terminal links and interlines) over a single degenerate point. Many standard GIS software packages cannot accept zero-length links, and links that meet at the same location must be connected. Therefore, to use the network in standard GIS, a special version of the CTA network has been prepared. In this version, the locations of logical nodes, along with the ending vertices of polylines incident to them, falling at the same geographic location are slightly perturbed, preventing spurious link connections and allowing slightly positive lengths for the zero-length links. A view of the CTA network in GIS is shown in Figure 5.

<sup>4</sup> Centroids are abstract representations of all generator points of a given geographic area.
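The truck-rail-truck routing described above can be illustrated with a standard shortest-path (Dijkstra) search over a toy network: highway access/egress links, a rail line haul, and zero-length terminal transfer links joining the two sub-networks. Node names, mileages, and the plain-miles impedances below are invented; the CTA network uses mode-specific impedance functions instead.

```python
# Dijkstra over a tiny multimodal network in the spirit of Figure 4.
import heapq

# (tail, head, impedance); "T1_h"/"T1_r" are the highway- and rail-side
# logical endpoints of terminal T1 (made-up node names).
links = [
    ("O", "T1_h", 40.0), ("T1_h", "T1_r", 0.0),  # access + terminal transfer
    ("T1_r", "T2_r", 500.0),                     # rail line haul
    ("T2_r", "T2_h", 0.0), ("T2_h", "D", 30.0),  # transfer + egress
    ("O", "D", 610.0),                           # all-highway alternative
]

def shortest_path_cost(links, source, target):
    """Return the minimum total impedance from source to target."""
    adj = {}
    for tail, head, cost in links:
        adj.setdefault(tail, []).append((head, cost))
    settled, heap = {}, [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in settled:
            continue
        settled[node] = cost
        for head, w in adj.get(node, []):
            if head not in settled:
                heapq.heappush(heap, (cost + w, head))
    return settled.get(target)

print(shortest_path_cost(links, "O", "D"))  # 570.0
```

With these numbers the multimodal route (40 + 0 + 500 + 0 + 30 = 570) beats the all-highway alternative (610), so the search naturally selects the terminal transfer points.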

A Primer on Recent Advancement on Freight Transportation 331

functions to represent the generalized cost of different en-route activities over network facilities (line hauls, terminal links, interlines, access and egress links). This process started with a set of what Southworth and Peterson (2000) termed "native link impedance function". Routes over the highway network are determined according to link operational speed. Therefore, highway impedances are surrogates for link travel time. In the railway network impedances are assigned on the basis of the rail line class, i.e. main lines (long-haul and high capacity lines, thus with lower impedance for movement), and branch-lines (shorthaul and less capacity, and therefore with higher impedances). Waterway links received identical native impedance due to the fact that rarely there is more than one choice in routes


**Figure 4.** Shipment routing on the CTA network system

**Figure 5.** CTA Network System in GIS

A route selection routine was developed to determine likely routes over the network for shipments with a known sequence of modes. This required the development of impedance functions to represent the generalized cost of different en-route activities over network facilities (line hauls, terminal links, interlines, access and egress links). This process started with a set of what Southworth and Peterson (2000) termed "native link impedance functions". Routes over the highway network are determined according to link operational speed; highway impedances are therefore surrogates for link travel time. In the railway network, impedances are assigned on the basis of rail line class, i.e. main lines (long-haul and high-capacity lines, thus with lower impedance for movement) and branch lines (short-haul and lower-capacity, and therefore with higher impedances). Waterway links received identical native impedances because there is rarely more than one route choice between any pair of geographic zones.

Native impedance values were allocated to terminal links and interline links in an attempt to simulate the transfer costs of unloading/loading operations at terminals and the cost of switching between railroad companies at interlines, respectively. Highway and railway access/egress links received an impedance value of 5 times their link length. It is worth noting that the lengths of such links were increased by a circuit factor<sup>5</sup> of 20%. Lengths of water access/egress links were set to zero by assuming that most originators of water movements are near the dock locations along the waterway network. However, the real link lengths were used to calculate impedances for water access links so that the waterway network could be accessed at the points closest to the centroid locations.

Given the native impedances, the next step was to determine the relative costs of transport between different modes. Although transportation costs are in reality affected by many factors (related to both cargo and mode characteristics), if we assume that the sequence of modes is known for a given shipment, the routing problem becomes one of selecting the most likely transfer points between modes. To this end, the routing algorithm also assumes that less expensive modes are preferentially used for as large a proportion of the trip as practicable, relegating more expensive modes to an access role. Therefore, the native impedances defined above were normalized to reflect differences in transportation costs between modes for multimodal movements. The normalization was such that if water was used it dominated the route miles; otherwise rail dominated, with highway usually used for terminal access and/or egress.
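The routing logic described above can be sketched as a shortest-path search in which each link's generalized cost depends on its mode. This is an illustrative sketch only: the mode weights, the toy network, and the node names are invented, and only the 5× access-link rule and the 20% circuit factor are taken from the text.

```python
import heapq

# Assumed normalized impedance weights per mile: cheaper modes get lower
# weight so they dominate route miles (water < rail < highway).
MODE_WEIGHT = {"water": 1.0, "rail": 3.0, "highway": 10.0}
CIRCUIT_FACTOR = 1.2  # access/egress lengths inflated by 20%

def link_impedance(mode, length_miles):
    """Native impedance: access links cost 5x their (inflated) length."""
    if mode == "access":
        return 5.0 * length_miles * CIRCUIT_FACTOR
    return MODE_WEIGHT[mode] * length_miles

def shortest_route(graph, origin, destination):
    """Plain Dijkstra over a (node -> [(neighbor, mode, length)]) adjacency."""
    best = {origin: 0.0}
    queue = [(0.0, origin, [origin])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, mode, length in graph.get(node, []):
            new_cost = cost + link_impedance(mode, length)
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(queue, (new_cost, nbr, path + [nbr]))
    return float("inf"), []

# Toy network: centroid -> dock -> water line haul -> dock -> centroid,
# versus a direct highway link; water wins despite the access detours.
graph = {
    "O": [("dock_O", "access", 2.0), ("D", "highway", 50.0)],
    "dock_O": [("dock_D", "water", 60.0)],
    "dock_D": [("D", "access", 2.0)],
}
cost, path = shortest_route(graph, "O", "D")
```

With these invented weights, the direct highway route costs 10 × 50 = 500 while the water route costs 12 + 60 + 12 = 84, so the search routes the shipment through the docks, mirroring the "water dominates route miles" normalization.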

#### **4.2. Freight demand models**


As opposed to demand modelling for passenger transportation, there is no universal paradigm for modelling freight demand, only individual examples. However, some of the techniques developed specifically to model inter-city and urban travel demand have also been used to model freight demand. Some of these approaches are presented in this section, most of which are based on the book by Cascetta (2009).

<sup>5</sup> The circuit factor between two points over a network is defined as the ratio between their distance over the network and their straight-line distance.

The objective of estimating freight demand models is to represent the production and distribution of goods, either for intermediate use or final consumption, for a given time period. With the aid of a freight database and economic variables related to the production and consumption of goods, a system of freight demand models can be formally expressed as

$$d_{od}[p,c,m,k] = d(A, T; \beta)\,, \tag{1}$$

A Primer on Recent Advancement on Freight Transportation 333


where *dod* represents a demand flow (usually expressed in tons) between an origin zone *o* and a destination zone *d*. The characteristics *p, c, m* and *k* are associated with *sectors of economic activity*, *commodity types*, *transportation modes* and *routes*, respectively. The *A* variables reflect the economics of production and consumption. *T* are variables related to attributes of the different transportation modes and services (times, costs, service reliability, etc.). Vector *β* denotes the model coefficients.

Models can be classified according to the assumptions about the modelling approach or to the level of data aggregation (see Cascetta, 2009). Based on the model assumptions, models can be of type *descriptive* if they merely describe empirical relationships between the exogenous variables (explicative variables related to the economic system) and the endogenous variables (response variables related to freight demand); or *behavioural* if they explain the behaviour of decision makers in choosing among the universe of choices involved in the production and distribution of goods. According to the unit of the variables available for model calibration/estimation, models can also be of type aggregate or disaggregate. Aggregate models use averages of variables related to aggregate units (e.g. all companies of a given industry sector) aggregated to the geographic zone level, whereas disaggregate models use variables related to small units within the geographic zone, such as individual companies or individual shipments.

In this section, we will discuss national freight models (aggregate and disaggregate models) for predicting annual domestic and foreign trade in the U.S. In the following discussion, the sector *p* represents the producer sector (or originating economic sector), which is consistent with the CFS database. In addition, in most of the following discussion the subscript *p* is omitted from the equations for simplicity of notation. The allocation of freight flows on the transportation system (i.e., the interaction between demand and transportation supply) is not represented in the models that will be discussed.

The model in equation (1) represents the freight demand resulting from choices made by decision makers of a given economic sector with respect to production, and the spatial and modal distribution of freight demand. Such decisions affecting the freight demand are interrelated. However, when the system of models of equation (1) represents the complete universe of choices that decision makers face, a single model may not be feasible either computationally or analytically. To simplify the analytical representation, the single model of equation (1) is usually decomposed into a sequence of sub-models, each representing a step within a sequential decision process. Partial share models can be estimated by the traditional paradigm of transportation demand modelling (the four-stage model), which consists of separately estimating generation, distribution, mode choice, and route choice models (see Ortuzar and Willumsen, 2011).

Equations (2)–(4) below present ways of decomposing the global model into sub-models:

$$d_{od}[p,c,m] = d_{o.}[pc](A)\; p[d/pco](A,T)\; p[m/pcod](A,T)\,, \tag{2}$$

$$d_{od}[p,c,m] = \gamma\; d_{o.}[pc](A)\; d_{.d}[pc](A)\; f(C_{od})\; p[m/pcod](A,T)\,, \tag{3}$$

$$d_{od}[p,c,m] = d_{od}[pc](A,T)\; p[m/pcod](A,T)\,, \tag{4}$$

where


*do.[pc](A)* is the freight production model, which gives the total production of commodity *c* of sector *p* in zone *o;* 

*p[d/pc](A,T)* is the freight spatial distribution model, which gives the fraction of goods, or the fraction of the total production of *c* of sector *p* in zone *o*, that are transported to zone *d*;

*p[m/pcod](A,T)* is the freight mode choice model, which gives the fraction of the total freight flow of commodity *c* of sector *p* between zones *o* and *d* that is transported by mode *m*;

*d.d[pc](A)* is the freight attraction model, which gives the demand for goods in the destination zones;

*dod[pc](A,T)* is a joint generation and distribution model which gives the actual spatial distribution of freight demand within a study area;

*Cod* is a variable related to the generalized cost of transportation. The generalized cost of transportation is formed by a combination of all attributes of the transportation system (performance variables, monetary costs, and different mode services) that separates zones *o* and *d*;

*f(Cod)* is an impedance function (sometimes called a friction function) that decreases with *Cod*;

*γ* is a constant parameter that should be calibrated to balance out the quantities of production and attraction between pairs of geographic zones.
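As a purely numerical illustration of the chain in equation (3), the sub-model outputs can be multiplied through for one origin-destination pair and one mode. Every figure below is invented, including the balancing constant; the point is only that the decomposition reduces to a product of a production total, an attraction total, an impedance term, and a mode share.

```python
import math

# Toy inputs (assumed, not from the chapter):
production_o = 1000.0       # d_o.[pc](A): tons produced in zone o
attraction_d = 400.0        # d_.d[pc](A): tons demanded in zone d
beta, cost_od = 0.05, 20.0  # impedance parameter and generalized cost C_od
f_od = math.exp(-beta * cost_od)  # f(C_od) = exp(-beta * C_od)
gamma = 1.0 / 150_000.0     # made-up calibrated balancing constant
mode_share_rail = 0.6       # p[m/pcod](A, T) for mode m = rail

# Equation (3): flow from o to d by rail, in tons.
flow_od_rail = gamma * production_o * attraction_d * f_od * mode_share_rail
```

In a calibrated model each factor would come from its own estimated sub-model; here they are simply plugged in to show how the shares compose.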

In the following sections, we will briefly describe each of the sub-models in equations (3)–(4) and illustrate how some of the sub-models can be formulated as functions of aggregate variables provided by the CFS regional database, as well as variables related to the transportation system from the CTA network, and variables related to a given economic pattern provided by the U.S. economic census. Below, we describe aggregate descriptive models for generation, and descriptive and behavioural models for distribution and mode choice. A descriptive model for the joint generation and distribution of freight demand is also presented.

Generation models, *do.[pc](A)* and *d.d[pc](A)*, describe how much is produced (supply) and how much is needed (demand) for a given pattern of economic activity in a geographic area. Descriptive generation models represent the empirical relationship between freight demand (produced or attracted) by geographic area and a given pattern of economic activity. Such models can be applied to short-term analysis in which the pattern of economic activity is given. In this case we try to estimate the amount of goods (in tons) generated due to decisions related to what and how much to produce. The freight demand resulting from long-term decisions, such as where to produce, is not considered in this simplified descriptive approach. Modelling long-term decisions should incorporate behavioural aspects that may require disaggregate data.

Equation (5) shows an example of a linear model for the production of goods where the units of aggregation are economic sectors within regions. The parameters of these models are usually obtained with statistical estimation methods, such as multiple regression analysis.

$$d_{o.}[p] = \sum_k \beta_k\, X_{kpo}\,, \tag{5}$$



where

Xkpo are exogenous variables related to the economic activity of sector *p* in zone *o*.

*βk* are the model coefficients to be calibrated.
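Calibrating the coefficients of equation (5) amounts to a linear regression of observed zonal production on the economic variables. A minimal sketch with invented zone-level data (two explanatory variables, no intercept, as in the equation):

```python
import numpy as np

# Invented data: one row per zone, columns are exogenous variables X_kpo
# for a single sector p, e.g. employees (thousands) and payroll ($M).
X = np.array([[12.0, 300.0],
              [ 4.0,  90.0],
              [20.0, 520.0],
              [ 7.0, 150.0]])
d_o = np.array([260.0, 78.0, 440.0, 143.0])  # observed tons produced per zone

# Ordinary least squares estimate of the beta_k coefficients.
beta, *_ = np.linalg.lstsq(X, d_o, rcond=None)
predicted = X @ beta  # fitted production d_o.[p] per zone
```

In practice the regression would be estimated per sector from the CFS/CBP data; here the numbers merely show the mechanics.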

Attraction models are not straightforward to estimate since they should represent the total amount of goods attracted to a zone that are produced by a given sector, from companies located in different zones, and used by multiple sectors in that zone. Therefore, one way to represent an aggregate attraction model is:

$$d_{.d}[p] = d(\mathbf{X}, \mathbf{W}; \beta)\,, \tag{6}$$

where

*W* is a matrix containing coefficient factors representing the industry-to-industry trade of goods. These are binary values that are equal to one whenever a sector associated with a row can provide goods to a user associated with a column, and equal to zero otherwise. These assignment coefficients can be obtained from regional or national input-output accounts;

**X** are variables related to the economic activity of those industry sectors in the destination zone that use commodities produced by sector *p*, as well as demographic variables representing final consumption of goods.

Examples of variables related to economic activity are the number of employees or total payroll by industry sector in each geographic zone. Demographic variables are represented by population or number of households within each geographic zone.
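To make equation (6) concrete, here is one possible functional form in which the binary matrix **W** selects the using sectors that receive sector *p*'s goods and the attracted tonnage is a weighted sum of their employment. The sectors, coefficients, and employment figures are all hypothetical; equation (6) only specifies the general dependence on **X**, **W**, and *β*.

```python
import numpy as np

# W[i, j] = 1 if producing sector i can supply using sector j
# (from input-output accounts).
W = np.array([[1, 0, 1],
              [0, 1, 1]])
# X[j, z] = employees of using sector j in zone z (proxy for consumption).
X = np.array([[10.0,  5.0],
              [ 2.0,  8.0],
              [ 6.0,  3.0]])
beta = np.array([4.0, 2.5, 3.0])  # calibrated coefficient per using sector

p = 0                     # producing sector of interest
users = W[p] == 1         # sectors that consume sector p's goods
# Attracted tons per zone: sum over using sectors of beta_j * X[j, z].
attraction = (beta[users] * X[users].T).sum(axis=1)
```

With these numbers, zone 0 attracts 4·10 + 3·6 = 58 tons and zone 1 attracts 4·5 + 3·3 = 29 tons of sector 0's output.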

*Distribution models* estimate the trade flow between geographic zones (or regions) of the study area. The product of such models is called an origin-destination matrix of freight flows that satisfies the "trip-end" production and attraction constraints.

The descriptive model, *γ do.[p] d.d[p] f(Cod)*, corresponds to the well-known gravity model, where *do.[p]* is the total production in zone *o* by sector *p* and *d.d[p]* is the total freight demand attracted to *d* that is produced by sector *p*. The name "gravity" comes from the model's resemblance to Newton's law of gravity. The impedance function *f(Cod)* is a monotonically decreasing function of the generalized transportation cost *Cod*. One typical expression for this function, which can be derived from the entropy maximization problem (see Wilson, 1967), is:

$$f(C_{od}) = \exp(-\beta\, C_{od}). \tag{7}$$

In the context of passenger transportation, the formulation of equation (7) is obtained by finding the most likely distribution pattern, corresponding to maximizing an *entropy function*  (see Subsection 4.3.2) subject to macro constraints on the total number of trips produced and/or attracted as well as on the overall transportation cost. The entropy function is a measure of the number of possible arrangements of individuals that gives rise to a certain distribution pattern. To extend this model to the case of freight, we would have to assume that each trip represents a unit of freight (tons) being transported from an origin to a destination point.
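A small numerical sketch of the gravity model with the exponential impedance of equation (7): the seed matrix exp(−βC) is balanced with a Furness (iterative proportional fitting) loop so that the row and column sums match the production and attraction totals, i.e. the "trip-end" constraints mentioned above. The zone totals, costs, and β are toy values, and the Furness procedure is a standard balancing method assumed here rather than taken from the chapter.

```python
import numpy as np

prod = np.array([100.0, 200.0])   # tons produced by each origin zone
attr = np.array([120.0, 180.0])   # tons attracted by each destination zone
C = np.array([[1.0, 3.0],
              [2.0, 1.5]])        # generalized costs C_od
beta = 0.8

T = np.exp(-beta * C)             # seed matrix f(C_od)
for _ in range(100):              # Furness balancing iterations
    T *= (prod / T.sum(axis=1))[:, None]  # match row (production) totals
    T *= (attr / T.sum(axis=0))[None, :]  # match column (attraction) totals
```

After balancing, `T[o, d]` gives a flow matrix consistent with both sets of trip-end totals, with cheaper pairs carrying disproportionately more flow.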

The behavioural model, *p[d/po](A,T)*, estimates the probability of choosing a destination *d* for a given industry sector and origin of shipment *o*. Therefore, this model combines the distribution and attraction models into a single formulation. Assuming that the set of alternatives is formed by all zones in the study area, a formulation for this model can be estimated on the basis of random utility theory. Under this paradigm, the most common model form is the *Multinomial Logit (MNL)* model, as expressed below:

$$p[d/po](A,T) = \exp(V_d/\theta_d) \Big/ \sum_i \exp(V_i/\theta_i)\,, \tag{8}$$

where


*Vd* is the expected utility, or systematic utility, value of choosing a destination *d* for given industry sector *p* and origin zone *o*;

*θd* represents the Gumbel distribution parameter of the perceived utility *Ud*, such that *Ud* = *Vd + εd*, where the *εd* values are the random residues, deviations from the mean value *Vd*, which are assumed to be independently identically distributed as Gumbel random variable with zero mean and scale parameter *θd*.
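Equation (8) is a scaled softmax over the systematic utilities; a minimal numerical sketch with invented utilities and a common scale parameter:

```python
import numpy as np

V = np.array([2.0, 1.0, 0.5])  # systematic utilities V_d of candidate zones
theta = 1.0                    # common Gumbel scale parameter

# MNL choice probabilities, equation (8).
expV = np.exp(V / theta)
p = expV / expV.sum()          # probabilities sum to 1
```

The zone with the highest systematic utility receives the largest probability, and a larger `theta` (more dispersion in the perceived utilities) flattens the probabilities toward a uniform split.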

The systematic utility is formulated as a function of attributes related to the characteristics of the alternatives and the decision makers. Equation (9) shows a typical linear specification.

$$V_d = \sum_k \beta_k\, X_{kd}. \tag{9}$$

The attributes of the systematic utility are grouped into attributes of the activity system in zone *d*, or attractiveness attributes; and attributes that quantify the accessibility or cost of travel between zones *o* and *d*. Attractiveness attributes are variables that measure the attractiveness of a zone as destination. As mentioned before, they might be a function of the number of employees or the total payroll for a given consumer industry, or client industry. Demographic variables such as population are also measures of attractiveness. Cost or accessibility attributes are variables reflecting the generalized cost of moving goods between *o* and *d*. They can be a straight-line distance connecting the centroids of zones *o* and *d*, or generalized cost variables that take into account the different contributions for each of the modes available between zones *o* and *d*. By explicitly showing the measures of attractiveness and transportation cost, equation (9) becomes

$$V_d = \sum_h \beta_h\, A_{hd} - \beta_c\, C_{od}\,, \tag{10}$$

where

and *Ad*;

well as other barriers;

**4.3. CTA's freight models** 

system is presented here.

*4.3.1. Generation models* 

modelling process.

April 2010.

\_cove

*ξ<sup>p</sup>* and *ζp* are the elasticities to be estimated.

*<sup>p</sup>* is a multiplier representing the overall scale of industry *p*;

*Ao* and *Ad* are the measures of production and attractiveness of zones *o* and *d*, respectively. The Gross Domestic Product (GDP) can be used as an indicator for variables *Ao*

*fpod* is a distance decay function representing the trade impeding effect or transport cost as

The CTA group developed descriptive freight generation models (production and attraction models) to predict freight demand by U.S. state resultant from domestic interstate commerce as a function of economic activity (Oliveira Neto *et al.*, 2012). Fundamentally, the industry activities, translated into capital and labour by industry sector, should be the main base to explain the economic activity within a geographic area. In addition, the CTA team uses the structure of the national freight database in GIS and network analysis tools to estimate descriptive models for freight distribution. Specifically the process of employing the gravity model for a given origin-destination matrix of freight demand and transportation supply

This section presents the estimation of static freight demand model due to the domestic trade in the U.S. for the year 2007. Two major data sources were used in this effort: 2007 CFS tabulations and the 2007 County Business Pattern (CBP). Specific CFS tabulations requested as supplement information for FAF estimation process were used to obtain sample estimates for freight productions and attractions in weight units. These tables contain the information of domestic movements of goods between U.S. states for 28 industry sectors in 2007, classified by the 3-digit North American Industry Classification System (NAICS), which are responsible for the production freight transportation. The list of these NAICS codes can be found in CFS document referenced above6 and BTS website7. It is worth noting that in this special tabulation only a small number of cells were suppressed due to likely small sample sizes and their resultant large sample variances. These cells were treated as missing information during the

<sup>6</sup> 2007 Econonic Census – Transportation, 2007 Commodity Flow Survey United States: 2007, EC07TCF-US, issued

<sup>7</sup> Source: 2007 Commodity Flow Survey – Survey Overview and Methodology – Industry Coverage, Accessed July, Available from: 2011http://www.bts.gov/publications/commodity\_flow\_survey/methodology/index.html#industry

A Primer on Recent Advancement on Freight Transportation 337

where

*Ahd* are the measures of attractiveness in zone *d.*

*Mode choice model, p[m/pod](A,T),* predict the fraction or choice probability that decision makers of a given sector *p* select mode or service *m* to ship goods from zone *o* to zone *d*. The first random utility models were formulated to analyze transportation mode choice. The MNL formulation can be then applied to predict these mode choice probabilities.

The definition of the mode choice alternatives constitutes the first step in the modelling process. In the context of freight transportation the alternatives of a mode choice model are the individual transportation modes (truck, rail, water, air, pipeline, etc.) and combinations of single modes (truck and rail, truck and water, truck and air) as described in the CFS database. Different services provided by carriers – related to delivery time, security, levels of priority, type and capacity of vehicles, suitability of vehicles to certain types of commodities, etc. – can also be used to represent elementary alternatives of transportation.

The next step corresponds to the definition of the choice set, which is the set of mutually exclusive alternatives or groups of alternatives available to a given decision maker. In this case it is possible that some alternatives are unavailable for a given decision maker and therefore should be excluded from the choice set. It may also happen that the analyst does not have exact knowledge of the alternatives available. To handle mode availability, special treatments have been proposed in the literature (see Cascetta, 2009). By assuming that all decision makers face the same set of alternatives, the MNL formulation for the choice probability is

$$p[m/pod](A,T) = \exp(V_m/\theta_m) \Big/ \sum_i \exp(V_i/\theta_i)\,. \tag{11}$$

An alternative to deal with mode availability would be to force the systematic utility to minus infinity, whenever the mode is not available, which would result in a choice probability of zero.
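The MNL probabilities of equation (11), with unavailable modes handled by forcing their systematic utility to minus infinity, can be sketched as follows. This is a minimal illustration with a single common scale parameter θ and hypothetical utility values, not the calibrated model of the chapter:

```python
import math

def mnl_probabilities(utilities, theta=1.0):
    """Multinomial logit choice probabilities, eq. (11).

    `utilities` maps mode name -> systematic utility V_m; unavailable
    modes are marked with None and receive utility -inf (probability 0).
    """
    v = {m: (-math.inf if u is None else u / theta) for m, u in utilities.items()}
    vmax = max(v.values())                       # stabilise the exponentials
    expv = {m: math.exp(x - vmax) for m, x in v.items()}
    total = sum(expv.values())
    return {m: e / total for m, e in expv.items()}

# Hypothetical utilities for an O-D pair; "air" is not available.
p = mnl_probabilities({"truck": -1.2, "rail": -1.8, "truck-rail": -1.5, "air": None})
```

Because the probabilities are a softmax of the utilities, the unavailable mode receives exactly zero probability and the remaining shares sum to one.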

*The joint generation and distribution model*, *dod[p](A,T)*, predicts the actual demand from a zone *o* to a zone *d* for a given pattern of economic activity and transportation system. This model combines the generation and distribution models into a single functional form. In some sense, such models are analogous to the gravity model, presented before, in which the number of trips (or zone mass) produced or attracted to a zone are replaced by exogenous measures of production and attractiveness of that zone. The descriptive formulation presented below is the traditional gravity model formulation in economics (Brocker *et al.*, 2011):

$$d_{od}[p](A,T) = \alpha_p\, A_o^{\xi_p} A_d^{\zeta_p} f_{pod}\,, \tag{12}$$

where

*αp* is a multiplier representing the overall scale of industry *p*;

*Ao* and *Ad* are the measures of production and attractiveness of zones *o* and *d*, respectively. The Gross Domestic Product (GDP) can be used as an indicator for variables *Ao* and *Ad*;

*fpod* is a distance decay function representing the trade impeding effect or transport cost as well as other barriers;

*ξ<sup>p</sup>* and *ζp* are the elasticities to be estimated.
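A minimal numerical sketch of equation (12), assuming a negative-exponential distance-decay function *fpod* = exp(-*βcod*) and illustrative GDP-like activity measures (all names and values hypothetical):

```python
import numpy as np

def joint_generation_distribution(A, c, alpha, xi, zeta, beta):
    """Eq. (12): d_od = alpha * A_o**xi * A_d**zeta * f_od, with an
    assumed negative-exponential decay f_od = exp(-beta * c_od)."""
    A = np.asarray(A, dtype=float)     # zonal activity (e.g. GDP), shape (n,)
    c = np.asarray(c, dtype=float)     # zone-to-zone impedance, shape (n, n)
    f = np.exp(-beta * c)              # distance-decay function
    return alpha * np.outer(A**xi, A**zeta) * f

A = [120.0, 80.0, 60.0]                           # illustrative activity measures
c = [[0, 300, 500], [300, 0, 250], [500, 250, 0]] # illustrative distances (miles)
D = joint_generation_distribution(A, c, alpha=0.01, xi=0.9, zeta=0.8, beta=0.002)
```

With these inputs the predicted flow from zone 0 to the nearer, larger zone 1 exceeds the flow to the farther, smaller zone 2, as the elasticities and decay imply.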

## **4.3. CTA's freight models**

The CTA group developed descriptive freight generation models (production and attraction models) to predict freight demand by U.S. state resulting from domestic interstate commerce as a function of economic activity (Oliveira Neto *et al.*, 2012). Fundamentally, industry activities, translated into capital and labour by industry sector, should be the main basis for explaining the economic activity within a geographic area. In addition, the CTA team uses the structure of the national freight database in GIS and network analysis tools to estimate descriptive models for freight distribution. Specifically, the process of employing the gravity model for a given origin-destination matrix of freight demand and transportation supply system is presented here.

#### *4.3.1. Generation models*

This section presents the estimation of a static freight demand model due to domestic trade in the U.S. for the year 2007. Two major data sources were used in this effort: 2007 CFS tabulations and the 2007 County Business Patterns (CBP). Specific CFS tabulations, requested as supplementary information for the FAF estimation process, were used to obtain sample estimates for freight productions and attractions in weight units. These tables contain the information on domestic movements of goods between U.S. states for the 28 industry sectors responsible for the production of transported freight in 2007, classified by the 3-digit North American Industry Classification System (NAICS). The list of these NAICS codes can be found in the CFS document referenced above<sup>6</sup> and on the BTS website<sup>7</sup>. It is worth noting that in this special tabulation only a small number of cells were suppressed due to likely small sample sizes and their resultant large sample variances. These cells were treated as missing information during the modelling process.

<sup>6</sup> 2007 Economic Census – Transportation, 2007 Commodity Flow Survey, United States: 2007, EC07TCF-US, issued April 2010.

<sup>7</sup> Source: 2007 Commodity Flow Survey – Survey Overview and Methodology – Industry Coverage, Accessed July 2011, Available from: http://www.bts.gov/publications/commodity\_flow\_survey/methodology/index.html#industry\_cove

The CBP is an annual series that provides sub-national economic data by industry. The series is useful for studying the economic activity of small areas and for analyzing economic changes over time. In addition, it can be used as a benchmark for statistical series, surveys, or other databases between economic censuses. The survey covers most of the U.S. economic activity, except self-employed individuals, employees of private households, railroad employees, agricultural production employees, and most government employees. The database provides information on payroll, number of establishments and number of employees for industries classified, since 1998, according to NAICS. The variable chosen to represent the economic activity of a given geographic area was the annual payroll by industry sector.

In the modelling process, it was assumed that the origin zones (U.S. states) are producers, where the products or commodities (raw materials or final products) are obtained or produced. In contrast, the destination zones are considered the "users" where the products or commodities are used/assembled/modified by intermediate industry sectors. Both production and attraction models by industry sector were specified as a power model with a single explanatory variable by the following equation:

$$y_{pi} = \alpha\, x_{pi}^{\beta}\,, \tag{13}$$

where,

*ypi* denotes the response variable (shipment tons by industry sector *p* and state *i*);

*xpi* denotes the exogenous variable (annual payroll by industry sector *p* and state *i*);

*α* and *β* are the model parameters, to be estimated.

Production and attraction equations have been estimated for each of the 28 industry sectors. With respect to production equations, the response variables in CFS represent total demand produced by a given industry sector and state. In this case the exogenous variable was the corresponding payroll by industry and state. As for the attraction equations, the response variable represents the total demand attracted by a state that was originally produced by one of the industry sectors. Therefore, the total demand attracted is not separated by the industries that use the goods; rather it is classified by the industry sectors responsible for production. To be consistent with the CFS data, the single exogenous variable, or explanatory variable, in the attraction equations is composed of the payroll by state for those industries that use the commodities produced by the originated sectors. Figure 6 shows a scatter plot of freight produced versus total payroll for industry sector 311 – Food Manufacturing. The data points are sample observations for the 50 U.S. states and the District of Columbia. The fitted curve and estimated production equation are also presented in the chart.
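The power model of equation (13) is linear in logarithms, log *y* = log *α* + *β* log *x*, so its parameters can be estimated by ordinary least squares on log-transformed data. The sketch below uses synthetic payroll data for illustration; the authors' actual CFS/CBP estimation may differ:

```python
import numpy as np

# Synthetic data: 51 "states" with payroll x and tonnage y generated from a
# known power law y = alpha * x**beta with multiplicative lognormal noise.
rng = np.random.default_rng(0)
x = rng.uniform(1e3, 1e6, size=51)              # hypothetical payroll values
true_alpha, true_beta = 0.5, 0.9
y = true_alpha * x**true_beta * np.exp(rng.normal(0.0, 0.1, size=51))

# OLS on the log-log transform: log y = log alpha + beta * log x.
beta_hat, log_alpha_hat = np.polyfit(np.log(x), np.log(y), 1)
alpha_hat = np.exp(log_alpha_hat)
```

The recovered `beta_hat` should be close to the elasticity used to generate the data; the intercept exponentiates back to the scale parameter `alpha_hat`.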

It is worth noting that Oliveira Neto et al. (2012) compared the structure of the 2007 models with models estimated based on similar 2002 CFS tabulations. The results of their empirical analysis indicated some structural change in the production and attraction models between 2002 and 2007 due to a possible increase in productivity and significant economic growth over this short time period. In addition, even when no significant change in model structure was detected, the modelling process did not result in reasonable predictions of freight for a future year. After applying the 2002 models to predict freight volumes in 2007, it was found that there may be other factors besides payroll that ought to be included in the modelling process.

**Figure 6.** Fitted production curve for industry sector 311 – Food Manufacturing

#### *4.3.2. Distribution models*

In this section we illustrate how a gravity model can be estimated employing the same 2007 CFS tabulation used for estimating the freight generation models earlier. As seen in Section 4.2, in transportation planning the gravity model is used to balance an origin-destination trip matrix so that the zone-to-zone flows are consistent with the total trips generated at each origin zone and the total demand terminating at each destination zone. The model form that will be estimated for this exercise is

$$T_{pij} = \alpha_{ij}\, P_{pi}\, A_{pj} \exp(-\beta_p c_{ij})\,, \tag{14}$$

where *Tpij* is the annual freight flow in thousand tons to be estimated; *Ppi* is the total amount of freight (thousand tons) produced in state *i* by sector *p*; *Apj* is the total amount of freight (thousand tons) attracted to state *j* from the originating industry *p*; *cij* is a measure of the distance that separates states *i* and *j*; *βp* is the model parameter, the value of which indicates how important the distance variable is in explaining the trade between zones *i* and *j* for a given producer *p*; and *αij* are adjustment factors that should be calibrated to balance the freight flows by assuring that *∑jTpij = Ppi* and *∑iTpij = Apj* for all *i* and *j*.

The formulation in Equation (14) is known as the classical gravity model. As demonstrated by Wilson (1967), it is derived by applying the method of Lagrange multipliers to the following optimization problem:

$$\text{Maximize } \xi = -\sum_{ij}\left(T_{pij}\log T_{pij} - T_{pij}\right), \tag{15}$$

subject to

$$\sum_j T_{pij} = P_{pi}\,, \text{ for all } i\,, \tag{16}$$

$$\sum_i T_{pij} = A_{pj}\,, \text{ for all } j\,, \tag{17}$$

$$\sum_i \sum_j T_{pij}\, c_{ij} = C_p\,, \tag{18}$$

where the objective function in equation (15) is a monotonic function, often referred to as the *entropy function*; Equations (16) and (17) are constraints representing our knowledge about the total productions and attractions per zone; Equation (18) is a constraint corresponding to our knowledge about the total expense, denoted by *Cp*, in using the network system by industry sector *p*. If we measure all *cij* in miles, *Cp* is therefore a measure of the total annual ton-miles loaded on the network system for a given industry sector.

It is important to mention that if the parameter *βp* is known in Equation (14), the values *exp(-βpcij)* become reference factors, or the elements of an a priori trip matrix, and the model reduces to what is called in the literature the ordinary gravity model. In this case the adjustment factors *αij* can be estimated by a bi-proportional matrix balancing method, also known as "iterative proportional fitting". This balancing method was apparently first described by Kruithof (1937), who used the model for prediction of telephone traffic distribution (see Lamond and Stewart, 1981). In transportation planning this method is referred to as the "Fratar Method" in the U.S. or the "Furness Method" (see Furness, 1965) elsewhere.
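The bi-proportional balancing just described can be sketched as follows: alternately scale the rows and columns of a seed matrix exp(-*βcij*) until the production and attraction margins are matched. The data are illustrative, not from the CFS:

```python
import numpy as np

def furness(seed, P, A, tol=1e-9, max_iter=1000):
    """Bi-proportional (Furness/Fratar) balancing of a seed matrix so that
    row sums match productions P and column sums match attractions A."""
    T = np.asarray(seed, dtype=float).copy()
    P, A = np.asarray(P, float), np.asarray(A, float)
    for _ in range(max_iter):
        T *= (P / T.sum(axis=1))[:, None]      # match row totals (productions)
        T *= (A / T.sum(axis=0))[None, :]      # match column totals (attractions)
        if np.allclose(T.sum(axis=1), P, rtol=tol):
            break
    return T

# Illustrative 2-zone example; note sum(P) must equal sum(A).
c = np.array([[1.0, 3.0], [3.0, 1.0]])
P = np.array([100.0, 50.0])
A = np.array([80.0, 70.0])
T = furness(np.exp(-0.5 * c), P, A)
```

After convergence the balanced matrix reproduces both margins exactly, which is precisely the role of the adjustment factors *αij* in Equation (14).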

Since we do not have complete information about the total expenditure *Cp*, the estimation of *βp* cannot be done by directly solving problem (15)-(18). If *Cp* were known, it can be shown that a unique solution of the problem could be found by solving a system of equations in the flows *Tpij* and the Lagrange multipliers, one for each constraint. Approximation methods have been proposed for estimating *βp* when the overall system expenditure is unknown. The method devised by Hyman (1969) was used to obtain estimates of the parameters *βp* for each industry sector listed in Table 1. Hyman's method is an iterative procedure based on successive applications of matrix balancing techniques for a given sequence of estimates of *βp*, which are appropriately readjusted at each iteration (see Ortúzar and Willumsen, 2011).
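A simplified, illustrative version of a Hyman-style calibration is sketched below: starting from the conventional initial value *β*₀ = 1.5/*c*\*, repeatedly balance the matrix and rescale *β* by the ratio of the modelled mean cost to a target mean cost *c*\*. This is a sketch of the basic idea with made-up data, not the exact procedure applied by the authors:

```python
import numpy as np

def furness(seed, P, A, iters=200):
    """Inner bi-proportional balancing step (fixed iteration count)."""
    T = seed.copy()
    for _ in range(iters):
        T *= (P / T.sum(axis=1))[:, None]
        T *= (A / T.sum(axis=0))[None, :]
    return T

def hyman_beta(P, A, c, c_star, iters=40):
    """Calibrate beta so the mean cost of the balanced matrix matches c_star."""
    beta = 1.5 / c_star                        # Hyman's suggested starting value
    for _ in range(iters):
        T = furness(np.exp(-beta * c), P, A)
        c_model = (T * c).sum() / T.sum()      # modelled mean trip cost
        beta *= c_model / c_star               # simple proportional update
    return beta

# Illustrative 3-zone problem (sum(P) == sum(A)) with a feasible target cost.
P = np.array([100.0, 50.0, 70.0])
A = np.array([90.0, 60.0, 70.0])
c = np.array([[50.0, 200.0, 400.0],
              [200.0, 60.0, 250.0],
              [400.0, 250.0, 80.0]])
beta = hyman_beta(P, A, c, c_star=150.0)
```

Each iteration raises *β* when modelled trips are too long on average and lowers it when they are too short, so the balanced matrix converges toward the target mean cost.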

To estimate the model of Equation (14), it is first necessary to obtain a matrix of distances *cij* for travelling between origin and destination zones. In this exercise we assumed that the impedance to travel between a pair of U.S. states is determined by the average distance in miles of the set of paths on the CTA network system between all corresponding pairs of contiguous counties, as expressed by

$$c_{ij} = \sum_{rs} d_{rs} \,/\, n\,, \quad \text{for all } r \in i \text{ and } s \in j\,, \tag{19}$$

where


*r* denotes a county centroid within origin state *i* and *s* denotes a county centroid within destination state *j*;

*n* is the number of possible pair-wise combinations of counties that exists between zones *i* and *j*;

*drs* is the average distance in miles to travel on the CTA's network between the county centroids *r* and *s*.
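Equation (19) is a plain average over county-centroid pairs; with hypothetical centroid identifiers and network distances it can be computed as:

```python
from itertools import product

def average_state_distance(counties_i, counties_j, d):
    """Eq. (19): c_ij as the mean of network distances d[r, s] over all
    county-centroid pairs (r in state i, s in state j)."""
    pairs = list(product(counties_i, counties_j))
    return sum(d[r, s] for r, s in pairs) / len(pairs)

# Hypothetical centroid-to-centroid network distances in miles.
d = {("a1", "b1"): 120.0, ("a1", "b2"): 180.0,
     ("a2", "b1"): 150.0, ("a2", "b2"): 210.0}
c_ij = average_state_distance(["a1", "a2"], ["b1", "b2"], d)  # -> 165.0
```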

The routes between county centroids are determined in terms of the impedance functions defined in Section 4.1. For the highway and the waterway systems, distances between counties are determined on the basis of shortest paths, so that one single route is found for each pair of counties. When railway is available, the distance is estimated as an average distance calculated from the set of shortest routes obtained from all possible combinations of available railroad carriers. Recall that since the sequence of modes is unknown in the CFS tabulation used for calibration, the resulting distance between counties will rely predominantly on water modes, whenever available, followed by railway distances, with highway used for short hauls.

After applying Hyman's method using the 2007 CFS tabulation and the average distance matrix described above, we obtained the parameters listed in Table 1. A statistic for comparing the set of estimated flows {*Tpij*} and the set of observed flows {*Npij*} is also provided in Table 1. This measure, Equation (20), dubbed the standardized root mean square (s.r.m.s.), was proposed by Pitfield (1978) as an alternative that deals with sparse matrices and also considers the scale of the variables involved. However, it does not have any statistical property. It is merely a descriptive measure of goodness-of-fit.

$$\text{s.r.m.s.}_p = \Big[ \sum_i \sum_j (T_{pij} - N_{pij})^2 / M \Big]^{1/2} \Big/ \Big( \sum_i \sum_j N_{pij} / M \Big)\,, \tag{20}$$

where M is the number of cells in the estimated set {*Tpij*}.
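Equation (20) can be computed directly; a small sketch with made-up flow matrices:

```python
import numpy as np

def srms(T, N):
    """Eq. (20): standardized root mean square of estimated flows T
    against observed flows N, scaled by the mean observed flow."""
    T, N = np.asarray(T, float), np.asarray(N, float)
    M = T.size                                  # number of cells in {T_pij}
    rms = np.sqrt(((T - N) ** 2).sum() / M)
    return rms / (N.sum() / M)                  # divide by the mean observed flow

# Hypothetical 2x2 estimated and observed flow matrices.
T = np.array([[10.0, 5.0], [4.0, 11.0]])
N = np.array([[12.0, 4.0], [5.0, 9.0]])
value = srms(T, N)  # ≈ 0.211
```

Because the statistic is scaled by the mean observed flow, it remains comparable across industry sectors whose total tonnages differ by orders of magnitude.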

Table 1 also shows the *R2* statistics of the models and the total freight flows in thousands of tons for each industry sector. Note that conclusions about a model's goodness-of-fit should not be made solely on the basis of the *R2* statistic, as systematic errors and differences in variances cannot be captured by it.

### **4.4. Ton-miles for U.S. freight movements**

This section presents two methods for estimating the ton-miles of freight movement over the U.S. network system (within the U.S. boundaries). Based on the freight network and demand models described so far, two procedures can be identified to estimate ton-miles for freight movements:

- Shipment allocation for a given sequence of modes;
- Prediction of freight flows and average mileage between counties using CTA's network system.
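In the first procedure, once shipments have been allocated to links, the system ton-miles by mode reduce to the link-based aggregation *Cm* = ∑*a* *Tma* *dma* of Equation (21); a sketch with hypothetical link loads:

```python
# Hypothetical per-link tonnage and link length after shipment allocation.
links = [
    {"mode": "rail",  "tons": 1200.0, "miles": 300.0},
    {"mode": "rail",  "tons": 800.0,  "miles": 150.0},
    {"mode": "truck", "tons": 500.0,  "miles": 40.0},
]

def ton_miles_by_mode(links):
    """Eq. (21): C_m = sum over links a of T_ma * d_ma, grouped by mode."""
    totals = {}
    for link in links:
        totals[link["mode"]] = totals.get(link["mode"], 0.0) + link["tons"] * link["miles"]
    return totals

C = ton_miles_by_mode(links)  # rail: 480000.0, truck: 20000.0
```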



When a detailed description of shipments is not available, which is the case for the public CFS, the second procedure may be used. In general, the method is designed to estimate freight flows between U.S. counties as well as the mileages over the most likely routes connecting county centroids. With respect to the estimation of freight flows, freight models based on aggregated data (freight movements, and economic activity) are projected and applied to estimate freight movements in more disaggregated geographic level, based on economic disaggregated data. This process is called disaggregation of freight data. In our case, such disaggregation procedure relied on the estimation of separated nationwide generation and distribution models by industry sector as discussed in Section 4.3. Note that

*Cm* denotes the overall ton-miles by mode of transportation *m*, as listed in Table 1;

*Tma* denotes the total freight tons through a link *a*;

**Figure 7.** 2007 annual freight flow map by mode in U.S.

*dma* is the distance in miles over a link *a*;

sub-network.

where

342 Application of Geographic Information Systems

**Table 1.** Gravity model parameters and goodness-of-fit statistics

The first alternative was the one proposed by Southworth and Peterson (2000) to allocate CFS shipments on the network system when the detailed sequence of modes is given. This method can be used to allocate freight movements onto the U.S. network system resultant from both U.S. domestic and foreign trade as long as the sequence of modes and the main geographic references (i.e. origin, destination and U.S. ports of entry/exit) are provided. Note that in CFS a shipment is characterized by its volume (weight and monetary value), the Zip Codes for origin and destination, and the sequence of modes used, as well as port of exit and foreign cities for exports. As discussed in Section 4.1 routing procedures were developed to find the most likely route and transfer points for a shipment with given sequence of modes. Such a computational tool can be used to allocate shipper-based databases onto the CTA network system to provide estimates of link ton-miles and, subsequently, the overall system ton-miles, with Equation (21), by transportation mode. As an illustration, Figure 7 shows an example of freight flows for shipments on the main railroad lines (i.e. Class I railroads), on the highway system, and on the inland waterway sub-network.

$$C_m = \sum_{a} T_{ma} \, d_{ma} \,, \tag{21}$$

where


*Cm* denotes the overall ton-miles by mode of transportation *m*, as listed in Table 1;

*Tma* denotes the total freight tons through a link *a*;

*dma* is the distance in miles over a link *a*;
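Equation (21) amounts to a weighted sum over the assigned link flows. The following minimal sketch illustrates the calculation; the link records and their layout are hypothetical, not the CTA database format:

```python
# Ton-miles by mode from assigned link flows (Equation 21):
# C_m = sum over links a of T_ma * d_ma

# Hypothetical link records: (mode, tons assigned to link, link length in miles)
links = [
    ("truck", 1200.0, 35.2),
    ("truck",  800.0, 12.7),
    ("rail",  5000.0, 110.4),
    ("water", 9000.0, 56.1),
]

def ton_miles_by_mode(links):
    """Accumulate T_ma * d_ma over all links, grouped by mode."""
    totals = {}
    for mode, tons, miles in links:
        totals[mode] = totals.get(mode, 0.0) + tons * miles
    return totals

print(ton_miles_by_mode(links))
# truck: 1200*35.2 + 800*12.7 = 52,400 ton-miles
```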

**Figure 7.** 2007 annual freight flow map by mode in the U.S.

| **2002 NAICS industry** | **Total shipment (thousand tons)** | *βp* | *R²* | *s.r.m.s.p* |
|---|---|---|---|---|
| 212 | 3,638,114 | 0.0081 | 0.82 | 3.08 |
| 311 | 568,907 | 0.0027 | 0.94 | 2.12 |
| 312 | 143,523 | 0.0043 | 0.97 | 1.75 |
| 313 | 8,937 | 0.0020 | 0.81 | 4.64 |
| 314 | 6,939 | 0.0016 | 0.99 | 2.52 |
| 315 | 1,421 | 0.0007 | 0.78 | 5.57 |
| 316 | 584 | 0.0002 | 0.13 | 11.02 |
| 321 | 218,808 | 0.0039 | 0.92 | 1.74 |
| 322 | 166,443 | 0.0024 | 0.87 | 2.42 |
| 323 | 33,588 | 0.0022 | 0.85 | 2.67 |
| 324 | 1,415,085 | 0.0072 | 0.98 | 2.02 |
| 325 | 594,228 | 0.0028 | 0.97 | 2.42 |
| 326 | 66,709 | 0.0018 | 0.90 | 2.86 |
| 327 | 1,060,895 | 0.0103 | 0.97 | 2.19 |
| 331 | 201,312 | 0.0024 | 0.82 | 3.39 |
| 332 | 118,304 | 0.0032 | 0.95 | 1.87 |
| 333 | 40,484 | 0.0011 | 0.62 | 3.90 |
| 334 | 5,307 | 0.0009 | 0.78 | 4.94 |
| 335 | 18,736 | 0.0008 | 0.52 | 4.18 |
| 336 | 93,962 | 0.0023 | 0.92 | 3.10 |
| 337 | 18,635 | 0.0017 | 0.87 | 3.41 |
| 339 | 10,857 | 0.0008 | 0.21 | 5.38 |
| 423 | 1,361,118 | 0.0057 | 0.95 | 2.02 |
| 424 | 2,244,332 | 0.0061 | 0.94 | 2.15 |
| 454 | 55,654 | 0.0074 | 0.77 | 5.62 |
| 493 | 187,175 | 0.0049 | 0.98 | 1.43 |
| 511 | 11,825 | 0.0055 | 0.86 | 3.38 |
| 551 | 250,216 | 0.0038 | 0.89 | 3.07 |

**Table 1.** Gravity model parameters and goodness-of-fit statistics

When a detailed description of shipments is not available, which is the case for the public CFS, the second procedure may be used. In general, the method is designed to estimate freight flows between U.S. counties as well as the mileages over the most likely routes connecting county centroids. With respect to the estimation of freight flows, freight models estimated from aggregated data (freight movements and economic activity) are applied to estimate freight movements at a more disaggregated geographic level, based on disaggregated economic data. This process is called disaggregation of freight data. In our case, the disaggregation procedure relied on the estimation of separate nationwide generation and distribution models by industry sector, as discussed in Section 4.3. Note that

those models in Section 4.3 only represent U.S. domestic trade and may not be used for estimating freight movements on the U.S. network resulting from foreign trade between the U.S. and external zones (trading countries). Nevertheless, as we will see, such models have been used to estimate ton-miles on the U.S. system for the 2007 FAF database resulting from both domestic and foreign trade.


As seen in Subsection 4.3.1, a single variable (i.e. state payroll by industry sector) was used to explain the effect of economic activity on internal freight demand. In that process, a set of 28 production and attraction equations was estimated. With these models, freight productions and attractions were estimated at the county level (using the county payroll by industry sector). Such estimates were then used as weights to disaggregate each of the total FAF origin-destination flows to the corresponding county productions and attractions. These final estimates were then used as marginal totals for the distribution models. Regarding the distribution models, gravity models with an exponential deterrence function were estimated for each industry sector. The variable (the argument of the deterrence function) used to represent the resistance to travelling between zones was an estimate of the average distance to travel between states on the CTA's network system (see Subsection 4.3.2). Using the estimated model parameters, freight movements between counties by the 7 classes of modes listed in Table 1 can then be obtained.

The following steps describe the FAF disaggregation procedure: a) the FAF database is organized by industry sector and mode, and its flows classified by SCTG are grouped into the corresponding 28 NAICS groups; b) for each FAF origin-destination pair (classified by NAICS and mode), productions and attractions by industry sector are estimated (using the freight generation models) for the counties within the corresponding FAF origin and destination zones, respectively; c) using the estimated values from b) as weights, the FAF flow is then disaggregated to generate the productions and attractions for the corresponding counties; d) a matrix balancing procedure (i.e. a gravity model by industry sector) is applied to distribute the estimated total productions and attractions from c) in order to generate the freight flows between the corresponding counties. The set of Equations (22)-(26) summarizes the disaggregation process.

$$w_{pur} = x_{pr}^{\mu_p} \Big/ \sum_{r} \left( x_{pr}^{\mu_p} \right), \text{ for all } r \text{ within } u \,, \tag{22}$$

$$w_{pvs} = x_{ps}^{\gamma_p} \Big/ \sum_{s} \left( x_{ps}^{\gamma_p} \right), \text{ for all } s \text{ within } v \,, \tag{23}$$

$$P_{pmr} = T_{pmuv} \, w_{pur} \,, \tag{24}$$

$$A_{pms} = T_{pmuv} \, w_{pvs} \,, \tag{25}$$

$$T_{pmrs} = \alpha_{prs} \, P_{pmr} \, A_{pms} \exp(-\theta_p \, d_{mrs}) \,, \tag{26}$$

where

*Tpmuv* is the FAF freight flow between FAF domestic zones *u* and *v* by mode *m* to be disaggregated;

*wpur* is the estimated fraction of the total freight produced by sector *p* in FAF zone *u* that is generated by county *r*;

*wpvs* is the estimated fraction of the total freight produced by sector *p* and terminated in FAF zone *v* that is attracted to county *s*;

*Ppmr* and *Apms* are the estimates of freight productions and attractions in counties *r* and *s*, respectively, shipped and delivered by mode *m*;

*μp, γp* and *θp* are the estimated model parameters by sector *p*;



*αprs* are the county adjustment factors calibrated for each industry sector *p*;

*dmrs* is the distance in miles to travel over the modal network denoted by *m*.
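The chain of Equations (22)-(26) can be sketched for a single FAF origin-destination flow as follows. Every number here is made up for illustration, and the calibrated adjustment factors *αprs* are replaced by a single rescaling that restores the marginal total:

```python
import math

# Disaggregate one FAF flow T_pmuv (sector p, mode m, zone u -> zone v)
# into county-to-county flows, following Equations (22)-(26).

T_pmuv = 1000.0                            # FAF flow in tons (hypothetical)
mu_p, gamma_p, theta_p = 0.8, 0.9, 0.003   # sector-p parameters (illustrative)

x_orig = {"r1": 50.0, "r2": 30.0}          # county payroll, counties r within zone u
x_dest = {"s1": 40.0, "s2": 60.0}          # county payroll, counties s within zone v
d_mrs = {("r1", "s1"): 120.0, ("r1", "s2"): 300.0,
         ("r2", "s1"): 200.0, ("r2", "s2"): 150.0}   # network miles

# Equations (22)-(23): production and attraction weights
wp_raw = {r: x ** mu_p for r, x in x_orig.items()}
wv_raw = {s: x ** gamma_p for s, x in x_dest.items()}
w_pur = {r: v / sum(wp_raw.values()) for r, v in wp_raw.items()}
w_pvs = {s: v / sum(wv_raw.values()) for s, v in wv_raw.items()}

# Equations (24)-(25): county productions and attractions
P_pmr = {r: T_pmuv * w for r, w in w_pur.items()}
A_pms = {s: T_pmuv * w for s, w in w_pvs.items()}

# Equation (26): gravity distribution with exponential deterrence, then one
# rescaling so the distributed tons sum back to T_pmuv (a stand-in for the
# calibrated alpha_prs factors)
T_pmrs = {(r, s): P_pmr[r] * A_pms[s] * math.exp(-theta_p * d_mrs[(r, s)])
          for r in P_pmr for s in A_pms}
scale = T_pmuv / sum(T_pmrs.values())
T_pmrs = {rs: t * scale for rs, t in T_pmrs.items()}

print(round(sum(T_pmrs.values()), 6))   # flows are conserved: 1000.0
```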

The set of modal distances to travel over the U.S. network system is the second piece of information necessary for estimating freight ton-miles. To this end, the CTA network system is used for estimating the mileages *dmrs* for likely freight movements between county centroids. In this case, distances over the U.S. network are estimated for each mode class (see FAF3, 2007, for a description of mode categories), as follows:

- For *truck*, *rail* and *water* modes, the distance between a pair of county centroids is determined based on the most likely route over the corresponding single-mode system (highway, railway and waterway networks);
- Distances for *air* are estimated on the basis of the Great Circle Distance (GCD), which is the distance along the earth's circumference between any two geographic points;
- The distances over the highway network are used as surrogates for *pipeline* distances;
- For the *multiple modes & mail* category, distances are estimated by finding the most likely routes for a given *truck-rail-truck* mode sequence;
- For the *other & unknown modes* category, distances are determined by finding the most likely routes over the multimodal and intermodal network system, similarly to that described in Subsection 4.3.2;
- The *no domestic mode* category does not use the domestic U.S. network and is therefore ignored in the ton-miles calculation.
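The Great Circle Distance used for the *air* category can be computed with the standard haversine formula; the coordinates below (roughly Knoxville, TN and Chicago, IL) are illustrative stand-ins for county centroids:

```python
import math

def great_circle_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    """Haversine distance along the earth's circumference, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_miles * math.asin(math.sqrt(a))

# Illustrative centroid pair (approx. Knoxville, TN to Chicago, IL),
# about 455 miles along the great circle
print(round(great_circle_miles(35.96, -83.92, 41.88, -87.63), 1))
```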
With the disaggregated estimation of flows between counties and the county distances by mode, ton-miles for freight movements by mode can be calculated as follows

$$C_m = \sum_{p} \sum_{r} \sum_{s} T_{pmrs} \, d_{mrs} \,. \tag{27}$$
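For a single mode, Equation (27) reduces to multiplying each county-to-county flow by its network distance and summing over sectors and county pairs. A toy sketch with made-up flows (the NAICS labels and all values are illustrative only):

```python
# C_m = sum_p sum_r sum_s T_pmrs * d_mrs   (Equation 27), for one mode m
# T[p][r][s]: tons by sector p between counties r and s (hypothetical)
T = {
    "NAICS_311": {"r1": {"s1": 100.0, "s2": 50.0},
                  "r2": {"s1": 20.0,  "s2": 80.0}},
    "NAICS_324": {"r1": {"s1": 10.0,  "s2": 5.0},
                  "r2": {"s1": 40.0,  "s2": 60.0}},
}
d = {"r1": {"s1": 120.0, "s2": 300.0},   # miles over the modal network
     "r2": {"s1": 200.0, "s2": 150.0}}

C_m = sum(tons * d[r][s]
          for flows in T.values()
          for r, by_dest in flows.items()
          for s, tons in by_dest.items())
print(C_m)   # ton-miles for mode m: 62,700
```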

Table 2 shows comparisons of the ton-miles estimates based on FAF3 and other data sources over the major U.S. freight transportation systems: highway, railway, and waterway. In addition to the 2007 CFS, other sources of information include: a) *2007 U.S. freight railroad statistics*<sup>8</sup> from the Association of American Railroads (AAR), and b) *2007 waterborne commerce*<sup>9</sup>

<sup>8</sup> For more information see AAR's report at: http://www.aar.org/~/media/aar/Industry%20Info/AAR%20Stats%202010%200524.ashx

<sup>9</sup> More information about the waterborne commerce can be found at: http://www.ndc.iwr.usace.army.mil/wcsc/pdf/wcusnatl08.pdf

of the U.S. Army Corps of Engineers (USACE) Navigation Data Center. Note that the *2007 U.S. freight railroad statistics* report covers all freight activities of all North American railroad companies using the U.S. system. The *2007 waterborne commerce* data is based on a complete census of all cargo moved via the U.S. waterway system (except military cargo carried on military ships). Besides the CFS, there is no other current source for comparison of highway movements.


As expected, all FAF3 estimates are larger than the corresponding CFS estimates, due to the inclusion of freight activities for out-of-scope industries and import movements in the FAF3 database. Both FAF3 estimates of ton-miles for rail and water are consistent with the estimates reported by the modal data programs, AAR and USACE, respectively, with a 2% difference for waterway movements and about 5% for railway movements. The discrepancy observed for railway movements is likely due to the following reasons. The FAF3 database does not account for trans-shipments through the U.S. from Canada (e.g., Canada-Mexico movements), which are estimated to be about 15 billion ton-miles annually. Furthermore, the AAR report includes some movements of empty containers and the weight of containers for mixed freight (estimated to be about 60 billion ton-miles), which are not considered in the FAF3 database. With these two considerations, the gap between the FAF3 and AAR estimates can be reduced to less than 1%.
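The rail reconciliation described above can be verified with a quick calculation (values in billions of ton-miles, taken from Table 2 and the adjustments cited in the text):

```python
faf3_rail = 1732          # FAF3 rail ton-miles (billions)
aar_rail = 1820           # AAR, all rail activities (billions)
canada_transship = 15     # Canada trans-shipments missing from FAF3 (billions)
empty_containers = 60     # in the AAR report but outside FAF3 scope (billions)

raw_gap = (aar_rail - faf3_rail) / aar_rail
adjusted = faf3_rail + canada_transship + empty_containers
adjusted_gap = (aar_rail - adjusted) / aar_rail

print(f"raw gap: {raw_gap:.1%}, adjusted gap: {adjusted_gap:.1%}")
# raw gap: 4.8%, adjusted gap: 0.7% (i.e. under 1%)
```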


| **Sub-system** | **Data source / Modes** | **Ton-miles on U.S. network (billions)** |
|---|---|---|
| Highway | *FAF3 (truck single mode only)* | 2,420 |
| | *2007 CFS (truck single mode only)* | 1,342 |
| Railway | *FAF3 (rail single mode plus rail portion of multiple modes)* | 1,732 |
| | *2007 CFS (rail single mode and portion of multiple modes which includes rail)* | 1,530 |
| | *2007 AAR report (all rail activities)* | 1,820 |
| Waterway | *FAF3 (water and the water portion of multiple modes)* | 495 |
| | *2007 CFS (water and the portion of multiple modes which includes water)* | 348 |
| | *2007 USACE waterborne commerce (all water activities)* | 506 |

**Table 2.** 2007 ton-miles estimates by truck, rail and water modes

## **5. Concluding remarks**

This chapter presented a framework for national freight data suitable for national decisions with respect to movements of goods in the U.S. The CTA team at ORNL developed and maintains a comprehensive database with all necessary dimensions (geography, commodity nature, mode of transportation) to understand how goods move on the U.S. transportation system. To construct such a database, geographic representations of the network system and analytical tools were extensively used. The chapter provided a number of analytical examples for modelling the demand for freight, and specifically presented the demand models developed by the CTA in recent years. Although it is mostly descriptive, the chapter also presented a simple application showing how to estimate ton-miles over the freight transportation system.

Estimation of ton-miles is an important application of such tools to gauge freight system usage and provide insightful information for national policy decisions. The methodology for estimating ton-miles over the transportation system was based on the prediction of freight flows and distances between U.S. counties. On the demand side, the disaggregation of freight flows was a necessary step to reconcile the freight database with the detailed representation of the network system and derive more accurate estimates of ton-miles. As for the transportation system, the geographic representation of multimodal and intermodal interfaces over the transportation network allows the analyst to predict likely routes that are more representative of the real-world activities involved in the transportation of goods. In sum, for a given transportation mode, ton-miles are estimated by an element-by-element multiplication between a matrix of freight volumes and a matrix of average distances for the set of likely routes.

Models of the freight demand and of the supply transportation system can also be applied to several other transportation-related problems. Besides ton-miles estimation, the analytical and geographic tools described in this chapter can help perform impact analyses at the macro level, such as *energy use and environmental impacts of the transportation activity, as well as effects of external events*. In future work, CTA will apply the proposed models to investigate alternatives (e.g. modal shifting, vehicle technologies, etc.) for alleviating some of the negative effects of the high use of fossil fuels. In addition, the network models will be used for identifying the system's vulnerability and resilience to damage and disruptions caused by natural and man-made events (e.g. hurricanes, flooding, terrorist attacks, chemical spills, etc.).

## **Author details**


Shih-Miao Chin, Francisco M. Oliveira-Neto, Ho-Ling Hwang, Diane Davidson and Bruce Peterson *Oak Ridge National Laboratory, USA* 

Lee D. Han *The University of Tennessee, Knoxville, TN, USA* 

## **Acknowledgement**

The authors are highly indebted to Michael Sprung and Rolf Schmitt of the Federal Highway Administration (FHWA) Office of Freight Operations and Management for their sponsorship of our freight system data and analysis work and their deep commitment to improving the quality and availability of freight data and analytical tools for the transportation community. We also express thanks to our colleagues who comprise the Bureau of Transportation Statistics (BTS) Commodity Flow Survey (CFS) team as well as the staff at the Bureau of the Census who worked on the CFS. Last, we appreciate the excellent maps that were prepared for this Chapter by Rob Taylor and Ryan Parten (University of Tennessee) and Frank Xu (ORNL).

The contents of this document reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. The contents do not necessarily reflect the official views or policies of the Department of Transportation or the Federal Highway Administration. This report does not constitute a standard, specification or regulation.

**Chapter 18** 

© 2012 Ponce et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


**Developing Web Geographic Information System with the NDT Methodology**

J. Ponce, A.H. Torres, M.J. Escalona, M. Mejías and F.J. Domínguez-Mayo

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47971

## **1. Introduction**

The increasing popularity of the Internet has led to the development of Web-based geographic information system applications, known as Web GIS. A geographic information system (called GIS from now on) is a software system that manages geo-referenced information: an automated system for storing, analysing and manipulating geographical information. Geographical information represents objects and actions for which geographical location is indispensable information (Aronof, 1989; Bull, 1994). In this context, a Web GIS system offers different GIS services for the analysis and visualization of geographical information on the Web (Kim, 2002).

Thus, there is a growing need for tools and methodologies that support the rapid development of this kind of Web application and its modification to meet ever-changing business needs. Recently, some frameworks have been developed (Autodesk MapGuide, ESRI ArcIMS, gvSIG, Quantum GIS, PostgreSQL + PostGIS) that enable users to develop and deploy Web GIS applications, selecting the functionalities a user needs.

In order to develop such systems, it is necessary to follow a proper development process. This process should cover the inherent characteristics of geographic information processing and also be applicable on the Web platform. With this motivation, Escalona (Escalona et al., 2008) made a first approximation of such a process, introducing a process for developing Web GIS (Geographic Information Systems) applications. This process integrated the NDT (Navigational Development Techniques) approach (Escalona & Aragon, 2008) with some of the Organizational Semiotics models (Liu, 2000). The use of the proposed development process is illustrated with a real application: the construction of the WebMaps system. WebMaps is a Web GIS system whose main goal is to support harvest planning in Brazil (Macário et al., 2007). The process can be seen in the next figure:

## **6. References**

