**4. Combinatorial optimization approach**

The combinatorial optimization approach was introduced by Abraham et al. [20] and is a versatile approach capable of matching targets at multiple agent levels for both household- and individual-level attributes. It is generally simpler and more direct than IPF. Typically, it starts by creating a trial population from the disaggregate sample data, and the overall level of fit is then assessed across all marginal targets. Units from the trial population are swapped with units chosen from the disaggregate sample, and a swap is kept only when the measure of fit improves. This is implemented through a proprietary computer program that first identifies a list of units whose aggregate attribute values match a pre-specified set of corresponding target values and then iteratively performs one of three operations: adding a unit from the sample to the list, subtracting a unit from the list, or swapping a unit between the sample and the previously identified list. That process is performed zone by zone, with the three actions (i.e., add, subtract, or swap) considered with equal probability. The developed algorithm was applied to California and Oregon to synthesize populations for their models. The California application served the California Statewide Travel Demand Model, which covers short- and long-distance travel by personal and commercial vehicles. The Oregon application served the Oregon Statewide Integrated Model, which included employment synthesis for 34 industries. Both model applications resulted in a near-perfect fit for the synthesized populations. Generally, population synthesis using combinatorial optimization has proven to be fast and flexible, and it can be applied to both households and employment. However, the algorithm could be further improved by using multicore and parallel computing techniques.
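The add, subtract, and swap loop described above can be illustrated with a minimal sketch. It is not the proprietary program of Abraham et al. [20]: the function names, the absolute-deviation fit measure, the empty starting list, and the toy household sample are all assumptions made for illustration.

```python
import random

def fit_error(units, targets):
    """Sum of absolute deviations between aggregated attributes and targets
    (an assumed fit measure, chosen only for illustration)."""
    totals = {key: 0 for key in targets}
    for unit in units:
        for key in targets:
            totals[key] += unit.get(key, 0)
    return sum(abs(totals[key] - targets[key]) for key in targets)

def synthesize_zone(sample, targets, iterations=5000, seed=0):
    """Greedy add/subtract/swap search for one zone.

    Each action is proposed with equal probability and kept only
    when the fit error strictly improves.
    """
    rng = random.Random(seed)
    population = []  # assumption: start from an empty list of units
    best = fit_error(population, targets)
    for _ in range(iterations):
        action = rng.choice(("add", "subtract", "swap"))
        trial = list(population)
        if action == "add":
            trial.append(rng.choice(sample))
        elif trial:  # subtract and swap need a non-empty list
            idx = rng.randrange(len(trial))
            if action == "subtract":
                trial.pop(idx)
            else:
                trial[idx] = rng.choice(sample)
        else:
            continue
        error = fit_error(trial, targets)
        if error < best:
            population, best = trial, error
        if best == 0:
            break
    return population, best

# Hypothetical household sample and zonal targets.
sample = [
    {"households": 1, "persons": 1, "workers": 1},
    {"households": 1, "persons": 2, "workers": 1},
    {"households": 1, "persons": 3, "workers": 2},
    {"households": 1, "persons": 4, "workers": 1},
]
targets = {"households": 100, "persons": 250, "workers": 120}

population, error = synthesize_zone(sample, targets)
print(len(population), "units selected, remaining error:", error)
```

Because each zone is processed independently in this sketch, zones could be distributed across processor cores, which is in line with the parallel computing improvement mentioned above.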

**5. Markov process-based approaches**

As demonstrated above, the IPF, IPU, and combinatorial optimization approaches rely on cloning attributes that were captured in microdata. In addition, they all share key drawbacks, including (a) fitting of a contingency table while ignoring other solutions matching the available data; (b) loss of the heterogeneity captured in the microdata due to cloning rather than true population synthesis; (c) dependency on the accuracy of captured data to determine the cloning weights, which may replicate inherent inaccuracies; and (d) limited scalability in terms of the number of attributes of synthesized agents. Hence, Markov process-based approaches were developed to overcome such drawbacks and to offer an approach that truly synthesizes populations instead of cloning them.

The earliest notable effort in this direction was made in 2013 by Farooq et al. [21], who developed a Markov chain Monte Carlo (MCMC) simulation-based approach for synthesizing populations. The proposed approach is a computer-based simulation technique that can be used to simulate a dependent sequence of random draws from complicated stochastic models. To synthesize populations, the approach uses three sources of data, namely, (a) zoning systems such as census blocks, census tracts, counties, and states; (b) samples of individuals such as the North American PUMS and the European Sample of Anonymized Records (SARs); and (c) cross-classification tables for socioeconomics and demographics, such as income by age at a certain zoning level. Assuming that in a given spatial region at any point in time there exists a true population, the MCMC simulation-based approach synthesizes that population by drawing the individual attributes from their unique joint distribution using the available partial views while ensuring

*A Critical Review on Population Synthesis for Activity- and Agent-Based Transportation Models*

*DOI: http://dx.doi.org/10.5772/intechopen.86307*

that the empirical distribution in the synthetic population is as close as possible to the unique actual distribution of that population. The proposed approach was applied to the Swiss census data, and results were compared against those developed by a conventional IPF approach. Eq. (3) illustrates the standardized root mean square error (SRMSE)-based goodness-of-fit tests that were performed on each case, and results indicated that MCMC simulation-based synthesis outperformed IPF synthesis while featuring a higher level of heterogeneity:

$$\text{SRMSE} = \frac{\left(\sum_{i=1}^{m} \cdots \sum_{j=1}^{n} \left(R_{i \ldots j} - T_{i \ldots j}\right)^{2} / N\right)^{1/2}}{\sum_{i=1}^{m} \cdots \sum_{j=1}^{n} T_{i \ldots j} / N} \tag{3}$$

where $N$ is the total number of agents; $R_{i \ldots j}$ is the number of agents with attribute values $i \ldots j$ in the synthesized population; and $T_{i \ldots j}$ is the number of agents with attribute values $i \ldots j$ in the actual population.
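As a worked illustration of Eq. (3), the following sketch computes the SRMSE for two small contingency tables. The function name `srmse`, the dictionary representation of the tables, and the age-by-income counts are assumptions made for illustration; $N$ is passed explicitly so that the code mirrors Eq. (3) term by term.

```python
from math import sqrt

def srmse(synthetic, actual, n_agents):
    """SRMSE following Eq. (3).

    synthetic, actual: dicts mapping an attribute-value tuple (i, ..., j)
    to the number of agents in that cell (R and T in Eq. (3)).
    n_agents: N, the total number of agents as defined in the text.
    """
    cells = set(synthetic) | set(actual)
    squared_diff = sum(
        (synthetic.get(cell, 0) - actual.get(cell, 0)) ** 2 for cell in cells
    )
    numerator = sqrt(squared_diff / n_agents)
    denominator = sum(actual.get(cell, 0) for cell in cells) / n_agents
    return numerator / denominator

# Hypothetical age-by-income tables for 100 agents.
actual = {
    ("young", "low"): 30, ("young", "high"): 20,
    ("old", "low"): 10, ("old", "high"): 40,
}
synthetic = {
    ("young", "low"): 28, ("young", "high"): 22,
    ("old", "low"): 12, ("old", "high"): 38,
}

print(srmse(synthetic, actual, n_agents=100))  # about 0.4
```

A value of zero indicates that the synthetic table reproduces the actual table exactly; larger values indicate a poorer fit.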

Two years later, in 2015, Casati et al. [22] proposed an extension of the MCMC simulation-based approach that simultaneously combines individual- and household-level attributes in a process named hierarchical MCMC. Furthermore, generalized raking was introduced as a technique to fit the simulated synthetic population to actual observed control totals. The hierarchical MCMC is a combination of two methods: (a) an extension of the original MCMC method that allows producing hierarchies of persons grouped into households and (b) a post-processing method to satisfy known control totals at both the individual and household levels. The extension aimed to synthesize populations with a hierarchical structure based upon ordering the agents living in the same household according to their household roles. The general formulation of the extension is based upon defining three groups of agent types (viz., owners, intermediate, and others), running Gibbs sampling on the three groups, and merging the resulting subpopulations. The proposed approach was applied to the 2008 household interview travel survey of Singapore. The application resulted in realistic synthetic populations, and an SRMSE-based test confirmed the goodness of fit of the synthesized populations and their generated hierarchical structures.
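The Gibbs sampling step that underlies these MCMC approaches can be sketched for a flat, two-attribute case. This is not the hierarchical procedure of Casati et al. [22]: the function names, the conditional probability tables, and the burn-in and thinning settings are assumptions made for illustration. Each attribute is redrawn in turn from its conditional distribution given the other, and agents are recorded after a burn-in period.

```python
import random

def draw(dist, rng):
    """Draw one value from a dict mapping value -> probability."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]

def gibbs_population(p_age_given_income, p_income_given_age, size,
                     burn_in=500, thin=5, seed=0):
    """Synthesize (age, income) agents by alternating conditional draws."""
    rng = random.Random(seed)
    # Arbitrary starting state; its influence fades during burn-in.
    income = rng.choice(list(p_age_given_income))
    agents = []
    step = 0
    while len(agents) < size:
        age = draw(p_age_given_income[income], rng)
        income = draw(p_income_given_age[age], rng)
        step += 1
        # Keep every `thin`-th draw after burn-in to reduce dependence.
        if step > burn_in and step % thin == 0:
            agents.append((age, income))
    return agents

# Hypothetical conditionals, standing in for the "partial views" of the data.
P_AGE_GIVEN_INCOME = {
    "low": {"young": 0.75, "old": 0.25},
    "high": {"young": 1 / 3, "old": 2 / 3},
}
P_INCOME_GIVEN_AGE = {
    "young": {"low": 0.6, "high": 0.4},
    "old": {"low": 0.2, "high": 0.8},
}

agents = gibbs_population(P_AGE_GIVEN_INCOME, P_INCOME_GIVEN_AGE, size=1000)
print(agents[:5])
```

In this sketch agents are drawn rather than copied from a sample, which is the distinction the text makes between synthesis and cloning.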

Saadi et al. [23] proposed an approach that integrates MCMC with profiling-based methods to capture the behavioral complexity and heterogeneity of synthesized agents. This approach used two types of datasets, namely, (a) aggregated sociodemographic and transportation-related variables derived from household travel surveys and (b) individual activity-travel diaries collected from travel diary surveys. The integrated approach consists of six steps that run on those two data types. The first step involves performing an MCMC simulation on the sociodemographic dataset. The second step synthesizes the population through a Gibbs sampling procedure. The third step selects sociodemographics for comparing behaviors in the activity-travel patterns. The fourth step uses results from the previous two steps to cluster synthesized populations according to sociodemographics and related activity sequences. The fifth step uses multiple sequence alignments to estimate a hidden Markov model (HMM). The final step characterizes the clusters, including mixed socioeconomic effects. The integrated approach was applied to the 2010 Belgian household daily travel survey. Results indicated that the integrated approach effectively captured the behavioral heterogeneity of travelers. In addition, comparisons against the IPF and IPU approaches demonstrated that the proposed integrated approach is well suited to meeting the demand for large-scale microsimulation scenarios of urban transportation systems.

Realizing the advantages of Markov process-based approaches, Saadi et al. [24] developed an extended HMM-based approach which promised better alternatives

*Transportation Systems Analysis and Assessment*


than the existing ones. More specifically, the proposed HMM-based approach promised great flexibility and efficiency in terms of data preparation and model training while being able to reproduce the structural configuration of a given population from an unlimited number of micro samples and a marginal distribution. The HMM-based approach considers population synthesis as a variant of the standard decoding problem, in which the state sequences are assumed to be unknown. Accordingly, the maximum likelihood estimators related to the transition states were determined through the Viterbi algorithm. An important advantage of the HMM-based approach is its ability to handle both continuous and discrete variables, which addresses the inherent loss of information caused by aggregating continuous variables like age. Also, the proposed HMM-based approach satisfies the need to discretize continuous variables to meet the fundamental limitation of Markov processes to discrete states. The Statistics and Machine Learning Toolbox of MATLAB was used to generate sequences from an estimated HMM in an application to the 2013 Belgian national household travel survey. Three simulations were run to illustrate the HMM-based approach. The first simulation tested the combined effects of scalability and dimensionality. The second simulation compared the HMM-based approach against IPF, and the third demonstrated the advantage of the HMM-based approach over IPF using various samples. Simulation results indicated that the proposed HMM-based approach provided accurate results due to its ability to reproduce the marginal distributions and their corresponding multivariate joint distributions with an acceptable error. Furthermore, the HMM-based approach outperformed IPF for small sample sizes while using a smaller amount of input data than IPF. In addition, simulation results demonstrated that the HMM-based approach can integrate information provided by several data sources to produce good estimates of the synthesized population.
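Generating a sequence from an estimated HMM, as mentioned above, can be sketched as follows. This is a generic illustration in Python rather than the MATLAB workflow of Saadi et al. [24]: the function name `sample_hmm`, the two hidden states, the emitted attribute classes, and all probabilities are hypothetical, and how states and emissions map to agent attributes in the original study is not specified here.

```python
import random

def sample_hmm(initial, transition, emission, length, seed=0):
    """Generate hidden states and emitted symbols from a discrete HMM.

    initial: dict state -> probability of starting in that state.
    transition: dict state -> dict next_state -> probability.
    emission: dict state -> dict symbol -> probability.
    """
    rng = random.Random(seed)

    def draw(dist):
        values, weights = zip(*dist.items())
        return rng.choices(values, weights=weights, k=1)[0]

    state = draw(initial)
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        symbols.append(draw(emission[state]))  # emit from the current state
        state = draw(transition[state])        # then move to the next state
    return states, symbols

# Hypothetical parameters: two hidden states emitting age classes.
INITIAL = {"A": 0.5, "B": 0.5}
TRANSITION = {
    "A": {"A": 0.7, "B": 0.3},
    "B": {"A": 0.4, "B": 0.6},
}
EMISSION = {
    "A": {"young": 0.8, "old": 0.2},
    "B": {"young": 0.3, "old": 0.7},
}

states, symbols = sample_hmm(INITIAL, TRANSITION, EMISSION, length=10)
print(states)
print(symbols)
```

In practice the transition and emission probabilities would be estimated from the micro samples rather than set by hand as in this sketch.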
