**2. Iterative proportional fitting approach**

Iterative proportional fitting was first introduced in 1940 by Deming and Stephan [4]. Since then, it has become the foundation of population synthesis for transportation models and is sometimes referred to as the Fratar technique [5]. The most notable realization of the IPF technique is attributed to Beckman et al. [6], who pioneered population synthesis efforts by developing a methodology for creating a synthetic baseline population of individuals and households for microscopic activity-based models. Their technique relied on census data represented by a Census Standard Tape File and a Public Use Microdata Sample (PUMS) for a given Public Use Microdata Area (PUMA) of 100,000 individuals with matching variables. In their case, the marginal totals of a multiway table were known, and a sample from the population that generated those totals was provided; thus, they applied the IPF technique to develop constrained maximum entropy estimates of the true proportions in the population multiway table. Their rationale was built upon the consensus that IPF estimates maintain the same odds ratios as those in the sample table in the absence of any marginal information, which was their case. To validate the population synthesis method, they compared demographic characteristics of the synthetic population with those of the true population using variables not involved in the synthesis. Despite their pioneering effort, Beckman et al. [6] did not provide an answer to the zero-cell problem in the PUMS; instead, they replaced each zero cell with 0.01 and imputed the corresponding household size. Müller and Axhausen [3] illustrated this as computing a series of tabulations $n_{ij}^{(k)}$, starting with the seed at $k := 0$, thus $n_{ij}^{(0)} := n_{ij}$ for all rows $i$ and columns $j$. Furthermore, they showed how that series can be computed as represented by Eq. (1):

$$n_{ij}^{(k+1)} := n_{ij}^{(k)} \cdot \begin{cases} r_i / n_{i\cdot}^{(k)} & \text{(row steps)} \\ c_j / n_{\cdot j}^{(k)} & \text{(column steps)} \end{cases} \tag{1}$$


*A Critical Review on Population Synthesis for Activity- and Agent-Based Transportation Models*


*DOI: http://dx.doi.org/10.5772/intechopen.86307*


where $n_{i\cdot}^{(k)}$ is the current row sum; $n_{\cdot j}^{(k)}$ is the current column sum; $r_i$ is the control total for row $i$; and $c_j$ is the control total for column $j$. The row and column scaling steps alternate until both margins are matched.
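As a concrete illustration, the alternating scaling of Eq. (1) can be sketched in a few lines of Python. This is a minimal, self-contained example with an invented seed table and control totals (which must share a grand total), not the implementation of Beckman et al. [6]:

```python
# Minimal IPF sketch for a two-way table: alternately scale rows and
# columns of a seed table until its margins match the control totals.
# The seed table and control totals below are invented for illustration.

def ipf(seed, r, c, iters=100, tol=1e-9):
    n = [row[:] for row in seed]              # work on a copy of the seed
    for _ in range(iters):
        for i in range(len(n)):               # row step: n_ij *= r_i / n_i.
            s = sum(n[i])
            if s > 0:
                n[i] = [v * r[i] / s for v in n[i]]
        for j in range(len(n[0])):            # column step: n_ij *= c_j / n_.j
            s = sum(row[j] for row in n)
            if s > 0:
                for row in n:
                    row[j] *= c[j] / s
        if all(abs(sum(row) - ri) < tol for row, ri in zip(n, r)):
            break                             # both margins now satisfied
    return n

fitted = ipf([[1, 2], [3, 4]], r=[40, 60], c=[50, 50])
print([[round(v, 2) for v in row] for row in fitted])
```

Note that the fitted table preserves the odds ratios of the seed, which is the property Beckman et al. relied on.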

Almost a decade later, Arentze et al. [7] addressed one of the limitations of the IPF methods, namely, generating synthetic households when the demographic data describe the population in terms of individual counts. Their solution relied on a two-step IPF procedure where, first, known marginal distributions of individuals are converted to marginal distributions of households with similar attributes and, second, the resulting marginal household distributions are used as constraints on a multiway table of household counts. Additionally, their approach aimed to assess the relevance of spatial heterogeneity across populations. The Dutch Albatross model was used as a case study and proof of concept. The validation results yielded sample biases in the synthetic population on the dimensions of socioeconomic class, the presence of children, and the availability of transport modes. However, they were able to resolve biases in the over- or underrepresentation of groups related to age and work status by fitting the relevant tables on these dimensions.

Simultaneous to the efforts of Arentze et al. [7], Guo and Bhat [8] addressed the two main drawbacks of the IPF approach, namely, the zero-cell problem and the inability to control for statistical distributions of both household- and individual-level attributes. Additionally, their study aimed to enhance the scalability and generality of the IPF method, which had required code-level changes that are cumbersome and skills that are not typically found within planning agencies, the typical users of such an approach. The algorithm developed by Guo and Bhat [8] featured generic data structures and accompanying functions to avoid the zero-cell problem, as well as revisions to the algorithm of Beckman et al. [6] to allow simultaneous control of both household- and individual-level attributes. That generic algorithm was built upon an object-oriented architecture and contained eight major steps and a recurring procedure for merging any two contingency tables with common variables. The proposed approach was used to generate a synthetic population for the Dallas-Fort Worth metropolitan area in Texas, and the statistical comparison yielded results that were closer to the true population than those of Beckman et al. [6]. In addition, Guo and Bhat [8] concluded that a higher percentage deviation from target size (PDTS) yielded a better balance in satisfying the household- and individual-level multiway distributions than lower values of PDTS.

Srinivasan et al. [9] went a step further and attempted to fine-tune existing efforts to accommodate household- and individual-level controls as well as to assess the significance of controlling individual-level attributes. That study was performed in support of Florida Department of Transportation (FDOT) efforts to incorporate sociodemographic attributes within the Florida Standard Urban Transportation Model Structure (FSUTMS). The research was motivated by the need to reduce aggregation errors, ensure sensitivity to demographic shifts such as the aging of the population, and accommodate population-specific transportation modes. That fine-tuning effort mainly aimed to address the individual-level attributes of age and gender by means of a greedy-heuristic data-fitting algorithm implemented in the matrix programming language GAUSS. Validation of the Srinivasan et al. [9] algorithm yielded satisfactory distributions of household size, age, gender, and employment status; however, the distributions for all other variables did not match well.

Given the limited number of attributes that can be synthesized per agent, researchers had to further improve the IPF approach to overcome this limitation. Pritchard and Miller [10] introduced a method that implements the IPF approach with a sparse list-based data structure that allows more attributes per agent. Additionally, they used both the conventional Monte Carlo integerization procedure and a conditional Monte Carlo procedure to synthesize a list of individual agents from the fitted tables. Despite their thorough efforts, the approach of Pritchard and Miller [10] had only a minor impact on goodness-of-fit relative to the conventional approach.
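The conventional Monte Carlo integerization step mentioned above can be illustrated as sampling whole agents with probabilities proportional to the fitted fractional cell weights. The cell labels and weights below are invented placeholders for illustration, not values from Pritchard and Miller [10]:

```python
import random

# Monte Carlo integerization sketch: draw an integer list of agents from
# fitted, fractional cell weights. The cell labels and weights below are
# invented placeholders, not values from the study.
random.seed(42)                      # fixed seed for reproducibility

fitted = {("1-person", "0 cars"): 12.4,
          ("1-person", "1 car"):  30.1,
          ("2-person", "1 car"):  57.5}

cells = list(fitted)
weights = [fitted[cell] for cell in cells]
total = 100                          # number of agents to synthesize

# each draw selects a cell with probability proportional to its weight
agents = random.choices(cells, weights=weights, k=total)
print(len(agents))
```

Each sampled cell becomes one synthesized agent carrying that cell's attribute combination; repeated draws build the agent list, at the cost of sampling noise in the realized totals.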

Auld and Mohammadian [11] developed a methodology to improve the basic IPF population synthesis routine in a manner that accounts for multiple levels of analysis units and control variables, which was a limitation of the population synthesizers mentioned hereinabove. Their methodology, named multilevel control, allows population characteristics to be replicated for a multilevel synthetic population with one level (such as households) serving as the base level of analysis. After a runtime of 16 hours, the proposed method was able to synthesize a 7.9 million-agent population for Chicago, IL, with an improved fit of the synthesized individual-level characteristics when compared with synthesis procedures that do
not account for individual-level controls. The study concluded that the improved fit comes at no cost to the fit against household-level controls. However, the developed methodology was never tested for synthesizing commercial- or business-related agents.

Lee and Fu [12] realized that the IPF-based population synthesis approaches, specifically the original synthetic reconstruction method [6] and the complementary combinatorial optimization method [13], are not generally applicable to all population synthesis scenarios. Based on a comparison by Ryan et al. [14], Lee and Fu [12] concluded that the combinatorial optimization method produces more accurate demographic information for populations over a small area and that the population synthesis problem should be evaluated from an optimization point of view. In addition, they explored how the estimation of a multiway demographic table can be formulated and solved as a constrained optimization problem in full consideration of both household- and individual-level attributes. Accordingly, that study tackled the inconsistency problem through an approach based on minimum cross-entropy theory. The validity of that model was confirmed through a case study in Singapore, in which results for a 10,641-household study area were superior to those of conventional IPF approaches. However, Lee and Fu [12] did not provide a full-scale application, which constrains the applicability of their model to theoretical applications only.
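Although Lee and Fu's exact formulation is not reproduced here, the minimum cross-entropy idea can be written in generic form (notation ours) as choosing population proportions $p_{ij}$ that stay as close as possible to the sample proportions $q_{ij}$ while honoring the marginal controls:

$$\min_{p} \sum_{i,j} p_{ij} \ln \frac{p_{ij}}{q_{ij}} \quad \text{subject to} \quad \sum_{j} p_{ij} = r_i, \quad \sum_{i} p_{ij} = c_j, \quad p_{ij} \geq 0$$

In this view, classical IPF is the special case that solves exactly this problem for two margins, which is why the optimization perspective generalizes it naturally to additional constraints.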

Zhu and Ferreira [15] were intrigued by the inability of the standard IPF algorithm to fit marginal constraints on multiple agent types simultaneously. Hence, they developed a two-stage population synthesizer that applied IPF in the first stage and then estimated the spatial pattern of household-level attributes through a second, IPF-based stage. Their two-stage algorithm consisted of four distinctive steps. The first step involved developing an estimated joint distribution of household- and individual-level attributes. In the second step, households and individuals were drawn from microdata samples. The third step consisted of a conventional IPF with household type and parcel capacity as marginal constraints. The fourth and last step estimated the marginal distribution of the other attributes from the fitted model. To validate their approach, Zhu and Ferreira [15] generated a synthetic population for Singapore. Their evaluation involved four comparisons, namely, fitting only for household-level constraints, fitting for both household- and individual-level constraints, allocating households to buildings while constraining building capacity, and repeating the previous comparison with income level constrained. Validation results yielded realistic spatial heterogeneity while preserving some of the joint distribution of household and locational characteristics.

Choupani and Mamdoohi [16] addressed the issue that integerization of IPF output yields non-integer values, for example, fractions of household- or individual-level attributes for zones. In doing so, they proposed a binary linear programming model for tabular rounding in which the integerized table totals and marginals perfectly fit the input data obtained from the Census Bureau. The main advantages of tabular rounding were that it did not bias the joint or marginal distributions of socioeconomic attributes of minority demographic groups and that it minimized the distortion to the correlation structure of the household- and individual-level non-integer tables. Furthermore, the tabular rounding approach outperformed all eight other rounding approaches. In addition, a sensitivity analysis of tabular rounding demonstrated that small and large values are equally significant when it comes to integerization. Their findings were confirmed by a comprehensive literature review [17] that they performed one year later, which concluded that IPF is the most feasible approach for synthesizing populations for agent- and activity-based transportation models once the integer conversion and zero-cell issues were resolved. In addition, they confirmed that tabular rounding is the most efficient and feasible solution for the integerization issue.

Most recently, in an effort to further enhance the IPF approach, Otani et al. [18] identified an issue that they named the modifiable attribute cell problem (MACP), which arises from combining discrete categories of individual-level attributes or from the contiguous nature of those attributes. The proposed solution to the MACP was "the organized cell set," the best combination of categories. The procedure to identify the best organized cell set consists of five steps. The first step involves aggregating the elemental cell set to find several cases of cell organization that generate large cells. The second step involves constructing base-year data using the conventional IPF approach. The third step focuses on forecasting using microscopic simulation. The fourth step involves identifying the statistically acceptable cell values using a Student's t-test. The fifth and final step takes the case with the minimum number of cells as the best cell organization. This method is computationally complex and cannot be performed using conventional optimization algorithms. Yet, it is the sole identifiable solution to the modifiable attribute cell problem.

**3. Iterative proportional updating approach**

The iterative proportional updating (IPU) approach is a heuristic approach that was developed by Ye et al. [19] to address the drawbacks of the IPF approach. Specifically, the IPU approach addresses the issue of controlling for individual-level attributes and joint distributions of personal characteristics. The IPU algorithm matches both household- and individual-level attributes in a computationally efficient manner by iteratively adjusting and reallocating weights among households of a specific type until both household- and individual-level attributes are matched. Another advantage of the IPU approach is its practicality from the implementation and computational points of view. Eq. (2) represents the mathematical optimization problem addressed by the IPU approach. In addition, the IPU approach has been generally described in 23 computational steps that can be easily coded in most, if not all, programming languages:

$$\text{Minimize} \; \sum_{j} \left( \frac{\sum_{i} d_{i,j} w_i - c_j}{c_j} \right)^2 \quad \text{or} \quad \sum_{j} \frac{\left( \sum_{i} d_{i,j} w_i - c_j \right)^2}{c_j} \quad \text{or} \quad \sum_{j} \frac{\left| \sum_{i} d_{i,j} w_i - c_j \right|}{c_j} \tag{2}$$

$$\text{Subject to} \; w_i \geq 0$$

where $i$ denotes a household ($i = 1, 2, \ldots, n$); $j$ denotes the constraint or population characteristic of interest ($j = 1, 2, \ldots, m$); $d_{i,j}$ represents the frequency of the population characteristic (household/person type $j$) in household $i$; $w_i$ is the weight attributed to the $i$th household; and $c_j$ is the value of the population characteristic $j$.

Furthermore, Ye et al. [19] proposed an alternative method to address the zero-cell problem that undermined the practicality of IPF. Their method is based on borrowing the prior information for the zero cells from PUMS data for the entire region, where zero cells are unlikely to exist as long as the control variables of interest and their categories are defined appropriately. However, that method carries the inherent risk of overrepresenting the demographic group of interest. Despite their attempt to overcome the zero-cell problem, the researchers could not overcome the zero-marginal problem that may result from the nonexistence of a certain attribute in households of a certain geographic area, for example, having no low-income households in a certain census block or tract. Furthermore, a review by Müller and Axhausen [3] pointed to the lack of a theoretical proof of convergence.
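As an illustration of the reweighting idea behind Eq. (2), the core IPU loop can be sketched as follows. This is a hedged toy example: the incidence matrix, control totals, and iteration count are invented, and it is not the 23-step procedure of Ye et al. [19]:

```python
# IPU sketch: iteratively adjust household weights w_i so that weighted
# incidence counts sum(d_ij * w_i) approach the control totals c_j for
# both household- and person-level constraints. The incidence matrix d
# and the controls c below are toy values, not data from the study.

def ipu(d, c, iters=200):
    n, m = len(d), len(c)
    w = [1.0] * n                     # start from unit weights
    for _ in range(iters):
        for j in range(m):            # one proportional update per constraint
            s = sum(d[i][j] * w[i] for i in range(n))
            if s > 0:
                ratio = c[j] / s
                for i in range(n):
                    if d[i][j] > 0:   # only reweight households containing type j
                        w[i] *= ratio
    return w

def relative_error(d, c, w):
    # average absolute relative deviation from the controls (cf. Eq. (2))
    n = len(d)
    return sum(abs(sum(d[i][j] * w[i] for i in range(n)) - cj) / cj
               for j, cj in enumerate(c)) / len(c)

# columns: household type A, household type B, persons of type X, persons of type Y
d = [[1, 0, 1, 0],    # household 1: type A, one X-person
     [1, 0, 0, 1],    # household 2: type A, one Y-person
     [0, 1, 1, 1]]    # household 3: type B, one X- and one Y-person
c = [35, 65, 85, 80]  # control totals (consistent with weights 20, 15, 65)
w = ipu(d, c)
print(round(relative_error(d, c, w), 4))
```

Each pass visits every constraint once and rescales only the households that contribute to it, which is why weights for a given household type are adjusted and reallocated jointly.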
