**Modern Methods of Estimating Biodiversity from Presence-Absence Surveys**

Robert M. Dorazio1, Nicholas J. Gotelli2 and Aaron M. Ellison<sup>3</sup> <sup>1</sup>*U.S. Geological Survey, Southeast Ecological Science Center, Gainesville, Florida* <sup>2</sup>*University of Vermont, Department of Biology, Burlington, Vermont* <sup>3</sup>*Harvard University, Harvard Forest, Petersham, Massachusetts USA*

#### **1. Introduction**

276 Biodiversity Loss in a Changing Planet

Zipperer, W., C. (1993). Deforestation patterns and their effects on forest patches. *Landscape* 

*Ecology,* Vol. 8, No. 3, pp. 177–184, ISSN 1572 - 9761

Communities of species are often sampled using so-called "presence-absence" surveys, wherein the apparent presence or absence of each species is recorded. Whereas counts of individuals can be used to estimate species abundances, apparent presence-absence data are often easier to obtain in surveys of multiple species. Presence-absence surveys also may be more accurate than abundance surveys, particularly in communities that contain highly mobile species.

A problem with presence-absence data is that observations are usually contaminated by zeros that stem from errors in detection of a species. That is, true zeros, which are associated with the absence of a species, cannot be distinguished from false zeros, which occur when species are present in the vicinity of sampling but not detected. Therefore, it is more accurate to describe apparent presence-absence data as detections and non-detections, but this terminology is seldom used in ecology.

Estimates of biodiversity and other community-level attributes can be dramatically affected by errors in detection of each species, particularly since the magnitude of these detection errors generally varies among species (Boulinier et al. 1998). For example, bias in estimates of biodiversity arising from errors in detection is especially pronounced in communities that contain a preponderance of rare or difficult-to-detect species. To eliminate this source of bias, probabilities of species occurrence and detection must be estimated simultaneously using a statistical model of the presence-absence data. Such models require presence-absence surveys to be replicated at some – but not necessarily all – of the locations selected for sampling. Replicate surveys can be obtained using a variety of sampling protocols, including repeated visits to each sample location by a single observer, independent surveys by different observers, or even spatial replicates obtained by placing clusters of quadrats or transects within a sample location. Information in the replicated surveys is crucial because it allows species occurrences to be estimated without bias by using a model-based specification of the observation process, which accounts for the errors in detection that are manifest as false zeros. Several statistical models have been developed for the analysis of replicated, presence-absence data. Each of these models includes parameters for a community's incidence matrix (Colwell

extent of the distribution of the apparent bog-specialist, *Myrmica lobifrons*, in Massachusetts and Vermont. Bogs are not commonly searched for ants, but in 1997 we had identified *M. lobifrons* as a primary component of the diet of the carnivorous pitcher plant, *Sarracenia purpurea*, at Hawley Bog in western Massachuestts. This was the first record for *M. lobifrons* in Massachusetts. At the time the taxonomic status of this species was being re-evaluated (Francoeur 1997), and it was largely unknown in the lower (contiguous) 48 states of the United States. In addition to our interest in *M. lobifrons*, we also wanted to explore whether bogs harbored a distinctive ant fauna or whether the ant faunas of bogs were simply a subset of the ant species found in the surrounding forests. Thus, at each of the sites selected for sampling, we surveyed ants in the target bog and in the upland forest adjacent to the bog (Gotelli &

Modern Methods of Estimating Biodiversity from Presence-Absence Surveys 279

At each of 22 sample sites, we established two 8 × 8 m sampling grids, each containing 25 evenly spaced pitfall traps. One sampling grid was located in the center of the bog; the other was located within intact forest 50-500 m away from the edge of the bog. Each pitfall trap consisted of a 180-ml plastic cup (95 mm in diameter) that was filled with 20 ml of dilute soapy water. Traps were buried so that the upper lip of each trap was flush with the bog or forest-soil surface, and left in place for 48 hours during dry weather. At the end of the 48 hours, trap contents were collected, immediately fixed with 95% ethanol, and returned to the laboratory where all ants were removed and identified to species. Traps were sampled twice in the summer of 1999, and the time between each sampling period was 6 weeks (42 days); therefore, we consider the two sampling periods as early- and late-summer replicates. Locations of traps were flagged so that pitfall traps were placed at identical locations during

The geographic location (latitude (LAT) and longitude (LON)) and elevation (ELEV, meters above sea level) of each bog and forest sample site was determined using a Trimble Global Positioning System (GPS). At each forest sample site we also estimated available light levels beneath the canopy using hemispherical canopy photographs, which were taken on overcast days between 10:00 AM and and 2:00 PM at 1 m above ground level with an 8 mm fish-eye lens on a Nikon F-3 camera. Leaf area index (LAI, dimensionless) was determined from the subsequently digitized photographs using HemiView software (Delta-T, Cambridge, UK). Because there was no canopy over the bog, the LAI of each bog was assigned a value of zero. To compute a global site factor (GSF, total solar radiation) for each forest sample site (Rich et al. 1993), we summed weighted values of direct site factor (DSF, total direct beam solar radiation) and indirect site factor (total diffuse solar radiation). GSF values are expressed as a percentage of total possible solar radiation (i.e., above the canopy) during the growing season (April through October), corrected for latitude and solar track. The GSF of each bog was

Digital aerial photographs were obtained for each sampled bog from state mapping authorities, or, when digital photographs were unavailable (five sites), photographic prints (from USGS-EROS) were scanned and digitized. Aerial photographs were used to construct a set of data layers (Arc-View GIS 3.2) from which bog area (AREA) was calculated. The area of the surrounding forests was not measured, as the forest was generally continuous for at least

Ellison 2002).

the two sampling periods.

assigned a value of one.

several km<sup>2</sup> around each bog.

**2.2 Measurement of site covariates**

et al. 2004, Gotelli 2000), which contains the binary occupancy state (presence or absence) of each species at each sample location. The incidence matrix is only partially observed owing to species- and location-specific errors in detection; however, the incidence matrix can be estimated by fitting these models to the replicated, presence-absence data. Therefore, any function of the incidence matrix – including species richness, alpha diversity, and beta diversity (Magurran 2004)– also can be estimated using these models.

Models for estimating species richness – and other measures of biodiversity – from replicated, presence-absence data were first developed by Dorazio & Royle (2005) and Dorazio et al. (2006). By including spatial covariates of species occurrence and detection probabilities in these models, Kéry & Royle (2009) and Royle & Dorazio (2008) estimated the spatial distribution (or map) of species richness of birds in Switzerland. Similarly, Zipkin et al. (2010) showed that this approach can be used to quantify and assess the effects of conservation or management actions on species richness and other community-level characteristics. More recently, statistical models have been developed to estimate *changes* in communities from a temporal sequence of replicated, presence-absence data. In these models the dynamics of species occurrences are specified using temporal variation in covariates of occurrence (Kéry, Dorazio, Soldaat, van Strien, Zuiderwijk & Royle 2009) or using first-order Markov processes (Dorazio et al. 2010, Russell et al. 2009, Walls et al. 2011), wherein temporal differences in occurrence probabilities are specified as functions of species- and location-specific colonization and extinction probabilities. The latter class of models, which includes the former, is extremely versatile and may be used to confront alternative theories of metacommunity dynamics (Holyoak & Mata 2008, Leibold et al. 2004) with data or to estimate changes in biodiversity. For example, Dorazio et al. (2010) estimated regional levels of biodiversity of butterflies in Switzerland using a model that accounted for seasonal changes in species composition associated with differences in phenology of flight patterns among species. Russell et al. (2009) estimated the effects of prescribed forest fire on the composition and size of an avian community in Washington.

In the present paper we analyze a set of replicated, presence-absence data that previously was analyzed using statistical models that did not account for errors in detection of each species (Gotelli & Ellison 2002). Our objective is to illustrate the inferential benefits of using modern methods to analyze these data. In the analysis we model occurrence probabilities in assemblages of ant species as a function of large-scale, geographic covariates (latitude, elevation) and small-scale, site covariates (habitat area, vegetation composition, light availability). We fit several models, each identified by a specific combination of covariates, to assess the relative contribution of these potential sources of variation in species occurrence and to estimate the effect of these contributions on geographic differences in ant species richness and other measures of biodiversity. We also provide the data and source code used in our analysis to allow comparisons between our results and those obtained using alternative methods of analysis.

#### **2. Study area and sampling methods**

#### **2.1 Ant sampling**

The data in our analysis were obtained by sampling assemblages of ant species found in New England bogs and forests. The initial motivation for sampling was to determine the extent of the distribution of the apparent bog-specialist, *Myrmica lobifrons*, in Massachusetts and Vermont. Bogs are not commonly searched for ants, but in 1997 we had identified *M. lobifrons* as a primary component of the diet of the carnivorous pitcher plant, *Sarracenia purpurea*, at Hawley Bog in western Massachuestts. This was the first record for *M. lobifrons* in Massachusetts. At the time the taxonomic status of this species was being re-evaluated (Francoeur 1997), and it was largely unknown in the lower (contiguous) 48 states of the United States. In addition to our interest in *M. lobifrons*, we also wanted to explore whether bogs harbored a distinctive ant fauna or whether the ant faunas of bogs were simply a subset of the ant species found in the surrounding forests. Thus, at each of the sites selected for sampling, we surveyed ants in the target bog and in the upland forest adjacent to the bog (Gotelli & Ellison 2002).

At each of 22 sample sites, we established two 8 × 8 m sampling grids, each containing 25 evenly spaced pitfall traps. One sampling grid was located in the center of the bog; the other was located within intact forest 50-500 m away from the edge of the bog. Each pitfall trap consisted of a 180-ml plastic cup (95 mm in diameter) that was filled with 20 ml of dilute soapy water. Traps were buried so that the upper lip of each trap was flush with the bog or forest-soil surface, and left in place for 48 hours during dry weather. At the end of the 48 hours, trap contents were collected, immediately fixed with 95% ethanol, and returned to the laboratory where all ants were removed and identified to species. Traps were sampled twice in the summer of 1999, and the time between each sampling period was 6 weeks (42 days); therefore, we consider the two sampling periods as early- and late-summer replicates. Locations of traps were flagged so that pitfall traps were placed at identical locations during the two sampling periods.

#### **2.2 Measurement of site covariates**

2 Will-be-set-by-IN-TECH

et al. 2004, Gotelli 2000), which contains the binary occupancy state (presence or absence) of each species at each sample location. The incidence matrix is only partially observed owing to species- and location-specific errors in detection; however, the incidence matrix can be estimated by fitting these models to the replicated, presence-absence data. Therefore, any function of the incidence matrix – including species richness, alpha diversity, and beta

Models for estimating species richness – and other measures of biodiversity – from replicated, presence-absence data were first developed by Dorazio & Royle (2005) and Dorazio et al. (2006). By including spatial covariates of species occurrence and detection probabilities in these models, Kéry & Royle (2009) and Royle & Dorazio (2008) estimated the spatial distribution (or map) of species richness of birds in Switzerland. Similarly, Zipkin et al. (2010) showed that this approach can be used to quantify and assess the effects of conservation or management actions on species richness and other community-level characteristics. More recently, statistical models have been developed to estimate *changes* in communities from a temporal sequence of replicated, presence-absence data. In these models the dynamics of species occurrences are specified using temporal variation in covariates of occurrence (Kéry, Dorazio, Soldaat, van Strien, Zuiderwijk & Royle 2009) or using first-order Markov processes (Dorazio et al. 2010, Russell et al. 2009, Walls et al. 2011), wherein temporal differences in occurrence probabilities are specified as functions of species- and location-specific colonization and extinction probabilities. The latter class of models, which includes the former, is extremely versatile and may be used to confront alternative theories of metacommunity dynamics (Holyoak & Mata 2008, Leibold et al. 2004) with data or to estimate changes in biodiversity. For example, Dorazio et al. (2010) estimated regional levels of biodiversity of butterflies in Switzerland using a model that accounted for seasonal changes in species composition associated with differences in phenology of flight patterns among species. Russell et al. (2009) estimated the effects of prescribed forest fire on the composition

In the present paper we analyze a set of replicated, presence-absence data that previously was analyzed using statistical models that did not account for errors in detection of each species (Gotelli & Ellison 2002). Our objective is to illustrate the inferential benefits of using modern methods to analyze these data. In the analysis we model occurrence probabilities in assemblages of ant species as a function of large-scale, geographic covariates (latitude, elevation) and small-scale, site covariates (habitat area, vegetation composition, light availability). We fit several models, each identified by a specific combination of covariates, to assess the relative contribution of these potential sources of variation in species occurrence and to estimate the effect of these contributions on geographic differences in ant species richness and other measures of biodiversity. We also provide the data and source code used in our analysis to allow comparisons between our results and those obtained using alternative

The data in our analysis were obtained by sampling assemblages of ant species found in New England bogs and forests. The initial motivation for sampling was to determine the

diversity (Magurran 2004)– also can be estimated using these models.

and size of an avian community in Washington.

**2. Study area and sampling methods**

methods of analysis.

**2.1 Ant sampling**

The geographic location (latitude (LAT) and longitude (LON)) and elevation (ELEV, meters above sea level) of each bog and forest sample site was determined using a Trimble Global Positioning System (GPS). At each forest sample site we also estimated available light levels beneath the canopy using hemispherical canopy photographs, which were taken on overcast days between 10:00 AM and and 2:00 PM at 1 m above ground level with an 8 mm fish-eye lens on a Nikon F-3 camera. Leaf area index (LAI, dimensionless) was determined from the subsequently digitized photographs using HemiView software (Delta-T, Cambridge, UK). Because there was no canopy over the bog, the LAI of each bog was assigned a value of zero. To compute a global site factor (GSF, total solar radiation) for each forest sample site (Rich et al. 1993), we summed weighted values of direct site factor (DSF, total direct beam solar radiation) and indirect site factor (total diffuse solar radiation). GSF values are expressed as a percentage of total possible solar radiation (i.e., above the canopy) during the growing season (April through October), corrected for latitude and solar track. The GSF of each bog was assigned a value of one.

Digital aerial photographs were obtained for each sampled bog from state mapping authorities, or, when digital photographs were unavailable (five sites), photographic prints (from USGS-EROS) were scanned and digitized. Aerial photographs were used to construct a set of data layers (Arc-View GIS 3.2) from which bog area (AREA) was calculated. The area of the surrounding forests was not measured, as the forest was generally continuous for at least several km<sup>2</sup> around each bog.

Site *k* Observed Partially observed species *i* 1 2 ··· *R* 1 2 ··· *R wi* 1 *y*<sup>11</sup> *y*<sup>12</sup> ··· *y*1*<sup>R</sup> z*<sup>11</sup> *z*<sup>12</sup> ··· *z*1*<sup>R</sup> w*<sup>1</sup> 2 *y*<sup>21</sup> *y*<sup>22</sup> ··· *y*2*<sup>R</sup> z*<sup>21</sup> *z*<sup>22</sup> ··· *z*2*<sup>R</sup> w*<sup>2</sup>

Modern Methods of Estimating Biodiversity from Presence-Absence Surveys 281

*n yn*<sup>1</sup> *yn*<sup>2</sup> ··· *ynR zn*<sup>1</sup> *zn*<sup>2</sup> ··· *znR wn n* + 1 0 0 ··· 0 *zn*+1,1 *zn*+1,2 ··· *zn*+1,*<sup>R</sup> wn*+<sup>1</sup>

*N* 0 0 ··· 0 *zN*<sup>1</sup> *zN*<sup>2</sup> ··· *zNR wN N* + 1 0 0 ··· 0 *zN*+1,1 *zN*+1,2 ··· *zN*+1,*<sup>R</sup> wN*+<sup>1</sup>

*M* 0 0 ··· 0 *zM*<sup>1</sup> *zM*<sup>2</sup> ··· *zMR wM* Table 1. Conceptualization of the supercommunity of *M* species used in parameter-expanded data augmentation. *Y* comprises a matrix of *n* rows of observed trap frequencies and *M* − *n*

∼ Bernoulli(Ω) where the parameter Ω denotes the probability that a species in the augmented data set is a member of the community of *N* species that are present and vulnerable to capture. Note that the community's species richness *N* is not a formal parameter of the model. Instead, *N* is a

estimation of Ω and *w* is essentially equivalent to estimation of *N* (Royle & Dorazio 2011). The incidence matrix of the community (Colwell et al. 2004, Gotelli 2000) is a parameter of the model that is embedded in an *M* × *R* matrix of parameters *Z*, whose elements indicate the presence (*z* = 1) or absence (*z* = 0) of species *i* at sample site *k*. Although *Z* is treated as a random variable of the model, each element associated with species that are not members of the community is equal to zero because *zik* is defined conditional on the value of *wi* as follows:

where *ψik* denotes the probability that species *i* is present at sample site *k*. Thus, if species *i* is not a member of the community, then *wi* = 0 and Pr(*zik* = 0|*wi* = 0) = 1; otherwise, *wi* = 1 and Pr(*zik* = 1|*wi* = 1) = *ψik*. For purposes of computing estimates of community-level characteristics, *Z* may be treated as the incidence matrix itself because the *M* − *N* rows associated with species not in the community contain only zeros and make no contribution

rows of unobserved (all-zero) trap frequencies. *Z* denotes a matrix of species- and site-specific occurrence parameters. *w* denotes a vector of parameters that indicate

> *wi iid*

derived parameter to be computed as a function of *w* as follows: *N* = ∑*<sup>M</sup>*

membership in the community of *N* species vulnerable to sampling.

. .

. .

. . . . .

. .

. . . . .

*zik*|*wi* ∼ Bernoulli(*wiψik*) (1)

. .

*<sup>i</sup>*=<sup>1</sup> *wi*. Therefore,

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

distributed (iid) as follows:

to the estimates.

#### **3. Statistical analysis**

We analyzed the captures of ant species observed at our sample sites using a modification of the multi-species model of occurrence and detection that includes site-specific covariates (Kéry & Royle 2009, Royle & Dorazio 2008). This modification allows a finite set of candidate models to be specified and fit to the data simultaneously such that prior beliefs in each model's utility can be updated (using Bayes' rule) to compute the posterior probability of each model. The resulting set of posterior model probabilities can be used to select a single ("best") model for inference or to estimate scientifically relevant quantities while averaging over the posterior uncertainty of the models (Draper 1995).

To compare our results with previous analyses (Gotelli & Ellison 2002), we analyzed the data observed in bogs and forests separately. These two habitats are sufficiently distinct that differences in species occurrence – and possibly capture rates – are expected a priori. Furthermore, the potential covariates of occurrence differ between the two habitats, adding another reason to analyze the bog and forest data separately.

#### **3.1 Hierarchical model of species occurrence and capture**

We summarize here the assumptions made in our analysis of the ant captures. Let *yik* ∈ {0, 1, . . . , *Jk*} denote the number of pitfall traps located at site *k* that contained the *i*th of *n* distinct species of ants captured in the entire sample of *R* = 22 sites. At each site 25 pitfall traps were deployed during each of 2 sampling periods (early- and late-season replicates); therefore, the total number of replicate observations per site was constant (*Jk* = 50). While constant replication among sites simplifies implementation of the model, it is not required. However, it *is* essential that *Jk* > 1 for some (ideally all) sample sites because information from within-site replicates allows both occurrence and detection probabilities to be estimated for each species. In the absence of this replication these two parameters are confounded.

The observed data form an *n* × *R* matrix *Yobs* of pitfall trap frequencies, so that rows are associated with distinct species and columns are associated with distinct sample sites. Note that *n*, the number of distinct ant species observed among all *R* sample sites, is a random outcome. In the analysis we want to estimate the total number of species *N* that are present and vulnerable to capture. Although *N* is unknown, we know that *n* ≤ *N*, i.e., we know that the number of species observed in the samples provides a lower bound for an estimate of *N*.

To estimate *N*, we use a technique called parameter-expanded data augmentation (Dorazio et al. 2006, Royle & Dorazio 2011), wherein rows of all-zero trap frequencies are added to the observed data *Yobs* and the model for the observed data is appropriately expanded to analyze the augmented data matrix *Y* = (*Yobs*, **0**). The technical details underlying this technique are described by Royle & Dorazio (2008, 2011), so we won't repeat them here. Briefly, however, the idea is to embed the unobserved, all-zero trap frequencies of the *N* − *n* species in the community within a larger data set of fixed, but known size (say, *M* species, where *M* > *N*) for the purpose of simplifying the analysis. The conventional model for the community of *N* species is necessarily modified so that each of the *M* − *n* rows of augmented data can be estimated as either belonging to the community of *N* species (and containing sampling zeros) or not (and containing structural zeros). In particular, we add a vector of parameters *w* = (*w*1,..., *wM*) to the model to indicate whether each species is a member of the community (*w* = 1) or not (*w* = 0). The elements of *w* are assumed to be independentally and identically


Table 1. Conceptualization of the supercommunity of *M* species used in parameter-expanded data augmentation. *Y* comprises a matrix of *n* rows of observed trap frequencies and *M* − *n* rows of unobserved (all-zero) trap frequencies. *Z* denotes a matrix of species- and site-specific occurrence parameters. *w* denotes a vector of parameters that indicate membership in the community of *N* species vulnerable to sampling.

distributed (iid) as follows:

4 Will-be-set-by-IN-TECH

We analyzed the captures of ant species observed at our sample sites using a modification of the multi-species model of occurrence and detection that includes site-specific covariates (Kéry & Royle 2009, Royle & Dorazio 2008). This modification allows a finite set of candidate models to be specified and fit to the data simultaneously such that prior beliefs in each model's utility can be updated (using Bayes' rule) to compute the posterior probability of each model. The resulting set of posterior model probabilities can be used to select a single ("best") model for inference or to estimate scientifically relevant quantities while averaging over the posterior

To compare our results with previous analyses (Gotelli & Ellison 2002), we analyzed the data observed in bogs and forests separately. These two habitats are sufficiently distinct that differences in species occurrence – and possibly capture rates – are expected a priori. Furthermore, the potential covariates of occurrence differ between the two habitats, adding

We summarize here the assumptions made in our analysis of the ant captures. Let *yik* ∈ {0, 1, . . . , *Jk*} denote the number of pitfall traps located at site *k* that contained the *i*th of *n* distinct species of ants captured in the entire sample of *R* = 22 sites. At each site 25 pitfall traps were deployed during each of 2 sampling periods (early- and late-season replicates); therefore, the total number of replicate observations per site was constant (*Jk* = 50). While constant replication among sites simplifies implementation of the model, it is not required. However, it *is* essential that *Jk* > 1 for some (ideally all) sample sites because information from within-site replicates allows both occurrence and detection probabilities to be estimated for each species. In the absence of this replication these two parameters are confounded. The observed data form an *n* × *R* matrix *Yobs* of pitfall trap frequencies, so that rows are associated with distinct species and columns are associated with distinct sample sites. Note that *n*, the number of distinct ant species observed among all *R* sample sites, is a random outcome. In the analysis we want to estimate the total number of species *N* that are present and vulnerable to capture. Although *N* is unknown, we know that *n* ≤ *N*, i.e., we know that the number of species observed in the samples provides a lower bound for an estimate of *N*. To estimate *N*, we use a technique called parameter-expanded data augmentation (Dorazio et al. 2006, Royle & Dorazio 2011), wherein rows of all-zero trap frequencies are added to the observed data *Yobs* and the model for the observed data is appropriately expanded to analyze the augmented data matrix *Y* = (*Yobs*, **0**). The technical details underlying this technique are described by Royle & Dorazio (2008, 2011), so we won't repeat them here. Briefly, however, the idea is to embed the unobserved, all-zero trap frequencies of the *N* − *n* species in the community within a larger data set of fixed, but known size (say, *M* species, where *M* > *N*) for the purpose of simplifying the analysis. The conventional model for the community of *N* species is necessarily modified so that each of the *M* − *n* rows of augmented data can be estimated as either belonging to the community of *N* species (and containing sampling zeros) or not (and containing structural zeros). In particular, we add a vector of parameters *w* = (*w*1,..., *wM*) to the model to indicate whether each species is a member of the community (*w* = 1) or not (*w* = 0). The elements of *w* are assumed to be independentally and identically

**3. Statistical analysis**

uncertainty of the models (Draper 1995).

another reason to analyze the bog and forest data separately.

**3.1 Hierarchical model of species occurrence and capture**

*wi iid* ∼ Bernoulli(Ω)

where the parameter Ω denotes the probability that a species in the augmented data set is a member of the community of *N* species that are present and vulnerable to capture. Note that the community's species richness *N* is not a formal parameter of the model. Instead, *N* is a derived parameter to be computed as a function of *w* as follows: *N* = ∑*<sup>M</sup> <sup>i</sup>*=<sup>1</sup> *wi*. Therefore, estimation of Ω and *w* is essentially equivalent to estimation of *N* (Royle & Dorazio 2011). The incidence matrix of the community (Colwell et al. 2004, Gotelli 2000) is a parameter of the model that is embedded in an *M* × *R* matrix of parameters *Z*, whose elements indicate the presence (*z* = 1) or absence (*z* = 0) of species *i* at sample site *k*. Although *Z* is treated as a random variable of the model, each element associated with species that are not members of the community is equal to zero because *zik* is defined conditional on the value of *wi* as follows:

$$|z\_{ik}|w\_i \sim \text{Bernoulli}(w\_i\psi\_{ik})\tag{1}$$

where *ψik* denotes the probability that species *i* is present at sample site *k*. Thus, if species *i* is not a member of the community, then *wi* = 0 and Pr(*zik* = 0|*wi* = 0) = 1; otherwise, *wi* = 1 and Pr(*zik* = 1|*wi* = 1) = *ψik*. For purposes of computing estimates of community-level characteristics, *Z* may be treated as the incidence matrix itself because the *M* − *N* rows associated with species not in the community contain only zeros and make no contribution to the estimates.

of occurrence of species *i* at the average value of the covariates. This scaling of covariates also

Modern Methods of Estimating Biodiversity from Presence-Absence Surveys 283

The additional parameter *δ* = (*δ*1,..., *δp*) in Eq. 2 is used to specify whether each covariate is

∼ Bernoulli(0.5)

which implies an equal prior probability (0.5*p*) for each of the 2*<sup>p</sup>* distinct values of *δ*. This approach, originally developed by Kuo & Mallick (1998), allows several regression models to be considered simultaneously and yields the posterior distribution of *δ*. After all models have been considered (as described in Section 3.2), the posterior probability Pr(*δ*|*Y*, *X*) of each model (vis a vis, each distinct value of *δ*) can be computed. In our analyses the model with the highest posterior probability is used to compute estimates of species occurrence and

We assume a relatively simple model of the pitfall trap frequencies *yik*, owing to the simplicity of our sampling design. Specifically, we assume that if ants of species *i* are present at site *k* (i.e., *zik* = 1), their probability of capture *pik* is the same in each of the *Jk* replicated traps. This

*yik*|*zik* ∼ Binomial(*Jk*, *zik pik*)

where *pik* denotes the conditional probability of capture of species *i* at site *k* (given *zik* = 1). Note that if species *i* is absent at site *k*, then Pr(*yik* = 0|*zik* = 0) = 1. In other words, if a species is absent at sample site *k*, then none of the *Jk* pitfall traps will contain ants of that

None of the covariates observed in our samples is thought to be informative of ant capture probabilities; therefore, rather than using a logistic-regression formulation of *pik* (as in Eq. 2),

logit(*pik*) = *a*0*<sup>i</sup>*

In order to estimate the occurrences of species not observed in any of our traps, a modeling assumption is needed to specify a relationship among all species-specific probabilities of occurrence and detection. Therefore, we assume that the ant species in each community are ecologically similar in the sense that these species are likely to respond similarly, but not identically, to changes in their environment or habitat, to changes in resources, or to changes in predation. The assumption of ecological similarity seems reasonable for the species we sampled owing to their overlapping diets, habitats, and life history characteristics. As a point of emphasis, we would *not* assume ecological similarity if our assemblage had included species of tigers and mice! The idea of ecological similarity has been used previously to analyze assemblages of songbird, butterfly, and amphibian species (Dorazio et al. 2006,

assumption implies the following binomial model of the pitfall trap frequencies:

we assume that the logit-scale probability of capture of each species is constant:

improves the stability of calculations involved in estimating *b<sup>i</sup>* = (*b*0*i*, *b*1*i*,..., *bpi*).

(*δ* = 1) or is not (*δ* = 0) included in the model. Specifically, we assume

biodiversity.

**3.1.2 Modeling species captures**

species under our modeling assumptions.

**3.1.3 Modeling heterogeneity among species**

at each of the *R* sample sites.

*δl iid*

The matrix of augmented data *Y* and the parameters *Z* and *w* may be conceptualized as characteristics of a supercommunity of *M* species (Table 1). This supercommunity includes *N* species that are members of the community vulnerable to sampling and *M* − *N* other species that are added to simplify the analysis. The parameters *Z* and *w* are paramount in terms of estimating measures of biodiversity. We have shown already that estimates of *w* are used to compute estimates of species richness *N* (a measure of gamma diversity). Similarly, *Z* may be used to estimate measures of alpha diversity, beta diversity, and other community-level characteristics. For example, summing the columns of *Z* yields the number of species present at each sample site (alpha diversity). Similarly, different columns of *Z* may be compared to express differences in species composition among sites (beta diversity). For example, the Jaccard index, a commonly used measure of beta diversity (Anderson et al. 2011), is easily computed from *Z*. The Jaccard index requires the number of species from two distinct sites, say *k* and *l*, that occur at both sites. Off-diagonal elements of the *R* × *R* matrix *Z*� *Z* contain the numbers of species shared between different sites. Therefore, the proportion of all species present at two sites, say *k* and *l*, that are common to both sites is

$$J\_{kl} = \frac{z\_k' z\_l}{z\_k' \mathbf{1} + z\_l' \mathbf{1} - z\_k' z\_l}$$

where **1** denotes a *M* × 1 vector of ones, and *z<sup>k</sup>* and *z<sup>l</sup>* denote the *k*th and *l*th columns of *Z*. Note that *Jkl* is a measure of the similarity in species present at sites *k* and *l*; its complement, 1 − *Jkl*, corresponds to the dissimilarity – or beta diversity – between sites.

In Section 4 we provide estimates of gamma diversity, alpha diversity, and beta diversity in our analyses of the ant data sets. In these analyses we assume that the community of ants contains a maximum of *M* = 75 species in the forest habitat and a maximum of *M* = 25 species in the bog habitat. The lower maximum is based on five years of collecting ants in New England bogs that yielded only 21 distinct species (Ellison and Gotelli, *personal observations*). The total number of ant species in all of New England is somewhere between 130 and 140 (Ellison et al. 2012); however, many of these species are field or grassland species, and six species, which are not indigenous to New England, are restricted mainly to warm indoors. By excluding these species and those found only in bogs, we obtain the upper limit for the number of ant species in the forest habitat.

#### **3.1.1 Modeling species occurrence probabilities**

Equation 1 implies that each element of the incidence matrix is assumed to be independent given *ψik*, the probability of occurrence of species *i* at sample site *k*. Let *x<sup>k</sup>* = (*x*1*k*, *x*2*k*,..., *xpk*) denote the observed value of *p* covariates at site *k*. We assume that each of these covariates potentially affects the species-specific probability of occurrence at site *k*. Naturally, the effects of these covariates may differ among species, so their contributions are modeled on the logit-scale as follows:

$$\text{logit}(\psi\_{ik}) = b\_{0i} + \delta\_1 b\_{1i} \mathbf{x}\_{1k} + \dots + \delta\_p b\_{pi} \mathbf{x}\_{pk} \tag{2}$$

where *b*0*<sup>i</sup>* denotes a logit-scale, intercept parameter for species *i* and *bli* denotes the effect of covariate *xl* on the probability of occurrence of species *i* (*l* = 1, . . . , *p*). If each covariate is centered and scaled to have zero mean and unit variance, *b*0*<sup>i</sup>* denotes the logit-scale probability of occurrence of species *i* at the average value of the covariates. This scaling of covariates also improves the stability of calculations involved in estimating *b<sup>i</sup>* = (*b*0*i*, *b*1*i*,..., *bpi*). The additional parameter *δ* = (*δ*1,..., *δp*) in Eq. 2 is used to specify whether each covariate is (*δ* = 1) or is not (*δ* = 0) included in the model. Specifically, we assume

$$\delta\_l \stackrel{iid}{\sim} \text{Bernoulli}(0.5)$$

which implies an equal prior probability (0.5*p*) for each of the 2*<sup>p</sup>* distinct values of *δ*. This approach, originally developed by Kuo & Mallick (1998), allows several regression models to be considered simultaneously and yields the posterior distribution of *δ*. After all models have been considered (as described in Section 3.2), the posterior probability Pr(*δ*|*Y*, *X*) of each model (vis a vis, each distinct value of *δ*) can be computed. In our analyses the model with the highest posterior probability is used to compute estimates of species occurrence and biodiversity.

#### **3.1.2 Modeling species captures**

6 Will-be-set-by-IN-TECH

The matrix of augmented data *Y* and the parameters *Z* and *w* may be conceptualized as characteristics of a supercommunity of *M* species (Table 1). This supercommunity includes *N* species that are members of the community vulnerable to sampling and *M* − *N* other species that are added to simplify the analysis. The parameters *Z* and *w* are paramount in terms of estimating measures of biodiversity. We have shown already that estimates of *w* are used to compute estimates of species richness *N* (a measure of gamma diversity). Similarly, *Z* may be used to estimate measures of alpha diversity, beta diversity, and other community-level characteristics. For example, summing the columns of *Z* yields the number of species present at each sample site (alpha diversity). Similarly, different columns of *Z* may be compared to express differences in species composition among sites (beta diversity). For example, the Jaccard index, a commonly used measure of beta diversity (Anderson et al. 2011), is easily computed from *Z*. The Jaccard index requires the number of species from two distinct sites,

say *k* and *l*, that occur at both sites. Off-diagonal elements of the *R* × *R* matrix *Z*�

*Jkl* <sup>=</sup> *<sup>z</sup>*�

*z*� *<sup>k</sup>***1** + *z*� *l* **1** − *z*� *kzl*

1 − *Jkl*, corresponds to the dissimilarity – or beta diversity – between sites.

present at two sites, say *k* and *l*, that are common to both sites is

number of ant species in the forest habitat.

logit-scale as follows:

**3.1.1 Modeling species occurrence probabilities**

the numbers of species shared between different sites. Therefore, the proportion of all species

where **1** denotes a *M* × 1 vector of ones, and *z<sup>k</sup>* and *z<sup>l</sup>* denote the *k*th and *l*th columns of *Z*. Note that *Jkl* is a measure of the similarity in species present at sites *k* and *l*; its complement,

In Section 4 we provide estimates of gamma diversity, alpha diversity, and beta diversity in our analyses of the ant data sets. In these analyses we assume that the community of ants contains a maximum of *M* = 75 species in the forest habitat and a maximum of *M* = 25 species in the bog habitat. The lower maximum is based on five years of collecting ants in New England bogs that yielded only 21 distinct species (Ellison and Gotelli, *personal observations*). The total number of ant species in all of New England is somewhere between 130 and 140 (Ellison et al. 2012); however, many of these species are field or grassland species, and six species, which are not indigenous to New England, are restricted mainly to warm indoors. By excluding these species and those found only in bogs, we obtain the upper limit for the

Equation 1 implies that each element of the incidence matrix is assumed to be independent given *ψik*, the probability of occurrence of species *i* at sample site *k*. Let *x<sup>k</sup>* = (*x*1*k*, *x*2*k*,..., *xpk*) denote the observed value of *p* covariates at site *k*. We assume that each of these covariates potentially affects the species-specific probability of occurrence at site *k*. Naturally, the effects of these covariates may differ among species, so their contributions are modeled on the

where *b*0*<sup>i</sup>* denotes a logit-scale, intercept parameter for species *i* and *bli* denotes the effect of covariate *xl* on the probability of occurrence of species *i* (*l* = 1, . . . , *p*). If each covariate is centered and scaled to have zero mean and unit variance, *b*0*<sup>i</sup>* denotes the logit-scale probability

logit(*ψik*) = *b*0*<sup>i</sup>* + *δ*1*b*1*ix*<sup>1</sup>*<sup>k</sup>* + ··· + *δpbpixpk* (2)

*kzl*

*Z* contain

We assume a relatively simple model of the pitfall trap frequencies *yik*, owing to the simplicity of our sampling design. Specifically, we assume that if ants of species *i* are present at site *k* (i.e., *zik* = 1), their probability of capture *pik* is the same in each of the *Jk* replicated traps. This assumption implies the following binomial model of the pitfall trap frequencies:

$$y\_{ik}|z\_{ik} \sim \text{Binomial}(f\_{k'}z\_{ik}p\_{ik})$$

where *pik* denotes the conditional probability of capture of species *i* at site *k* (given *zik* = 1). Note that if species *i* is absent at site *k*, then Pr(*yik* = 0|*zik* = 0) = 1. In other words, if a species is absent at sample site *k*, then none of the *Jk* pitfall traps will contain ants of that species under our modeling assumptions.

None of the covariates observed in our samples is thought to be informative of ant capture probabilities; therefore, rather than using a logistic-regression formulation of *pik* (as in Eq. 2), we assume that the logit-scale probability of capture of each species is constant:

$$\text{logit}(p\_{ik}) = a\_{0i}$$

at each of the *R* sample sites.

#### **3.1.3 Modeling heterogeneity among species**

In order to estimate the occurrences of species not observed in any of our traps, a modeling assumption is needed to specify a relationship among all species-specific probabilities of occurrence and detection. Therefore, we assume that the ant species in each community are ecologically similar in the sense that these species are likely to respond similarly, but not identically, to changes in their environment or habitat, to changes in resources, or to changes in predation. The assumption of ecological similarity seems reasonable for the species we sampled owing to their overlapping diets, habitats, and life history characteristics. As a point of emphasis, we would *not* assume ecological similarity if our assemblage had included species of tigers and mice! The idea of ecological similarity has been used previously to analyze assemblages of songbird, butterfly, and amphibian species (Dorazio et al. 2006,

Posterior probability

Habitat Covariates Uniform prior Jeffreys' prior Forest LAT, LAI, GSF, ELEV 0.818 0.767 Forest LAT, LAI, ELEV 0.177 0.229 Forest LAT, ELEV 0.005 0.003 Forest LAT, GSF, ELEV < 0.001 0.001 Bog ELEV 0.424 0.416 Bog None 0.342 0.412 Bog LAT 0.082 0.070 Bog AREA, ELEV 0.060 0.034 Bog LAT, ELEV 0.045 0.029 Bog AREA 0.038 0.036 Bog LAT, AREA 0.006 0.003 Bog LAT, AREA, ELEV 0.004 0.001

Modern Methods of Estimating Biodiversity from Presence-Absence Surveys 285

Table 2. Posterior probabilities of models containing different covariates of species occurrence probabilities. Covariates include latitude (LAT), leaf area index (LAI), light availability (GSF), elevation (ELEV), and bog area (AREA). Models with less than 0.001

in evaluating the marginal likelihood function. We therefore adopted a Bayesian approach to inference and used Markov chain Monte Carlo methods (Robert & Casella 2004) to fit the model. In the appendix (Section 7) we describe our choice of prior distributions for the model's parameters. We also provide the data and the computer code that was used to calculate the joint posterior distribution of the model's parameters. All parameter estimates

The posterior model probabilities calculated in our analysis of forest and bog data sets are only mildly sensitive to our choice of priors for the logit-scale parameters of the model (Table 2). Recall that these parameters are of primary interest in assessing the relative contributions of geographic- and site-level covariates. Regardless of the prior distribution used (Uniform or Jeffreys' (see appendix)), the model with highest probability includes all four covariates (LAT, LAI, GSF, ELEV) in the analysis of data observed at forest sample sites and a single covariate (ELEV) in the analysis of data observed at bog sample sites. However, the model without any covariates has nearly equal probability to the favored model of the bog data, and the combined probability of these two models far exceeds the probabilities of all other models. These results suggest that occurrence probabilities of ant species found in the bog habitat are not strongly influenced by the LAT or AREA covariates, either alone or in combination with

posterior probability are not shown.

**4. Results**

other covariates.

and credible intervals are based on this distribution.

**4.1 Effects of covariates on species occurrence**

Kéry, Royle, Plattner & Dorazio 2009, Walls et al. 2011); however, this idea is not universally applicable. For example, if the occurrence of one species depends on the presence or absence of another species (as might occur between a predator and prey species or between strongly competing species), then ecological similarity would not be a reasonable assumption. In this case a model must be formulated to specify the pattern of co-occurrence that arises from interspecific interactions (MacKenzie et al. 2004, Waddle et al. 2010). The formulation of statistical models for inferring interspecific interactions in communities of species is an important and developing area of research (Dorazio et al. 2010).

In assemblages of ecologically similar species, it seems reasonable to use distributional assumptions to model unobserved sources of heterogeneity in probabilities of species occurrence and detection. For example, occurrence probabilities may be low for some species (the rare ones) and high for others, but all species are related in the sense that they belong to a larger community of ecologically similar species. By modeling the heterogeneity among species in this way, the data observed for any individual species influence the parameter estimates of every other species in the community. In other words, inferences about an individual species do not depend solely on the observations of that species because the inferences borrow strength from the observations of other species. A practical manifestation of this multispecies approach is that the estimate of a parameter (e.g., occurrence probability) of a single species reflects a compromise between the estimate that would be obtained by analyzing the data from each species separately and the average value of that parameter among all species in the community. In the statistical literature this phenomenon is called "shrinkage" (Gelman et al. 2004) because each species-specific estimate is shrunk in the direction of the estimated average parameter value. Of course, the amount of shrinkage depends on the relative amount of information about the parameter in the observations of each species versus the information about the mean value of that parameter. An important benefit of shrinkage is that it allows parameters to be estimated for a species that is detected with such low frequency that its parameters could otherwise not be estimated. Such species are often the rarest members of the community, and it is crucial that these species be included in the analysis to ensure that estimates of biodiversity are accurate.

In the present analysis we use a normal distribution

$$
\begin{bmatrix} b\_{0i} \\ a\_{0i} \end{bmatrix} \stackrel{iid}{\sim} \text{Normal}\left( \begin{bmatrix} \beta\_0 \\ \alpha\_0 \end{bmatrix}, \begin{bmatrix} \sigma\_{b\_0}^2 & \rho \sigma\_{b\_0} \sigma\_{a\_0} \\ \rho \sigma\_{b\_0} \sigma\_{a\_0} & \sigma\_{a\_0}^2 \end{bmatrix} \right) \tag{3}
$$

to specify the variation in occurrence and detection probabilities among ant species. The parameters *σb*<sup>0</sup> and *σa*<sup>0</sup> denote the magnitude of this variation, and *ρ* parameterizes the extent to which species occurrence and detection probabilities are correlated.

We also use the normal distribution to specify variation among the species-specific effects of covariates on occurrence. Specifically, we assume *bli iid* <sup>∼</sup> Normal(*βl*, *<sup>σ</sup>*<sup>2</sup> *bl* ) (for *l* = 1, . . . , *p*), so that the effects of different covariates are assumed to be mutually independent and uncorrelated.

#### **3.2 Parameter estimation**

The hierarchical model described in Section 3.1 would be impossible to fit using classical methods owing to the high-dimensional and analytically intractable integrations involved


Table 2. Posterior probabilities of models containing different covariates of species occurrence probabilities. Covariates include latitude (LAT), leaf area index (LAI), light availability (GSF), elevation (ELEV), and bog area (AREA). Models with less than 0.001 posterior probability are not shown.

in evaluating the marginal likelihood function. We therefore adopted a Bayesian approach to inference and used Markov chain Monte Carlo methods (Robert & Casella 2004) to fit the model. In the appendix (Section 7) we describe our choice of prior distributions for the model's parameters. We also provide the data and the computer code that was used to calculate the joint posterior distribution of the model's parameters. All parameter estimates and credible intervals are based on this distribution.
