**Part 2**

**Evaluating Analytical Data** 

Modern Approaches To Quality Control


## **Partitioning Error Sources for Quality Control and Comparability Analysis in Biological Monitoring and Assessment**

James B. Stribling
*Tetra Tech, Inc., Center for Ecological Sciences, Owings Mills, Maryland, USA* 

*"…measurements are not passive accountings of an objective world but active interactions in which the thing measured and the way it is measured contribute inseparably to the outcome." (Lindley 2007: p. 154, attributing the concept to Niels Bohr)* 

*"The experienced scientist has to learn to anticipate the possible sources of systematic error..." (Taylor 1997: p. 97)* 

*"No simple theory tells us what to do about systematic errors. In fact, the only theory of systematic errors is that they must be identified and reduced...." (Taylor 1997: p. 106)* 

 *"…the only reason to carry out a test is to improve a process, to improve the quality..." (Deming 1986: p. i)* 

## **1. Introduction**

Rationally, as scientists, we recognize that documented standard procedures constitute the first requirement for developing consistency within and among datasets; the second step is putting the procedures into practice. If the procedures were implemented as perfectly as they are written, there would be no need to question data. However, we are also cognizant of the fact that humans (a group of organisms to which we cannot deny holding membership) are called upon to use the procedures, and the consistency and rigor with which the procedures are applied are directly affected by an individual's skill, training, attention span, energy, and focus (Edwards, 2004). In fact, we fully expect inconsistency due to human foibles, and often substantial portions of careers are spent in efforts to recognize, isolate, correct, and minimize future occurrences of, error.

Many public and private organizations in the United States (US) and other countries collect aquatic biological data using a variety of sampling and analysis methods (Gurtz & Muir, 1994; ITFM, 1995a; Carter & Resh, 2001), often to meet regulatory requirements, for example those of the United States' Clean Water Act (CWA) of 1972 (USGPO, 1989). While the information collected by an individual organization is usually directly applicable to a specific question or site-specific issue, the capacity for using it more broadly for comprehensive assessment has been problematic due to unknown data quality produced by different methods or programs (ITFM, 1995a; Diamond et al., 1996; NWQMC, 2001; The Heinz Center, 2002; GAO, 2004). If the occurrence and magnitude of error in datasets is unknown, a supportable conclusion based solely (or even in part) on those data is problematic at best. These datasets are more difficult to justify for analyses, communicate to broader audiences, base policy decisions on, and defend against potential misuse (Costanza et al., 1992; Edwards, 2004). Ensuring that the measurement system produces data that can be defended requires understanding the potential error sources that can affect variability of the data, and approaches for monitoring the magnitude of error expression.

The purpose of this chapter is to communicate the concept of biological monitoring and assessment as a series of methods, each of which produces data and is as subject to error as any other measurement system. It describes specific QC techniques and analyses that can be used to monitor variability (i.e., error), identify causes, and develop corrective actions to reduce or otherwise keep error rates within acceptable limits. The chapter concludes by demonstrating that comparability analysis for biological data and assessment results is a two-step process: 1) characterizing data quality, or the magnitude of error rates, associated with each method or dataset, and 2) determining acceptability. Note that specific methods are not recommended in this chapter; rather, the emphasis is that, whatever methods are used, data quality and performance should be quantified. Additionally, special emphasis is given to biological monitoring in which benthic macroinvertebrate sampling provides the primary data, but conceptually this approach to QC is also applicable to other organism groups.

## **2. Quality control**

Quality control (QC) is a process by which tests are designed and performed to document the existence and causes of error (= variability) in data, to help determine what can be done to minimize or eliminate that error, and to develop, communicate, and monitor corrective actions (CA). Further, it should be possible to implement the QC process (Figure 1) in a routine manner such that, when those causes are not present, the cost of searching for them does not exceed budgetary constraints (Shewhart, 1939).

Fig. 1. Quality control (QC) process for determining the presence of and managing error rates, and thus, the acceptability of data quality.
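The decision loop in Figure 1 reduces to a comparison of a measured error rate against a pre-set acceptance threshold. As a minimal sketch (the function name, error metric, and measurement quality objective (MQO) value are hypothetical, not taken from the chapter):

```python
def qc_check(error_rate: float, mqo: float) -> str:
    """Compare a measured error rate against a measurement quality
    objective (MQO); hypothetical threshold logic for illustration."""
    if error_rate <= mqo:
        return "acceptable"        # data quality documented, no action needed
    return "corrective action"     # identify cause, correct, re-test

# Example: taxonomic disagreement rates against a 15% MQO (values invented)
print(qc_check(0.12, 0.15))  # acceptable
print(qc_check(0.21, 0.15))  # corrective action
```

The same comparison applies to any of the error rates discussed below (sorting efficiency, taxonomic disagreement, enumeration difference); only the metric and the MQO change.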


The programmatic system that contains not only a series of QC tests and analyses, but also provides for organization and management of personnel, acquisition and maintenance of equipment and supplies essential to data collection, information management, information technology resources, safety protocols and facilities, enforcement of corrective actions, and budgetary support, is quality assurance (QA). It is acceptable to use the two terms jointly in reference to an overall quality program, as they often are, as QA/QC, but they should not be used interchangeably. The overall program is QA; the process for identifying and reducing error is QC.

Overall variability of data (= total uncertainty, or error) from any measurement system results from the accumulation of error from multiple sources (Taylor, 1988; Taylor & Kuyatt, 1994; Diamond et al., 1996; Taylor, 1997). Error can generally be divided into two types: systematic and random. Systematic error is the variability that results from a method and its application or misapplication; it is composed of bias that can, in part, be mitigated by an appropriate quality assurance program of training, audits, and documentation. Random error results from the sample itself or the population from which it is derived, and can only partly be controlled through a careful sampling design. It is often not possible to separate the effects of the two types of error, and they can directly influence each other (Taylor, 1988). The overall magnitude of error associated with a dataset is known as data quality; how statements of data quality are made and communicated is critical for data users and decision makers to properly evaluate the extent to which they should rely on technical, scientific information (Keith, 1988; Peters, 1988; Costanza et al., 1992). Thus, an effective set of QC procedures not only helps reduce error in datasets, it provides tools for objective communication of uncertainty.
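Under the simplifying assumption that error sources are independent and additive (the text cautions that in practice they can interact), component variances sum to the total, and their relative shares show where control effort buys the most. The component names and magnitudes below are invented for illustration:

```python
# Independent, additive error sources: s2_total = sum of component s2.
# Component names and variance magnitudes are hypothetical illustrations.
components = {
    "field sampling": 4.0,
    "sorting/subsampling": 1.5,
    "taxonomy": 2.5,
    "data entry": 0.5,
}

s2_total = sum(components.values())
print(f"total variance = {s2_total}")

# Share of total error attributable to each step -- the point of
# partitioning: it shows which component dominates overall uncertainty.
for step, s2 in components.items():
    print(f"{step:20s} {100 * s2 / s2_total:5.1f}%")
```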

Biological assessment protocols are measurement systems consisting of a series of methods, each of which contributes to overall variability (Diamond et al., 1996; Cao et al., 2003; Brunialti et al., 2004; Flotemersch et al., 2006; Haase et al., 2006; Nichols et al., 2006; Blocksom & Flotemersch, 2008) (Figure 2). Our capacity as practitioners to control the rates and magnitudes of error requires that some attention be given to each component of the protocol. While it could be argued that error arising from any single component has only trivial effects on the overall indicator, lack of testing and documentation substantially weakens that assertion and opens the results to question. In fact, information without associated data quality characteristics might not even be considered data.

Fig. 2. Total error or variability (s²) associated with a biological assessment is a combined result of that for each component of the process (Flotemersch et al., 2006).


## **3. Indicators**

All aquatic ecosystems are susceptible to cumulative impacts from human-induced disturbances including inorganic and organic chemical pollution, hydrologic alteration, channelization, overharvest, invasive species, and land cover conversion. Because they live in the presence of existing water chemistry and physical habitat conditions, the aquatic life of these systems (fish, insects, plants, shellfish, amphibians, reptiles, etc.) integrates cumulative effects of multiple stressors that are produced by both point and non-point source (NPS) pollution. The most common organism groups used by routine biological monitoring and assessment programs are benthic macroinvertebrates (aquatic insects, snails, mollusks, crustaceans, worms, and mites), fish, and/or algae, with indicators most often taking the form of a multimetric Index of Biological Integrity (IBI; Karr et al., 1986; Hughes et al., 1998; Barbour et al., 1999; Hill et al., 2000, 2003) or a predictive observed/expected (O/E) model based on the River Invertebrate Prediction and Classification System (RIVPACS; Clarke et al., 1996, 2003; Hawkins et al., 2000; Hawkins, 2006). Of these three groups, benthic macroinvertebrates (BM) are commonly used because the protocols are the most well-established, the level of effort required for field sampling is reasonable (Barbour et al., 1999), and taxonomic expertise is relatively easily accessible. Thus, examples of QC tests and corrective actions discussed in this chapter are largely focused on benthic macroinvertebrates in the context of multimetric indexes, though similar procedures for routine monitoring with algae and fish could be developed. Stribling et al. (2008) also used some of these procedures for documenting performance of O/E models.
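For the O/E indicator, a minimal sketch of the idea (greatly simplified from RIVPACS-type models; the taxa and capture probabilities are invented): E is the sum of model-predicted capture probabilities for taxa with p ≥ 0.5, and O is how many of those expected taxa were actually observed.

```python
# Simplified O/E: capture probabilities are hypothetical model output.
predicted_p = {"Baetis": 0.9, "Hydropsyche": 0.8, "Optioservus": 0.6,
               "Drunella": 0.55, "Chironomus": 0.4}
observed = {"Baetis", "Hydropsyche", "Chironomus"}

# Conventional p >= 0.5 cutoff defines the "expected" taxa list
expected_taxa = {t: p for t, p in predicted_p.items() if p >= 0.5}
E = sum(expected_taxa.values())            # expected richness
O = len(expected_taxa.keys() & observed)   # expected taxa actually found
print(f"O/E = {O / E:.2f}")                # values near 1 indicate reference-like condition
```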

## **4. Potential error sources in indicators**

#### **4.1 Field sampling**

Whether the target assemblage is benthic macroinvertebrates, fish, or algae, the first step of biological assessment is to use standard field methods to gather a sample representing the taxonomic diversity and functional composition of a reach, zone, or other stratum of a waterbody. The actual dimensions of the sampling area ultimately depend on technical objectives and programmatic goals of the monitoring activity (Flotemersch et al., 2010). The spatial area from which the biological sample is drawn is that segment or portion of the waterbody the sample is intended to represent; for analyses and higher level interpretation, biological indicators are considered equivalent to the site. For its national surveys of lotic waters (streams and rivers), the U.S. Environmental Protection Agency defines a sample reach as 40x the mean wetted width (USEPA, 2004a); many individual states use a fixed 100 m as the sampling reach.
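The two reach-length conventions above amount to simple arithmetic; a sketch (the function names and the 3.5 m example width are invented):

```python
def reach_length_usepa(mean_wetted_width_m: float) -> float:
    """USEPA national-survey convention: reach = 40x mean wetted width."""
    return 40.0 * mean_wetted_width_m

def reach_length_fixed() -> float:
    """Fixed-reach convention used by many individual states."""
    return 100.0

print(reach_length_usepa(3.5))  # 140.0 m reach for a 3.5 m wide stream
print(reach_length_fixed())     # 100.0 m regardless of width
```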

Benthic macroinvertebrate samples are collected along 11 transects evenly distributed throughout the reach length, with a D-frame net with 500-µm mesh openings used to sample multiple habitats (Klemm et al., 1998; USEPA, 2004a; Flotemersch et al., 2006). An alternative approach to transects is to estimate the proportion of different habitat types in a defined reach (e.g., 100 m), and distribute a fixed level of sampling effort in proportion to their frequency of occurrence throughout the reach (Barbour et al., 1999, 2006). For both approaches, organic and inorganic sample material (leaf litter, small woody twigs, silt, and sand) is composited in one or more containers, preserved with 95% denatured ethanol, and delivered to laboratories for processing. A composite sample over multiple habitats in a reach is a common protocol feature of many monitoring programs throughout the US (Carter & Resh, 2001).
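The proportional-effort alternative can be sketched as an apportionment problem: distribute a fixed number of sampling jabs across habitat types according to their estimated frequency. The 20-jab total and the habitat percentages below are hypothetical, and the largest-remainder rounding is one reasonable choice rather than a prescribed method:

```python
def allocate_jabs(habitat_pct: dict[str, float], total_jabs: int = 20) -> dict[str, int]:
    """Largest-remainder apportionment of a fixed sampling effort
    across habitat types, proportional to their estimated frequency."""
    raw = {h: total_jabs * p / 100.0 for h, p in habitat_pct.items()}
    jabs = {h: int(x) for h, x in raw.items()}
    # hand leftover jabs to the habitats with the largest fractional remainders
    leftover = total_jabs - sum(jabs.values())
    for h in sorted(raw, key=lambda h: raw[h] - jabs[h], reverse=True)[:leftover]:
        jabs[h] += 1
    return jabs

# Hypothetical habitat composition of a 100 m reach
print(allocate_jabs({"riffle": 50, "snag": 25, "bank": 15, "macrophyte": 10}))
```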

#### **4.2 Laboratory processing**



Processing of benthic macroinvertebrate samples is a 3-step process. Sorting and subsampling serves to 1) isolate individual organisms from nontarget material, such as leaf litter and other detritus, bits of woody material, silt, and sand, and 2) prepare the sample (or subsample) for taxonomic identification. Taxonomic identification serves to match nomenclature to specimens in the sample, and enumeration provides the actual counts, by taxon, of everything contained within the sample.

Although it is widely recognized that subsampling helps to manage the level of effort associated with bioassessment laboratory work (Carter & Resh, 2001), the practice has been the subject of much debate (Courtemanch, 1996; Barbour & Gerritsen, 1996; Vinson & Hawkins, 1996). Fixed organism counts vary among monitoring programs (Carter & Resh, 2001), with 100, 200, 300 and 500 counts being most often used (Barbour et al., 1999; Cao & Hawkins, 2005; Flotemersch et al., 2006). Flotemersch & Blocksom (2005) concluded that a 500-organism count was most appropriate for large/nonwadeable river systems, based on examination of the relative increase in richness metric values (< 2%) between successive 100-organism counts. However, they also suggested that a 300-organism count is sufficient for most study needs. Others have recommended higher fixed counts, including a minimum of 600 in wadeable streams (Cao & Hawkins, 2005). The subsample count used for the USEPA national surveys is 500 organisms (USEPA, 2004b); many states use 200 or 300 counts.
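The stopping-rule logic of Flotemersch & Blocksom (2005) can be illustrated by simulation: draw successive 100-organism increments from a sample and track the relative gain in taxa richness. The community composition below is invented:

```python
import random

random.seed(1)

# Hypothetical composited sample: 30 taxa with skewed abundances
# (taxon 0 most common), thoroughly mixed before subsampling
pool = [taxon for taxon in range(30) for _ in range(200 // (taxon + 1) + 1)]
random.shuffle(pool)

seen: set[int] = set()
prev = 0
for stop in range(100, 600, 100):
    seen.update(pool[stop - 100:stop])   # next 100-organism increment
    if prev:
        print(f"{stop} organisms: {len(seen)} taxa "
              f"(+{(len(seen) - prev) / prev:.1%} vs previous count)")
    else:
        print(f"{stop} organisms: {len(seen)} taxa")
    prev = len(seen)
```

Once the percent gain between successive counts falls below a chosen threshold (< 2% in the cited study), little richness information is added by sorting further.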

If organisms are missed during the sorting process, bias is introduced in the resulting data. Thus, the primary goal of sorting is to completely separate organisms from organic and inorganic material (e.g., detritus, sediment) in the sample. A secondary goal of sorting is to provide the taxonomist with a sample for which the majority of specimens are identifiable. Note that the procedure described here assumes that the sorter and the taxonomist are different personnel. Although it is not the decision of the sorter whether an organism is identifiable, straightforward rules can be applied that minimize specimen loss. For example, "counting rules" can be part of the standard operating procedures (SOP) for both the sorting/subsampling and taxonomic identification, such as specifying what not to count:

- Non-benthic organisms, such as free-swimming gyrinid adults (Coleoptera) or surface-dwelling veliids (Heteroptera)
- Non-macroinvertebrates, such as copepods, cladocera, and ostracods
- Incidental collections, such as terrestrial insects or aquatic vertebrates (fish, frogs or tadpoles, snakes, or other)
- Non-headed worm fragments
- Damaged insects and crustaceans that lack at least a head and thorax
- Larvae or pupae where internal tissue has broken down to the point of floppiness
- Exuviae (molted "skins")
- Empty mollusk shells (Mollusca: Bivalvia and Gastropoda)

If a sorter is uncertain about whether an organism is countable, the specimen should be placed in the vial and not added to the rough count total.

The sorting/subsampling process is based on randomly selecting portions of the sample detritus spread over a gridded Caton screen (Caton, 1991; Barbour et al., 1999; see also Figures 6-4a, b of Flotemersch et al., 2006 [note that an individual grid square is 6 cm × 6 cm, or 36 cm², *not* 6 cm² as indicated in Figure 6-4b]). Prior to beginning the sorting/subsampling process, it is important that the sample be mixed thoroughly and distributed evenly across the sorting tray to reduce the effect of organism clumping that may have occurred in the sample container. The grids are randomly selected, individually removed from the screen, placed in a sorting tray, and all organisms removed with forceps;
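A sketch of the grid-based subsampling routine (the 30-grid screen and 500-organism fixed count follow the sources cited above; the per-grid organism counts are simulated, and the sketch assumes whole grids are sorted even if the target is passed mid-grid):

```python
import random

random.seed(7)

TARGET = 500                   # fixed-count subsample size
grids = list(range(1, 31))     # 30 squares on a gridded Caton screen
random.shuffle(grids)          # random selection order, without replacement

sorted_grids, total = [], 0
for grid in grids:
    if total >= TARGET:
        break                          # stop once the fixed count is reached
    count = random.randint(30, 90)     # organisms found in this grid (simulated)
    sorted_grids.append(grid)
    total += count

print(f"sorted {len(sorted_grids)} grids, {total} organisms")
```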

Partitioning Error Sources for Quality Control and

**4.4 Data reduction/indicator calculation** 

in a multimetric index comprised of seven metrics (Table 2).

backup.

Comparability Analysis in Biological Monitoring and Assessment 65

*Hydropsyche* or *Hydrophilus*, and the data entry technician on autopilot might continue as normal. There are also, increasingly, uses of e-tablets for entering field observation data, or direct entry of laboratory data into spreadsheets, obviating the need for hardcopy paper

There is a large number of potential metrics that monitoring programs can use (Barbour et al., 1999; Blocksom & Flotemersch, 2005; Flotemersch et al., 2006), requiring testing, calibration, and final selection before being appropriate for routine application. Blocksom & Flotemersch (2005) tested 42 metrics relative to different sampling methods, mesh sizes, and habitat types, some of which are based on taxonomic information, as well as stressor tolerance, functional feeding group, and habit. Other workers and programs have tested more and different ones. For example, the US state of Montana calibrated a biological indicator for wadeable streams of the "mountains" site class (Montana DEQ 2006), resulting

**Taxon Target**

Dolichopodidae (Dolichopodidae) Phoridae (Phoridae) Scathophagidae (Scathophagidae) Syrphidae (Syrphidae) Decapoda Family Hirudinea Family Hydrobiidae (Hydrobiidae) Nematoda (Nematoda) Nematomorpha (Nematomorpha) Nemertea (Nemertea) Turbellaria (Turbellaria) **Chironomidae, the following genera are combined under**  *Cricotopus/Orthocladius Cricotopus Cricotopus/Orthocladius Orthocladius Cricotopus/Orthocladius Cricotopus/Orthocladius Cricotopus/Orthocladius Orthocladius/Cricotopus Cricotopus/Orthocladius* **Chironomidae, the following genera are combined under**  *Thienemannimyia genus group Conchapelopia Thienemannimyia* genus group *Rheopelopia Thienemannimyia* genus group *Helopelopia Thienemannimyia* genus group *Telopelopia Thienemannimyia* genus group *Meropelopia Thienemannimyia* genus group *Hayesomia Thienemannimyia* genus group *Thienemannimyia Thienemannimyia* genus group **Hydropsychidae, the following genera are combined under**  *Hydropsyche Hydropsyche Hydropsyche Ceratopsyche Hydropsyche Hydropsyche/Ceratopsyche Hydropsyche Ceratopsyche/Hydropsyche Hydropsyche*

Table 1. In this example list of hierarchical target levels, all taxa are targeted for identification to genus level, unless otherwise noted. Taxa with target levels in parentheses are left at that level.

Ceratopogonidae Ceratopogoninae, leave at subfamily;

all others, genus level

the process is completed until the rough count by the sorter exceeds the target subsample size. There should be at least three containers produced per sample, all of which should be clearly labeled: 1) subsample to be given to taxonomist, 2) sort residue to be checked for missed specimens, and 3) unsorted sample remains to be used for additional sorting, if necessary.
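The random-grid draw and rough count described above can be sketched in a few lines (a minimal illustration; the `draw_subsample` function, the seed, and the uniform per-grid counts are hypothetical, and real rough counts vary by grid):

```python
import random

def draw_subsample(grid_counts, target=300, seed=1):
    """Draw randomly selected grids until the rough count reaches the target."""
    rng = random.Random(seed)
    order = rng.sample(sorted(grid_counts), k=len(grid_counts))  # random grid order
    picked, rough_count = [], 0
    for grid in order:
        picked.append(grid)
        rough_count += grid_counts[grid]
        if rough_count >= target:  # stop once the target subsample size is exceeded
            break
    return picked, rough_count

# 30 grids, each holding a hypothetical 25 organisms
picked, n = draw_subsample({g: 25 for g in range(30)}, target=300)
print(len(picked), n)
```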

The next step of the laboratory process is identifying the organisms within the subsample. A major question associated with taxonomy for biological assessments is the hierarchical target levels required of the taxonomist, including order, family, genus, species or the lowest practical taxonomic level (LPTL). While family level is used effectively in some monitoring programs (Carter & Resh 2001), the taxonomic level primarily used in most routine monitoring programs is genus. However, even with genus as the target, many programs treat selected groups differently, such as midges (Chironomidae) and worms (Oligochaeta), due to the need for slide-mounting. Slide-mounting specimens in these two groups is usually (though not always) necessary to attain genus-level nomenclature, and sometimes even tribal level for midges. Because taxonomy is a major potential source of error in any kind of biological monitoring data set (Stribling et al., 2003, 2008a; Milberg et al., 2008; Bortolus, 2008), it is critical to define taxonomic expectations and to treat all samples consistently, both by a single taxonomist and among multiple taxonomists. This, in part, requires specifying both hierarchical targets and counting rules. An example list of taxonomic target levels is shown in Table 1. These target levels define the level of effort that should be applied to each specimen. If it is not possible to attain these levels for certain specimens due to, for example, the presence of early instars, damage, or poor slide mounts, the taxonomist provides a coarser-level identification. When a taxonomist receives samples for identification, depending upon the rigor of the sorting process (see above), the samples may contain specimens that either cannot be identified or are non-target taxa that should not be included in the sample.
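Applying hierarchical targets of the kind listed in Table 1 amounts to re-binning reported names; a minimal sketch (the mapping shown is a hypothetical subset, not the full table):

```python
# Hypothetical subset of a Table 1-style mapping: reported taxon -> target level
TARGET_LEVELS = {
    "Cricotopus": "Cricotopus/Orthocladius",
    "Orthocladius": "Cricotopus/Orthocladius",
    "Conchapelopia": "Thienemannimyia genus group",
    "Ceratopsyche": "Hydropsyche",
}

def collapse(taxa_counts):
    """Re-bin raw identifications to hierarchical target levels, summing counts."""
    out = {}
    for taxon, count in taxa_counts.items():
        target = TARGET_LEVELS.get(taxon, taxon)  # default: keep as reported
        out[target] = out.get(target, 0) + count
    return out

print(collapse({"Cricotopus": 12, "Orthocladius": 5, "Baetis": 40}))
```

Re-binning before metric calculation keeps richness metrics comparable across taxonomists who resolve these groups differently.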
The final screen of sample integrity is the responsibility of the taxonomist, who determines which specimens should remain unrecorded (for any of the reasons stated above). Beyond this, the principal responsibility of the taxonomist is to record and report the taxa in the sample and the number of individuals of each taxon. Programs should use the most current and accepted keys and nomenclature. *An Introduction to the Aquatic Insects of North America* (Merritt et al., 2008) is useful for identifying the majority of aquatic insects in North America to genus level. By their very nature, most taxonomic keys begin to become obsolete soon after publication, because taxonomists do not discontinue research once keys are available. Thus, it is often necessary to have access to, and be familiar with, ongoing research in different taxonomic groups. Other keys are also necessary for the non-insect benthic macroinvertebrates that will be encountered, such as Oligochaeta, Mollusca, Acari, Crustacea, Platyhelminthes, and others. Klemm et al. (1990) and Merritt et al. (2008) provide an exhaustive list of taxonomic literature for all major groups of freshwater benthic macroinvertebrates. Although it is not current for all taxa, the Integrated Taxonomic Information System (ITIS; http://www.itis.usda.gov/) has served as a clearinghouse for accepted nomenclature, including validity, authorship and spelling.

#### **4.3 Data entry**

Taxonomic nomenclature and counts are usually entered into the data management system directly from handwritten bench or field sheets. Depending on the system used, there may be an autocomplete function that helps prevent misspellings but can also introduce errors: entering the letters 'hydro' could autocomplete as either *Hydropsyche* or *Hydrophilus*, and a data entry technician working on autopilot might not notice. Increasingly, e-tablets are used to record field observations, and laboratory data are entered directly into spreadsheets, obviating the need for hardcopy paper backup.
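A simple cross-check of entered names against a master taxa list can catch slips of this kind; a sketch using Python's difflib (the master list and function name are illustrative):

```python
import difflib

MASTER_TAXA = ["Hydropsyche", "Hydrophilus", "Hydroptila", "Baetis"]

def check_entry(name, master=MASTER_TAXA):
    """Return the validated name, or None plus near matches for manual review."""
    if name in master:
        return name, []
    return None, difflib.get_close_matches(name, master, n=3, cutoff=0.6)

valid, _ = check_entry("Hydropsyche")
flagged, suggestions = check_entry("Hydropsyce")  # transcription slip
print(valid, flagged, suggestions)
```

Note that this only catches names absent from the master list; it cannot catch a valid but wrong autocomplete, which is why independent re-checks of a data subset remain necessary.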

## **4.4 Data reduction/indicator calculation**


There is a large number of potential metrics that monitoring programs can use (Barbour et al., 1999; Blocksom & Flotemersch, 2005; Flotemersch et al., 2006), and they require testing, calibration, and final selection before being appropriate for routine application. Blocksom & Flotemersch (2005) tested 42 metrics relative to different sampling methods, mesh sizes, and habitat types; some were based on taxonomic information, others on stressor tolerance, functional feeding group, and habit. Other workers and programs have tested additional metrics. For example, the US state of Montana calibrated a biological indicator for wadeable streams of the "mountains" site class (Montana DEQ 2006), resulting in a multimetric index comprised of seven metrics (Table 2).



This discussion assumes that the indicator terms have already been calibrated and selected, and deals specifically with their calculation. For this purpose, the raw data are taxa lists and counts; their conversion into metrics is data reduction, usually performed with computer spreadsheets or relational databases.

To ensure that database queries are correct and result in the intended metric values, a subset of values should be recalculated by hand; for example, one metric is recalculated for all samples, and all metrics are recalculated for one sample. When recalculated values differ from those in the matrix, the reasons for the disagreement are determined and corrections are made. Performance reports include the number of values recalculated as a percentage of the total, the number of errors found in the queries, and the specific corrective actions documented.
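The recalculation check can be automated once the hand-computed values are available; a minimal sketch (the metric names, example values, and tolerance are illustrative):

```python
def qc_check(stored, recalculated, tol=1e-6):
    """Compare stored metric values against independent recalculations."""
    errors = [k for k in stored
              if k not in recalculated or abs(stored[k] - recalculated[k]) > tol]
    return {"checked": len(stored), "errors": errors,
            "pct_error": 100.0 * len(errors) / len(stored)}

stored = {"pct_EPT": 60.0, "n_Ephemeroptera_taxa": 7, "HBI": 4.2}
recalc = {"pct_EPT": 60.0, "n_Ephemeroptera_taxa": 8, "HBI": 4.2}
report = qc_check(stored, recalc)
print(report)
```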

## **4.5 Indicator reporting**

Regardless of whether the indicator is based on a multimetric framework or a multivariate predictive model, the ultimate goal is to translate the quantitative, numeric result, the score, into some kind of narrative that provides the capacity for broad communication. The final assessment for a site is usually determined from the site score relative to the distribution of reference site scores, reflecting degrees of biological degradation; the more similar a test site is to reference, the less degradation it exhibits. Depending on the calibration process and how many condition categories are defined, narratives for individual sites can come from two categories (degraded, nondegraded), three (good, fair, poor), four (good, fair, poor, very poor), or five (very good, good, fair, poor, very poor). A program may also choose other frameworks, but the key is to have the individual categories quantitatively defined.
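Translating scores to narratives reduces to a threshold lookup; a minimal sketch (the breakpoints and the four-category narrative are hypothetical and would be calibrated per program from the reference-site distribution):

```python
import bisect

# Hypothetical breakpoints calibrated from the reference-site score distribution
THRESHOLDS = [25.0, 50.0, 75.0]
NARRATIVES = ["very poor", "poor", "fair", "good"]

def narrative(score):
    """Translate a numeric index score into a narrative condition category."""
    return NARRATIVES[bisect.bisect_right(THRESHOLDS, score)]

print(narrative(82.3), narrative(60.0), narrative(10.0))
```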


| **Metric** | **Description** |
|---|---|
| Number of Ephemeroptera taxa | Count of the number of distinct taxa of mayflies in sample |
| Number of Plecoptera taxa | Count of the number of distinct taxa of stoneflies in sample |
| % individuals as EPT | Percent of individuals in sample that is mayflies, stoneflies, or caddisflies (Ephemeroptera, Plecoptera, or Trichoptera, respectively) |
| % individuals as non-insects | Percent of individuals in sample as non-insects |
| % individuals as predators | Percent of individuals in sample as predators |
| % of taxa as burrowers | Percent of taxa in sample as burrower habit |
| Hilsenhoff Biotic Index | Abundance-weighted mean of stressor tolerance values for taxa in the sample |

Table 2. Sample-based metrics calculated for benthic macroinvertebrates. Shown are those developed and calibrated for streams in the "mountains" site class of the state of Montana, USA (Montana DEQ 2006, Stribling et al. 2008b).
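Metrics of the kind in Table 2 reduce to simple aggregations over the taxa-count records; a minimal sketch of two of them (the record layout and the example sample are assumptions):

```python
# Each record: (order, count, tolerance value); real data would carry full taxonomy
def pct_ept(records):
    """Percent of individuals as Ephemeroptera, Plecoptera, or Trichoptera."""
    ept = {"Ephemeroptera", "Plecoptera", "Trichoptera"}
    total = sum(n for _, n, _ in records)
    return 100.0 * sum(n for order, n, _ in records if order in ept) / total

def hilsenhoff(records):
    """Abundance-weighted mean of stressor tolerance values."""
    total = sum(n for _, n, _ in records)
    return sum(n * tv for _, n, tv in records) / total

sample = [("Ephemeroptera", 40, 4.0), ("Diptera", 40, 6.0), ("Plecoptera", 20, 1.0)]
print(pct_ept(sample), hilsenhoff(sample))
```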

## **5. Measurement quality objectives (MQO)**

For each step of the biological assessment process there are different performance characteristics that can be documented, some of which are quantitative and others that are qualitative (Table 3). Measurement quality objectives (MQO) are control points above (or below) which most observed values fall (Diamond et al., 2006; Flotemersch et al., 2006; Stribling et al., 2003, 2008a, b; Herbst & Silldorf, 2006), and are roughly analogous to the Shewhart (1939) concept of process control.


| Component method or activity | Precision | Accuracy | Bias | Representativeness | Completeness |
|---|---|---|---|---|---|
| 1. Field sampling | ● | na | ∆ | ∆ | ● |
| 2. Laboratory sorting/subsampling | ● | na | ● | ∆ | ● |
| 3. Taxonomy | ● | ● | ● | na | na |
| 4. Enumeration | ● | ∆ | na | ● | ● |
| 5. Data entry | ● | ● | na | na | na |
| 6. Data reduction (e.g., metric calculation) | na | ● | ∆ | na | na |
| 7. Site assessment and interpretation | ● | ● | ∆ | ∆ | ● |

Table 3. Error partitioning framework for biological assessments and biological assessment protocols for benthic macroinvertebrates. There may be additional activities and performance characteristics, and they may be quantitative (●), qualitative (∆) or not applicable (na).

Specific MQO should be selected based on the distribution of values attained, particularly the minima and maxima. Importantly, for environmental monitoring programs, special studies should never be the basis upon which a particular MQO is selected; rather, MQO should reflect performance expectations when *routine* techniques and monitoring personnel are used. Consider MQO that are established using data from the best field team, the taxonomist with the most years of experience, or dissolved oxygen measurements taken with the most expensive field probes: when those people or instruments are no longer available to the program, how useful would the database be to future or secondary users? Defensibility would potentially be diminished. Values that exceed an MQO are not automatically taken to be unacceptable data points; rather, such values are targeted for closer scrutiny to determine possible reasons for the exceedance, which might indicate a need for corrective actions (Stribling et al. 2003, Montana DEQ 2006). Simultaneously, they can be used to help quantify performance of the field teams in consistently applying the methods.
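Screening QC sample pairs against an MQO can be sketched as follows (the 15.0 RPD objective and the sample pairs are illustrative):

```python
def rpd(a, b):
    """Relative percent difference between two duplicate measures."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

def flag_exceedances(sample_pairs, mqo_rpd=15.0):
    """Return QC sample pairs whose RPD exceeds the MQO, for closer scrutiny."""
    return [pair for pair in sample_pairs if rpd(*pair) > mqo_rpd]

pairs = [(62.0, 58.0), (70.0, 40.0)]
print(flag_exceedances(pairs))  # only the second pair exceeds the objective
```

Flagged pairs are inputs to review, not automatic rejections, consistent with the discussion above.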

## **5.1 Field sampling**

Quantitative performance characteristics for field sampling are *precision* and *completeness* (Table 3). Repeat samples for purposes of calculating precision of field sampling are obtained by sampling two adjacent reaches, shown as 500 m in this example (Figure 3), and can be done by the same field team for intra-team precision, or by different teams for inter-team precision. For benthic macroinvertebrates, samples from the adjacent reaches (also called duplicate or quality control [QC] samples) must be laboratory-processed before data are available for precision calculations. Assuming acceptable laboratory error, these precision values are statements of the consistency with which the sampling protocols 1) characterized the biology of the stream or river and 2) were applied by the field team, and thus reflect a combination of natural variability and systematic error inherent in the dataset.

Fig. 3. Adjacent reaches (primary and repeat) for calculating precision estimates (Flotemersch et al. 2006).

The number of reaches for which repeat samples are taken varies, but a rule-of-thumb is 10%, randomly selected from the total number of sampling reaches constituting a sampling effort (whether yearly, programmatic routine, or individual project). Because they are the ultimate indicators used to address the question of ecological condition, the metric and index values are used to calculate different precision estimates. Root mean square error (RMSE) (formula 1), coefficient of variability (CV) (formula 2), and confidence intervals (formula 3) (Table 4) are calculated on multiple sample pairs, and are meaningful in that context. Documented values for field sampling precision (Table 5) demonstrate differences among individual metrics and the overall multimetric index (Montana MMI; mountain site class). Relative percent difference (RPD) (formula 4) (Table 4) can have meaning for individual sample pairs. For example, for the composite index, median RPD was 8.0 based on 40 sample pairs (Stribling et al., 2008b). MQO recommendations for routine field sampling in that biological monitoring program were a CV of 10% and a median RPD of 15.0. Sets of sample pairs with CV > 10% would be subjected to additional scrutiny to determine the cause of the increased variability; similarly, individual RPD values for sample pairs would be more specifically examined.

*Percent completeness* (formula 5; Tables 3, 4) is calculated to communicate the number of valid samples collected as a proportion of those originally planned. This value serves as one summary of data quality and demonstrates an aspect of confidence in the overall dataset.


Also called standard error of estimate, **root mean square error (RMSE)** is an estimate of the standard deviation of a population of observations and is calculated by:

$$RMSE = \sqrt{\frac{\sum\_{j=1}^{k}\sum\_{i=1}^{n\_j}\left(y\_{ij} - \overline{y}\_j\right)^2}{\sum df\_{1\dots k}}} \tag{1}$$

where *y<sub>ij</sub>* is the *i*th individual observation in group *j*, *j* = 1…*k* (Zar 1999). Lower values indicate better consistency, and are used in calculation of the **coefficient of variability (CV)**, a unit-less measure, by the formula:

$$\text{CV} = \frac{\text{RMSE}}{\overline{Y}} \times 100\tag{2}$$

where *Ȳ* is the mean of the dependent variable (e.g., metric or index value, across all sample pairs; Zar 1999). It is also known as relative standard deviation (RSD).

**Confidence intervals (CI)** (or detectable differences) indicate the magnitude by which 2 values must be separated before they can be considered different with statistical significance. A 90% confidence level is used here for the CI (i.e., the range around the observed value within which the true mean is likely to fall 90% of the time, corresponding to a 10% probability of type I error [α]). The 90% confidence interval (CI90) is calculated using RMSE by the formula:

$$CI\_{90} = \pm\left([RMSE][z\_{\alpha}]\right) \tag{3}$$

where *z<sub>α</sub>* is the *z*-value for 90% confidence (i.e., p = 0.10) with degrees of freedom set at infinity. In this analysis, *z<sub>α</sub>* = 1.64 (appendix 17 in Zar 1999). For CI95, the *z*-value would be 1.96. As the number of sample repeats increases, the CI becomes narrower; we provide CI that would be associated with 1, 2, and 3 samples per site.
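Formulas 1 through 3 can be computed directly from replicate-sample data; a minimal Python sketch (the function names and example sample pairs are illustrative, not from the chapter):

```python
import math

def rmse(groups):
    """Formula 1: pooled root mean square error across replicate groups."""
    ss = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df = sum(len(g) - 1 for g in groups)
    return math.sqrt(ss / df)

def cv(groups):
    """Formula 2: RMSE as a percentage of the grand mean."""
    values = [y for g in groups for y in g]
    return rmse(groups) / (sum(values) / len(values)) * 100.0

def ci90(groups):
    """Formula 3: 90% confidence interval half-width, z = 1.64."""
    return 1.64 * rmse(groups)

# three hypothetical duplicate sample pairs of an index score
pairs = [(10.0, 12.0), (20.0, 18.0), (15.0, 15.0)]
print(rmse(pairs), cv(pairs), ci90(pairs))
```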

**Relative percent difference (RPD)** is the proportional difference between 2 measures, and is calculated as:

$$RPD = \left(\frac{|A - B|}{(A + B)/2}\right) \times 100 \tag{4}$$

where A is the metric or index value of the 1st sample and B is the metric or index value of the 2nd sample (Keith, 1991; APHA, 2005; Smith, 2000). Lower RPD values indicate improved precision (as repeatability) over higher values.

**Percent completeness (%C)** is a measure of the number of valid samples that were obtained as a proportion of what was planned, and is calculated as:

$$\%C = \frac{v}{T} \times 100 \tag{5}$$

where *v* is the number of valid samples, and *T* is the total number of planned samples (Flotemersch et al., 2006).
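Formulas 4 and 5 in code form (a minimal sketch; the example values are illustrative):

```python
def rpd(a, b):
    """Formula 4: relative percent difference between duplicate measures."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

def pct_completeness(valid, planned):
    """Formula 5: valid samples as a percentage of planned samples."""
    return 100.0 * valid / planned

print(rpd(42.0, 38.0), pct_completeness(38, 40))
```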

**Percent sorting efficiency (PSE)** describes how well a sample sorter has done in finding and removing all specimens from isolated sample material, and is calculated as:

$$PSE = \frac{A}{A+B} \times 100\tag{6}$$

where *A* is the number of organisms found by the original sorter, and *B* is the number of missed organisms recovered (specimen recoveries) by the QC laboratory sort checker.
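A sketch of formula 6, using sample 1 of Table 6 as a check:

```python
def pse(found: int, recovered: int) -> float:
    """Percent sorting efficiency (formula 6): A = specimens found by the
    original sorter, B = missed specimens recovered by the QC sort checker."""
    return 100 * found / (found + recovered)

# Sample 1 of Table 6: 208 specimens found, 5 recovered on re-check:
assert round(pse(208, 5), 1) == 97.7
```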

**Percent taxonomic disagreement (PTD)** quantifies the sample-based precision of taxonomic identifications by comparing target level taxonomic results from two independent taxonomists, using the formula:

$$PTD = \left[1 - \left(\frac{a}{N}\right)\right] \times 100\tag{7}$$

where *a* is the number of agreements, and *N* is the total number of organisms in the larger of the two counts (Stribling et al., 2003, 2008a).

**Percent difference in enumeration (PDE)** quantifies the consistency of specimen counts in samples, and is determined by calculating a comparison of results from two independent laboratories or taxonomists using the formula:

$$PDE = \frac{\left| n_1 - n_2 \right|}{n_1 + n_2} \times 100 \tag{8}$$

where *n1* is the number of organisms in a sample counted by the first laboratory, and *n2*, the second (Stribling et al. 2003).
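Formulas 7 and 8 sketched together, using sample 1 of Table 7 as a check:

```python
def ptd(matches: int, n_larger: int) -> float:
    """Percent taxonomic disagreement (formula 7): a = agreements, N = total
    organisms in the larger of the two counts."""
    return (1 - matches / n_larger) * 100

def pde(n1: int, n2: int) -> float:
    """Percent difference in enumeration (formula 8): total counts reported
    by two independent taxonomists."""
    return 100 * abs(n1 - n2) / (n1 + n2)

# Sample 1 of Table 7: T1 counted 243 organisms, T2 counted 244, 232 matched:
assert round(pde(243, 244), 1) == 0.2
assert round(ptd(232, 244), 1) == 4.9
```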

**Percent taxonomic completeness (PTC)** describes the proportion of specimens in a sample that meet the target identification level (Stribling et al. 2008) and is calculated as:

$$PTC = \frac{x}{N} \times 100 \tag{9}$$

where *x* is the number of individuals in a sample for which the identification meets the target level, and *N* is the total number of individuals in the sample.

**Discrimination efficiency (DE)** is an estimate of the accuracy of multimetric indexes and individual metrics, characterized as their capacity to correctly identify stressor conditions (physical, chemical, hydrologic, and land use/land cover). It is quantified using the formula:

$$DE = \frac{a}{b} \times 100\tag{10}$$

where a is the number of *a priori* stressor sites identified as being below the quantified biological impairment threshold of the reference distribution (25th percentile, 10th, or other), and b is the total number of stressor sites (Flotemersch et al., 2006).
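Formula 10 requires first fixing an impairment threshold from the reference-site distribution; a sketch (the nearest-rank percentile method and all scores below are illustrative assumptions, not values from the text):

```python
def nearest_rank_percentile(scores, p):
    """Simple nearest-rank percentile (programs may use other interpolation methods)."""
    s = sorted(scores)
    return s[round(p / 100 * (len(s) - 1))]

def discrimination_efficiency(stressor_scores, threshold):
    """Formula 10: a = a priori stressor sites scoring below the impairment
    threshold, b = total number of stressor sites."""
    a = sum(1 for x in stressor_scores if x < threshold)
    return 100 * a / len(stressor_scores)

reference = [62, 70, 74, 75, 78, 80, 82, 85, 88, 90, 91, 94]  # hypothetical index scores
threshold = nearest_rank_percentile(reference, 25)            # 25th percentile of reference sites
de = discrimination_efficiency([55, 60, 72, 76, 81], threshold)
```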

Table 4. Explanations and formulas for quantifying 10 different performance characteristics for different steps of the biological assessment process.

Qualitative performance characteristics for field sampling are *bias* and *representativeness*  (Table 3). Programs that use multihabitat sampling, either transect-based similar to that used by the US national surveys (USEPA 2004a), or distributing sampling effort among different habitat types (Barbour et al., 1999, 2006), are attempting to minimize the bias


through two components of the field method. First, the approaches are not limited to one or a few habitat types; they are focused on sampling stable undercut banks, macrophyte beds, root wads, snags, gravel, sand, and/or cobble. Second, allocation of the sampling effort is distributed throughout the entire reach, thus preventing the entire sample from being taken in a shortened portion of the reach. Further, if the predominant habitat in a sample reach is poor or degraded, that habitat would be sampled as well. These field sampling methods are intended to depict the benthic macroinvertebrate assemblage that the physical habitat in the streams and rivers has the capacity to support. Another note about representativeness is to be cognizant that, while a method might effectively depict the property it is intended to depict (Flotemersch et al., 2006), it could be interpreted differently at different spatial scales (Figure 4).


| Metric | RMSE | Mean | CV | CI90, 1 sample | CI90, 2 samples | CI90, 3 samples |
|---|---|---|---|---|---|---|
| Number of Ephemeroptera taxa | 0.94 | 5.25 | 17.9 | 1.55 | 1.1 | 0.89 |
| Number of Plecoptera taxa | 0.9 | 2.42 | 37.3 | 1.48 | 1.05 | 0.85 |
| % individuals as EPT | 8.86 | 47.98 | 18.5 | 14.53 | 10.27 | 8.39 |
| % individuals as non-insects | 3 | 7.3 | 41.1 | 4.93 | 3.49 | 2.85 |
| % individuals as predators | 5.32 | 16.91 | 31.4 | 8.72 | 6.17 | 5.03 |
| % of taxa as burrowers | 3.93 | 28.91 | 13.6 | 6.45 | 4.56 | 3.72 |
| Hilsenhoff Biotic Index | 0.47 | 4.27 | 10.9 | 0.76 | 0.54 | 0.44 |
| Multimetric index (7-metric composite) | 3.80 | 55.6 | 6.8 | 6.23 | 4.41 | 3.60 |

Table 5. Precision estimates for sample-based benthic macroinvertebrate metrics, and composite multimetric index (Stribling et al., 2008b). Data shown are from the US state of Montana, and performance calculations are based on 40 sample pairs from the "mountain" site class (*abbreviations*: RMSE, root mean square error; CV, coefficient of variation; CI90, 90 percent confidence interval; EPT, Ephemeroptera, Plecoptera, Trichoptera).

Fig. 4. Defining representativeness of a sample or datum first requires specifying the spatial and/or temporal scale of the feature it is intended to depict.



| Sample no. | Specimens, original | Specimens, recovered | Specimens, total | PSE |
|---|---|---|---|---|
| 1 | 208 | 5 | 213 | 97.7 |
| 2 | 202 | 8 | 210 | 96.2 |
| 3 | 227 | 1 | 228 | 99.6 |
| 4 | 200 | 12 | 212 | 94.3 |
| 5 | 208 | 7 | 215 | 96.7 |
| 6 | 222 | 2 | 224 | 99.1 |
| 7 | 220 | 24 | 244 | 90.2 |
| 8 | 21 | 6 | 27 | 77.8<sup>a</sup> |
| 9 | 215 | 22 | 237 | 90.7 |
| 10 | 220 | 25 | 245 | 89.8<sup>b</sup> |
| 11 | 220 | 3 | 223 | 98.7 |
| 12 | 211 | 24 | 235 | 89.8<sup>b</sup> |
| 13 | 205 | 12 | 217 | 94.5 |
| 14 | 213 | 24 | 237 | 89.9<sup>b</sup> |
| 15 | 205 | 11 | 216 | 94.9 |
| 16 | 222 | 15 | 237 | 93.7 |
| 17 | 203 | 10 | 213 | 95.3 |
| 18 | 158 | 16 | 174 | 90.8 |

<sup>a</sup> Low PSE is due to the small total number of specimens in the sample (n=27); this sample was also whole-picked (all 30 grid squares). <sup>b</sup> PSE values taken as passing, only ≤0.2 percentage points below the MQO.

Table 6. Percent sorting efficiency (PSE) as laboratory sorting/subsample quality control check. Results from 2006-2008 sampling for a routine monitoring program in north Georgia, USA.

*Accuracy* is considered "not applicable" to field sampling (Table 3), because efforts to define analytical truth would necessitate a sampling effort excessive beyond any practicality. That is, the analytical truth would be all benthic macroinvertebrates that exist in the river (shore zone to 1-m depth). There is no sampling approach that will collect all individual benthic macroinvertebrate organisms.

### **5.2 Sorting/subsampling**

*Bias*, *precision*, and, in part, *completeness,* are quantitative characteristics of performance for laboratory sorting and subsampling (Table 3). Bias is the most critical performance characteristic of the sorting process, and is evaluated by checking for specimens that may have been overlooked or otherwise missed by the primary sorter (Flotemersch et al., 2006). Checking of the sort residue is performed by an independent sort checker in a separate laboratory using the same procedures as the primary sorter, specifically, the same magnification and lighting as called for in the SOP. The number of specimens found by the checker as a proportion of the total number of originally found specimens is the percent sorting efficiency (PSE; formula 6) (Table 4), and quantifies sorting bias. This exercise is performed on a randomly-selected subset of sort residues (generally 10% of the total sample lot), the selection of which is stratified by individual sorters, by projects, or by programs. As a rule-of-thumb, an MQO could be "less than 10% of all samples checked will have a PSE ≤90%". Table 6 shows PSE results from sort rechecks for a project within the state of Georgia (US). One sample (no. 8) exhibited a substantial failure with a PSE of 77.8, which became an immediate flag for a potential problem. Further evaluation of the results showed that the sample was fully sorted (100%), and still only 21 specimens were found by the original sorter, prior to the 6 recoveries by the re-check. Values for PSE become skewed when overall numbers are low, thus failure of this one sample did not indicate systematic error (bias) in the sorting process. Three additional samples fell slightly below the 90% MQO, but were only ≤0.2 percentage points low and were judged as passing by the QC analyst.
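The rule-of-thumb MQO above can be sketched as a check over a lot of PSE results (an illustration only; the parameter names, defaults, and the 0.2-point analyst tolerance described in the text are assumptions, not a published standard):

```python
def sorting_mqo_met(pse_values, floor=90.0, tolerance=0.2, max_fail_fraction=0.10):
    """True if fewer than 10% of checked samples fall below a PSE of 90,
    treating values within 0.2 percentage points of the floor as passing."""
    failures = [p for p in pse_values if round(floor - p, 1) > tolerance]
    return len(failures) / len(pse_values) < max_fail_fraction

# PSE values from Table 6: only sample 8 (77.8) counts as a true failure,
# so 1 of 18 samples fails and the lot passes the rule-of-thumb MQO.
table6_pse = [97.7, 96.2, 99.6, 94.3, 96.7, 99.1, 90.2, 77.8, 90.7, 89.8,
              98.7, 89.8, 94.5, 89.9, 94.9, 93.7, 95.3, 90.8]
assert sorting_mqo_met(table6_pse)
```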

Precision of laboratory sorting is calculated by use of RPD with metrics and indexes as the input variables (Table 4). If, for example, the targeted subsample size is 200 organisms, and that size subsample is drawn twice from a sorting tray without re-mixing or re-spreading, metrics can be calculated from the two separate subsamples. RPD would be an indication of how well the sample was mixed and spread in the tray; the "serial subsampling" and RPD calculations should be done on two timeframes. First, these calculations should be done, and the results documented and reported to demonstrate what the laboratory (or individual sorter) is capable of in application of the subsampling method. Second, they should be done periodically to demonstrate that the program routinely continues to meet that level of precision. Representativeness of the sorting/subsampling process is addressed as part of the SOP that requires random selection of grid squares (Flotemersch et al., 2006) with complete sorting, until the target number is reached within the final grid. Percent completeness for subsampling is calculated as the proportion of samples with the target subsample size (±20%) in the rough sort. Considered as "not applicable", estimates of *accuracy* are not necessary for characterizing sorting performance.

#### **5.3 Taxonomic precision (sample-based)**

*Precision* and *completeness* are quantitative performance characteristics that are used for taxonomy (Table 3). Precision of taxonomic identifications is calculated using percent taxonomic


disagreement (PTD) and percent difference in enumeration (PDE), both of which rely on the raw data (list of taxa and number of individuals) from whole-sample re-identifications (Stribling et al., 2003, 2008a). These two values are evaluated individually, and are used to indicate the overall quality of the taxonomic data. They can also be used to help identify the source of a problem. Percent taxonomic completeness (PTC) is calculated to document how consistently the taxonomist is able to attain the targeted taxonomic levels as specified in the SOP. It is important to note that the purpose of this evaluation approach is not to say that one taxonomist is correct over the other, but rather to make an effort to understand what is causing differences where they exist. The primary taxonomy is completed by one or more project taxonomists (T1); the re-identifications are completed as blind samples by one or more secondary, or QC taxonomists (T2) in a separate independent laboratory.

The number of samples for which this analysis is performed will vary, but 10% of the total sample lot (project, program, year, or other) is an acceptable rule-of-thumb. Exceptions are that large programs (>~500 samples) may not need to do >50 samples; small programs (<~30 samples) will likely still need to do at least 3 samples. In actuality, the number of re-identified samples will be program-specific and influenced by multiple factors, such as how many taxonomists are doing the primary identification (there may be an interest in having 10% of the samples from each taxonomist re-identified), and how confident the ultimate data user is with the results. Mean values across all re-identified samples are estimates of taxonomic precision (consistency) for a dataset or a program.

#### **5.3.1 Percent taxonomic disagreement (PTD)**

The sample-based error rate for taxonomic identifications is quantified by calculation of percent taxonomic disagreement (PTD) (Table 4, formula 7). The key exercise performed by the QC analyst is determining the number of matches, or shared identifications, between the two taxonomists (Table 7). Matches must be exact; a comparison is negative if the difference is *only* hierarchical (genus vs. family, or other), if the two taxonomists assigned different names, or if specimens are missing from the overall results of either T1 or T2. *Error typing* individual sample comparisons is the process of determining differences as either: a) straight disagreements, b) hierarchical differences, or c) missing specimens. While tedious, this QC exercise provides information that is extremely valuable in formulating corrective actions. An MQO of 15% has been found to be attainable by most programs, and is used for the USEPA national surveys. As testing continues and laboratories and taxonomists become more accustomed to the procedure, it is becoming apparent that the national standard could eventually be set at 10%. A standard summary report for taxonomic identification QC (Table 8) can be effectively communicated to data users.
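The match-counting step can be sketched as follows (the taxon lists are hypothetical; only exact name matches count, so hierarchical differences and missing specimens both reduce the score, as described above):

```python
from collections import Counter

def count_matches(t1, t2):
    """Number of shared identifications between two per-specimen lists:
    the sum, over taxa named by both taxonomists, of the smaller count."""
    c1, c2 = Counter(t1), Counter(t2)
    return sum(min(c1[t], c2[t]) for t in c1.keys() & c2.keys())

t1 = ["Baetis", "Baetis", "Hydropsyche", "Chironomidae"]                # primary taxonomist (T1)
t2 = ["Baetis", "Baetidae", "Hydropsyche", "Chironomidae", "Simulium"]  # QC taxonomist (T2)
matches = count_matches(t1, t2)  # 3: one Baetis, Hydropsyche, Chironomidae
ptd = round((1 - matches / max(len(t1), len(t2))) * 100, 1)  # formula 7
```

Here the hierarchical difference (Baetis vs. Baetidae) and the specimen missing from T1's list (Simulium) both register as disagreements.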

#### **5.3.2 Percent difference in enumeration (PDE)**

Another summary data quality indicator for performance in taxonomic identification is comparison of the total number of organisms counted and reported in the sample by the two taxonomists (not the sorters). There is some redundancy of this measure with PTD, but it has proven useful in helping highlight coarse differences immediately, and is calculated as percent difference in enumeration (PDE) (Table 4, formula 8). While sorters may be well-trained, experienced, and have substantial internal QC oversight, they may not always be able to determine identifiability, the final decision of which is the responsibility of the taxonomist. It is rare to find exact agreement on sample counts between two taxonomists but the differences are usually minimal, hence the low recommended MQO of 5%. When PDE > 5, reasons are usually fairly obvious, and the QC analyst can turn attention directly to the error source to determine if it may be systematic, and the nature and necessity of corrective action(s).

#### **5.3.3 Percent taxonomic completeness (PTC)**

Percent taxonomic completeness (PTC) (Table 4, formula 9) quantifies the proportion of individuals in a sample that are identified to the specified target taxonomic level (Table 1). Low values can be interpreted in a number of ways: many individuals in a sample may be damaged or early instar, diagnostic characters may be missing (such as gills, legs, or antennae), or the taxonomist may be inexperienced or unfamiliar with the particular taxon. MQO have not been used for this characteristic, but barring an excessively damaged sample, it is not uncommon to see PTC in excess of 97 or 98. For purposes of QC, it is more important for the absolute difference (abs diff) of PTC between T1 and T2 to be a low number, as documentation of consistency of effort; values of 5-6% or below are typical.



| Sample no. | T1 count | T2 count | No. matches | PDE | PTD | T1 target | T1 PTC | T2 target | T2 PTC | Abs diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 243 | 244 | 232 | 0.2 | 4.9 | 234 | 96.3 | 223 | 91.4 | 4.9 |
| 2 | 227 | 223 | 204 | 0.9 | 10.1 | 205 | 90.3 | 194 | 87.0 | 3.3 |
| 3 | 214 | 213 | 191 | 0.2 | 10.7 | 202 | 94.4 | 199 | 93.4 | 1.0 |
| 4 | 221 | 223 | 207 | 0.5 | 7.2 | 212 | 95.9 | 208 | 93.3 | 2.6 |
| 5 | 216 | 214 | 202 | 0.5 | 6.5 | 207 | 95.8 | 201 | 93.9 | 1.9 |
| 6 | 216 | 216 | 214 | 0 | 0.9 | 209 | 96.8 | 208 | 96.3 | 0.5 |
| 7 | 86 | 83 | 69 | 1.8 | 19.8 | 77 | 89.5 | 64 | 77.1 | 12.4 |
| 8 | 206 | 201 | 194 | 1.2 | 5.8 | 204 | 99 | 187 | 93.0 | 6.0 |
| 9 | 208 | 210 | 196 | 0.5 | 6.7 | 203 | 97.6 | 195 | 92.9 | 4.7 |
| 10 | 192 | 195 | 180 | 0.8 | 7.7 | 182 | 94.8 | 172 | 88.2 | 6.6 |

Table 7. Summary table for sample by sample taxonomic comparison results, from routine biological monitoring in the US state of Mississippi. T1 and T2 are the primary and QC taxonomists, respectively. "No. matches" is the number of individual specimens counted and given the same identity by each taxonomist; PDE, PTD, and PTC are explained in the text. Target level is the number and percentage of specimens identified to the SOP-specified level of effort (see Table 3 as an example); "Abs diff" is the absolute difference between the PTC of T1 and T2.


| Summary item | Value |
|---|---|
| **A. Number of samples in lot** | 97 |
| **B. Number of samples used for taxonomic comparison** | 10 |
| **C. Percent of sample lot** | 10.3% |
| **D. Percent taxonomic disagreement (PTD)** | |
| 1. MQO | 15 |
| 2. No. samples exceeding | 1 |
| 3. Average | 7.9 |
| 4. Standard deviation | 4.9 |
| **E. Percent difference in enumeration (PDE)** | |
| 1. MQO | 5 |
| 2. No. samples exceeding | 0 |
| 3. Average | 0.6 |
| 4. Standard deviation | 0.6 |
| **F. Percent taxonomic completeness (PTC, absolute difference)** | |
| 1. MQO | none |
| 2. Average | 4.3 |
| 3. Standard deviation | 3.5 |

Table 8. Taxonomic comparison results from a bioassessment project in the US state of Mississippi.

## **5.4 Taxonomic accuracy (taxon-based)**

*Accuracy* and *bias* (the inverse of accuracy) are quantitative performance characteristics for taxonomy (Table 3). Accuracy requires specification of an analytical truth, and for taxonomy that is 1) the museum-based type specimen (holotype, or other form of type specimen), 2) specimen(s) verified by recognized expert(s) in that particular taxon, or 3) unique morphological characteristics specified in dichotomous identification keys. Determination of accuracy is considered "not applicable" for production taxonomy (most often used in routine monitoring programs) because that kind of taxonomy is focused on characterizing the sample; taxonomic accuracy, by definition, would be focused on individual specimens. Bias in taxonomy can result from use of obsolete nomenclature and keys, imperfect understanding of morphological characteristics, inadequate optical equipment, or poor training. Neither of these performance characteristics is considered necessary for production taxonomy, in that they are largely covered by the estimates of precision and completeness. For example, although it is possible that two taxonomists would put an incorrect name on an organism, the probability that they would put the *same incorrect name* on that organism is considered low.
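Taxonomic precision is typically estimated by having a second taxonomist independently re-identify the same specimens. A minimal sketch of one such comparison follows; the `ptd` helper, the specimen names, and the exact formula are illustrative assumptions (the chapter's Table 4 formulas are not reproduced in this excerpt; cf. Stribling et al. 2003 for percent taxonomic disagreement):

```python
# Hypothetical illustration of taxonomic precision: percent taxonomic
# disagreement (PTD) between two independent identifications of the
# same specimens. Formula and names are assumptions, not from Table 4.

def ptd(ids_primary, ids_qc):
    """PTD = (1 - comp_pos / N) * 100, where comp_pos is the number of
    specimen-level agreements and N the larger of the two counts."""
    n = max(len(ids_primary), len(ids_qc))
    comp_pos = sum(1 for a, b in zip(ids_primary, ids_qc) if a == b)
    return (1.0 - comp_pos / n) * 100.0

# Two taxonomists identify the same 10 specimens:
t1 = ["Baetis", "Baetis", "Caenis", "Hydropsyche", "Chironomus",
      "Baetis", "Caenis", "Simulium", "Hydropsyche", "Chironomus"]
t2 = ["Baetis", "Baetis", "Caenis", "Cheumatopsyche", "Chironomus",
      "Baetis", "Caenis", "Simulium", "Hydropsyche", "Chironomus"]
print(round(ptd(t1, t2), 1))  # one disagreement in 10 -> 10.0
```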

#### **5.5 Data entry accuracy**

Data entry errors (including the one mentioned in Section 4.3) can be recognized and corrected through either of two methods for assuring data entry accuracy; it is not necessary to do both. The first is double entry of all data by two separate individuals, followed by a direct match between the two databases; where there are differences, it is determined which database is in error, and corrections are made. The second is a 100% comparison of all entered data against the handwritten data sheets, performed by someone other than the primary data entry person. When errors are found, they are hand-edited for documentation, and corrections are made electronically. The rates of data entry errors are recorded and segregated by data type (e.g., fish, benthic macroinvertebrates, periphyton, header information, latitude and longitude, physical habitat, and water chemistry). Entering data directly into field e-tablets or laboratory computers raises a potential issue: because there is no paper backup, these QC checks of data entry are not possible.
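The double-entry check described above can be sketched as a direct match between two independently keyed copies of the same records; the record layout and helper names here are hypothetical, not from the chapter:

```python
# Sketch of the double-entry QC check: two people key the same field
# sheets into separate tables; mismatches are flagged for resolution
# against the original sheet, and an error rate is recorded.

def compare_entries(db1, db2):
    """Return (mismatched keys, error rate) for two keyed copies of the
    same records, each a dict mapping (sheet_id, field) -> value."""
    keys = set(db1) | set(db2)
    mismatches = sorted(k for k in keys if db1.get(k) != db2.get(k))
    return mismatches, len(mismatches) / len(keys)

entry_a = {("S01", "taxon"): "Baetis", ("S01", "count"): "12",
           ("S02", "taxon"): "Caenis", ("S02", "count"): "7"}
entry_b = {("S01", "taxon"): "Baetis", ("S01", "count"): "12",
           ("S02", "taxon"): "Caenis", ("S02", "count"): "17"}  # keying slip

bad, rate = compare_entries(entry_a, entry_b)
print(bad)            # flag ('S02', 'count') for resolution
print(f"{rate:.0%}")  # share of keyed values that differ
```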

#### **5.6 Site assessment and interpretation**

Quantitative performance characteristics for site assessment and interpretation are *precision*, *accuracy*, and *completeness* (Table 3). Site assessment precision is based on the narrative assessments derived from the associated index scores (good, fair, poor) for reach duplicates, and quantifies the percentage of duplicate samples receiving the same narrative assessment. These comparisons are done for a randomly selected 10% of the total sample lot. Table 9 shows this direct comparison: for this dataset, 79% of the replicates returned assessments of the same category (23 of 29), 17% were one category different (5 of 29), and 3% were two categories different (1 of 29). Assessment accuracy is expressed using discrimination efficiency (DE) (formula 10; Table 4), a value developed during the index calibration process. It relies upon, first, specifying magnitudes of physical, chemical, and/or hydrologic stressors that are unacceptable, and then identifying the sites exhibiting those excessive stressor characteristics; the set of sites exhibiting unacceptable stressor levels constitutes the analytical truth. DE is the proportion of samples for which the biological index correctly identifies sites as impaired. This performance characteristic is directly suitable for expressing how well an indicator does what it is designed to do (detect stressor conditions), but it is not suitable for routine QC analyses. Percent completeness (%C) is the proportion of sites (of the total planned) for which valid final assessments were obtained.
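A minimal sketch of the two quantitative measures described above, assuming DE is computed as the share of truth-impaired sites the index also flags (formula 10 itself is not reproduced in this excerpt) and %C as valid versus planned assessments; the site labels are invented:

```python
# Hedged sketch: discrimination efficiency (DE) and percent
# completeness (%C) as described in the text. The acceptance of
# stressor-defined "truth" sites is assumed to be given.

def discrimination_efficiency(stressor_impaired, index_impaired):
    """Of the sites independently judged impaired from stressor data
    (the analytical truth), the percentage the index also flags."""
    truth = set(stressor_impaired)
    hits = truth & set(index_impaired)
    return 100.0 * len(hits) / len(truth)

def percent_completeness(valid_sites, planned_sites):
    """%C: valid final assessments obtained vs. total planned."""
    return 100.0 * valid_sites / planned_sites

de = discrimination_efficiency({"A", "B", "D", "Q"}, {"A", "D", "Q", "S"})
print(de)  # 3 of the 4 truth sites detected -> 75.0
print(round(percent_completeness(29, 30), 1))
```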

## **6. Maintenance of data quality**


The purpose of QC is to identify assignable causes of variation (error) so that the quality of the outcomes of future processes can be made, on average, less variable (Shewhart, 1939). To reduce error rates, it is critical first to know that error exists, and second, to know its causes. Once the causes are known, corrective actions can be designed to reduce or eliminate them. The procedures described in this chapter for gathering information that allows performance and data quality characteristics to be documented need to become a routine part of biological monitoring programs. If they are used only when "conditions are right", as part of special studies, or when there are additional resources, they are not serving their purpose and could ultimately be counter-productive. That counter-productivity arises when monitoring staff begin to view QC samples and analyses as less-than-routine activities and strive to do their best only when they know they are being tested. This perspective leads programs to work toward meeting a number, such as 15%, rather than using the information to maintain or improve data quality.


| Site | Replicate 1 narrative assessment | Category | Replicate 2 narrative assessment | Category | Categorical difference |
|------|----------------------------------|----------|----------------------------------|----------|------------------------|
| A | Poor | 3 | Poor | 3 | 0 |
| B | Poor | 3 | Poor | 3 | 0 |
| C | Good | 1 | Good | 1 | 0 |
| D | Poor | 3 | Very Poor | 4 | 1 |
| E | Fair | 2 | Fair | 2 | 0 |
| F | Poor | 3 | Fair | 2 | 1 |
| G | Poor | 3 | Poor | 3 | 0 |
| H | Very Poor | 4 | Very Poor | 4 | 0 |
| I | Very Poor | 4 | Very Poor | 4 | 0 |
| J | Poor | 3 | Poor | 3 | 0 |
| K | Poor | 3 | Poor | 3 | 0 |
| L | Very Poor | 4 | Very Poor | 4 | 0 |
| M | Very Poor | 4 | Very Poor | 4 | 0 |
| N | Poor | 3 | Fair | 2 | 1 |
| O | Poor | 3 | Poor | 3 | 0 |
| P | Poor | 3 | Poor | 3 | 0 |
| Q | Poor | 3 | Very Poor | 4 | 1 |
| R | Poor | 3 | Poor | 3 | 0 |
| S | Fair | 2 | Very Poor | 4 | 2 |
| T | Fair | 2 | Fair | 2 | 0 |
| U | Good | 1 | Good | 1 | 0 |
| V | Poor | 3 | Fair | 2 | 1 |
| W | Fair | 2 | Fair | 2 | 0 |
| X | Poor | 3 | Poor | 3 | 0 |
| Y | Poor | 3 | Poor | 3 | 0 |
| Z | Very Poor | 4 | Very Poor | 4 | 0 |
| AA | Poor | 3 | Poor | 3 | 0 |
| BB | Fair | 2 | Fair | 2 | 0 |
| CC | Poor | 1 | Poor | 1 | 0 |

Table 9. Assessment results shown for sample pairs taken from 29 sites, each pair representing two adjacent (back-to-back) reaches (see Fig. 4). Assessment categories are 1 – good, 2 – fair, 3 – poor, and 4 – very poor.
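The summary percentages quoted in the text can be reproduced directly from the category pairs in Table 9; this is a sketch, not code from the chapter:

```python
# Recompute the Table 9 summary: percentage of replicate pairs whose
# narrative assessment categories agree, differ by 1, or differ by 2.

from collections import Counter

# (replicate 1 category, replicate 2 category) for the 29 site pairs A-CC
pairs = [(3, 3), (3, 3), (1, 1), (3, 4), (2, 2), (3, 2), (3, 3), (4, 4),
         (4, 4), (3, 3), (3, 3), (4, 4), (4, 4), (3, 2), (3, 3), (3, 3),
         (3, 4), (3, 3), (2, 4), (2, 2), (1, 1), (3, 2), (2, 2), (3, 3),
         (3, 3), (4, 4), (3, 3), (2, 2), (1, 1)]

diffs = Counter(abs(r1 - r2) for r1, r2 in pairs)
for d in sorted(diffs):
    print(f"{d} categories apart: {diffs[d]}/29 = {100 * diffs[d] / 29:.0f}%")
```

Running this yields 23/29 (79%) in agreement, 5/29 (17%) one category apart, and 1/29 (3%) two categories apart, matching the text.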



| Performance characteristic | MQO |
|----------------------------|-----|
| Field sampling precision (multimetric index) | CI90 ≤ 15 index points, on a 100-point scale |
| Field sampling precision (multimetric index) | RPD < 15 |
| Field sampling precision | CV < 10%, for a sampling event |
| Sorting/subsampling accuracy | PSE ≥ 90, for ≥ 90% of externally QC'd sort residues |
| Taxonomic precision | Median PTD ≤ 15% for overall sample lot (field season, watershed, or other strata); samples with PTD ≥ 15% examined for patterns of error |
| Taxonomic precision | Median PDE ≤ 5%; samples with PDE ≥ 5% should be further examined for patterns of error |
| Taxonomic completeness | Median PTC ≥ 90%; samples with PTC ≤ 90% should be examined and those taxa not meeting targets isolated; mAbs diff ≤ 5% |
| Field sampling completeness | Completeness > 98% |

Table 10. Key measurement quality objectives (MQO) that could be used to track maintenance of data quality at acceptable levels.
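As an illustration of applying such MQOs, the sketch below screens a fabricated lot of QC samples against the PTD and PDE thresholds from Table 10; the data values and variable names are assumptions for illustration only:

```python
# Screen a sample lot against two of the Table 10 MQOs: median PTD <= 15%
# and median PDE <= 5%; individual samples over threshold are flagged
# for examination of error patterns. Data here are invented.

from statistics import median

ptd_values = [8.2, 11.5, 6.9, 14.0, 17.3, 9.8]  # % taxonomic disagreement per QC sample
pde_values = [1.2, 0.8, 4.1, 2.6, 5.9, 1.7]     # % difference in enumeration

checks = {
    "Taxonomic precision (PTD)": median(ptd_values) <= 15,
    "Taxonomic precision (PDE)": median(pde_values) <= 5,
}
flagged = [i for i, v in enumerate(ptd_values) if v >= 15]  # examine for error patterns

for name, ok in checks.items():
    print(name, "meets MQO" if ok else "fails MQO")
print("samples to examine:", flagged)
```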

Key to maintaining data quality at known and acceptable levels is establishing performance standards based on MQO. Qualitative standards, such as some of the representativeness and accuracy factors (Table 3), can be evaluated by comparing SOP and SOP application to the goals and objectives of the monitoring program. However, a clear statement of data quality expectations, such as that shown in Table 10, will help to ensure consistent success in implementing the procedures. As a program becomes more proficient and consistent in meeting the standards, efforts could be undertaken to "tighten up" the standards. With this come necessary budgetary considerations; better precision can always be attained, but often at elevated cost.

## **7. Comparability analysis and acceptable data quality**

All discussion to this point has been directed toward documenting the data quality associated with monitoring programs, hopefully with sufficient emphasis that no data are inherently right or wrong, only acceptable or not. If data are acceptable for a decision (for example, in the context of biological assessment and monitoring), a defensible statement on the ecological condition of a site or an ecological system can be made. If they are not acceptable for supporting that decision, the decision not to use the data should likewise be defensible. Routine documentation and reporting of data quality within a monitoring program provide a statement of intra-programmatic consistency, that is, sample-to-sample comparability even when samples are collected at different temporal or spatial scales. If there is an interest in or need to combine datasets from different programs (Figure 5), it is imperative that routinely documented performance characteristics be available for each. Lacking them, data users, whether scientists, policy-makers, or the public, cannot determine acceptability for decision making.

[Fig. 5 flowchart: dataset or protocol A (performance characteristics, Apc) and dataset or protocol B (performance characteristics, Bpc); compare Apc to Bpc; *acceptable data quality?* If YES, combine into a single dataset and proceed with analyses; if NO, exclude from analyses, or adjust with corrective approaches.]

Fig. 5. Framework for analysis of comparability between or among monitoring datasets or protocols.
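The Fig. 5 framework can be sketched as a simple decision rule; the tolerance-based acceptance criterion used here is an assumption for illustration, not the chapter's method:

```python
# Minimal sketch of the Fig. 5 decision framework: compare documented
# performance characteristics of two datasets/protocols (Apc vs. Bpc)
# and decide whether they may be combined. The acceptance rule (each
# shared characteristic within a stated tolerance) is an assumption.

def comparable(apc, bpc, tolerance):
    """True if every shared performance characteristic differs by no
    more than its stated tolerance."""
    return all(abs(apc[k] - bpc[k]) <= tolerance[k] for k in tolerance)

apc = {"precision_CV": 8.0, "completeness": 99.0}  # protocol A
bpc = {"precision_CV": 9.5, "completeness": 98.4}  # protocol B
tol = {"precision_CV": 2.0, "completeness": 1.0}   # invented tolerances

if comparable(apc, bpc, tol):
    print("Combine into single dataset; proceed with analyses")
else:
    print("Exclude from analyses, or adjust with corrective approaches")
```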

## **8. Conclusion**


If data of unknown quality are used, whether by themselves or in combination with others, the assumption is implicit that they are acceptable and, hence, comparable. We must acknowledge the risk of incorrect decisions when using such data and be willing to communicate those risks to data users and other decision-makers. The primary message of this chapter is that appropriate and sufficient QC activities should be a routine component of any monitoring program, whether it is terrestrial or aquatic; whether it focuses on physical, chemical, and/or biological indicators; and, if biological, whether it includes macroinvertebrates, algae/diatoms, fish, broad-leaf plants, or other organism groups.

## **9. References**


APHA. 2005. *Standard Methods for the Examination of Water and Wastewater*. 21st edition. American Public Health Association, American Water Works Association, and Water Environment Federation, Washington, DC.

Barbour, M.T. & J. Gerritsen. 1996. Subsampling of benthic samples: a defense of the fixed-count method. *Journal of the North American Benthological Society* 15:386-391.

Barbour, M.T., J. Gerritsen, B.D. Snyder & J.B. Stribling. 1999. *Rapid Bioassessment Protocols for Streams and Wadeable Rivers: Periphyton, Benthic Macroinvertebrates and Fish.* Second edition. EPA/841-D-97-002. U.S. EPA, Office of Water, Washington, DC. URL: http://water.epa.gov/scitech/monitoring/rsl/bioassessment/index.cfm.

Barbour, M.T., J.B. Stribling & P.F.M. Verdonschot. 2006. The multihabitat approach of USEPA's rapid bioassessment protocols: Benthic macroinvertebrates. *Limnetica* 25(3-4): 229-240.

Blocksom, K.A. & J.E. Flotemersch. 2005. Comparison of macroinvertebrate sampling methods for non-wadeable streams. *Environmental Monitoring and Assessment* 102:243-262.

Blocksom, K.A. & J.E. Flotemersch. 2008. Field and laboratory performance characteristics of a new protocol for sampling riverine macroinvertebrate assemblages. *River Research and Applications* 24: 373-387. DOI: 10.1002/rra.1073

Bortolus, A. 2008. Error cascades in the biological sciences: the unwanted consequences of using bad taxonomy in ecology. *Ambio* 37(2): 114-118.

Brunialti, G., P. Giordani & M. Ferretti. 2004. Discriminating between the Good and the Bad: Quality Assurance Is Central in Biomonitoring Studies. Chapter 20, pp. 443-464, IN G.B. Wiersma (editor), *Environmental Monitoring*. CRC Press.

Cao, Y. & C.P. Hawkins. 2005. Simulating biological impairment to evaluate the accuracy of ecological indicators. *Journal of Applied Ecology* 42:954-965.

Cao, Y., C.P. Hawkins & M.R. Vinson. 2003. Measuring and controlling data quality in biological assemblage surveys with special reference to stream benthic macroinvertebrates. *Freshwater Biology* 48: 1898-1911.

Carter, J.L. & V.H. Resh. 2001. After site selection and before data analysis: sampling, sorting, and laboratory procedures used in stream benthic macroinvertebrate monitoring programs by USA state agencies. *Journal of the North American Benthological Society* 20: 658-676.

Caton, L.R. 1991. Improved subsampling methods for the EPA rapid bioassessment benthic protocols. *Bulletin of the North American Benthological Society* 8:317-319.

Clarke, R.T., M.T. Furse, J.F. Wright & D. Moss. 1996. Derivation of a biological quality index for river sites: comparison of the observed with the expected fauna. *Journal of Applied Statistics* 23:311-332.

Clarke, R.T., J.F. Wright & M.T. Furse. 2003. RIVPACS models for predicting the expected macroinvertebrate fauna and assessing the ecological quality of rivers. *Ecological Modeling* 160:219-233.

Costanza, R., S.O. Funtowicz & J.R. Ravetz. 1992. Assessing and communicating data quality in policy-relevant research. *Environmental Management* 16(1):121-131.

Courtemanch, D.L. 1996. Commentary on the subsampling procedures used for rapid bioassessments. *Journal of the North American Benthological Society* 15:381-385.

Deming, W.E. 1986. *Foreword*. In, Shewhart, W.A. 1939. *Statistical Methods from the Viewpoint of Quality Control.* The Graduate School, U.S. Department of Agriculture, Washington, DC. 105 pp. Republished 1986, with a new Foreword by W.E. Deming. Dover Publications, Inc., 31 East 2nd Street, Mineola, NY.

Diamond, J.M., M.T. Barbour & J.B. Stribling. 1996. Characterizing and comparing bioassessment methods and their results: a perspective. *Journal of the North American Benthological Society* 15:713-727.

Edwards, P.N. 2004. "A vast machine": Standards as social technology. *Science* 304 (7):827-828.

Flotemersch, J.E. & K.A. Blocksom. 2005. Electrofishing in boatable rivers: Does sampling design affect bioassessment metrics? *Environmental Monitoring and Assessment* 102:263-283. DOI: 10.1007/s10661-005-6026-2

Flotemersch, J.E., J.B. Stribling & M.J. Paul. 2006. *Concepts and Approaches for the Bioassessment of Non-Wadeable Streams and Rivers*. EPA/600/R-06/127. U.S. Environmental Protection Agency, Cincinnati, OH.

Flotemersch, J.E., J.B. Stribling, R.M. Hughes, L. Reynolds, M.J. Paul & C. Wolter. 2010. Site length for biological assessment of boatable rivers. *River Research and Applications*. Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/rra.1367.

General Accounting Office (GAO). 2004. *Watershed Management: Better Coordination of Data Collection Efforts.* GAO-04-382. Washington, DC, USA. Available from: http://www.gao.gov/new.items/d04382.pdf.

Gurtz, M.E. & T.A. Muir (editors). 1994. *Report of the Interagency Biological Methods Workshop*. U.S. Geological Survey, Open File Report 94-490, Reston, Virginia, USA.

Haase, P., J. Murray-Bligh, S. Lohse, S. Pauls, A. Sundermann, R. Gunn & R. Clarke. 2006. Assessing the impact of errors in sorting and identifying macroinvertebrate samples. *Hydrobiologia* 566:505-521. DOI 10.1007/s10750-006-0075-6

Hawkins, C.P. 2006. Quantifying biological integrity by taxonomic completeness: evaluation of a potential indicator for use in regional- and global-scale assessments. *Ecological Applications* 16:1277-1294.

Hawkins, C.P., R.H. Norris, J.N. Hogue & J.W. Feminella. 2000. Development and evaluation of predictive models for measuring the biological integrity of streams. *Ecological Applications* 10:1456-1477.

Heinz Center, The. 2002. *The state of the nation's ecosystems: measuring the lands, waters, and living resources of the United States.* The H. John Heinz III Center for Science, Economics, and the Environment, Washington, DC, USA. Cambridge University Press. Available from: http://www.heinzctr.org/ecosystems/index.htm.

Herbst, D.B. & E.L. Silldorf. 2006. Comparison of the performance of different bioassessment methods: similar evaluations of biotic integrity from separate programs and procedures. *Journal of the North American Benthological Society* 25:513-530.

Hill, B.H., A.T. Herlihy, P.R. Kaufmann, R.J. Stevenson, F.H. McCormick & C.B. Johnson. 2000. Use of periphyton assemblage data as an index of biotic integrity. *Journal of the North American Benthological Society* 19:50-67.

Hill, B.H., A.T. Herlihy, P.R. Kaufmann, S.J. Decelles & M.A. Vander Borgh. 2003. Assessment of streams of the eastern United States using a periphyton index of biotic integrity. *Ecological Indicators* 2:325-338.

Hughes, R.M., P.R. Kaufmann, A.T. Herlihy, T.M. Kincaid, L. Reynolds & D.P. Larsen. 1998. A process for developing and evaluating indices of fish assemblage integrity. *Canadian Journal of Fisheries and Aquatic Sciences* 55:1618-1631.

ITFM. 1995a. *The Strategy for Improving Water Quality Monitoring in the U.S.* Intergovernmental Task Force on Monitoring Water Quality. Report #OFR95-742, U.S. Geological Survey, Reston, Virginia, USA.

ITFM. 1995b. Performance-based approach to water quality monitoring. In: *Strategy for Improving Water Quality Monitoring in the U.S.*, Appendix M, Report #OFR95-742, Intergovernmental Task Force on Monitoring Water Quality, U.S. Geological Survey, Reston, Virginia, USA.

Karr, J.R., K.D. Fausch, P.L. Angermeier, P.R. Yant & I.J. Schlosser. 1986. *Assessing Biological Integrity in Running Waters: a Method and its Rationale.* Special publication 5. Illinois Natural History Survey, Champaign, Illinois, USA.

Keith, L.H. (editor). 1988. *Principles of Environmental Sampling.* ACS Professional Reference Book. American Chemical Society, Columbus, Ohio.

Keith, L.H. 1991. *Environmental Sampling and Analysis. A Practical Guide*. Lewis Publishers, Chelsea, Michigan.

Klemm, D.J., P.A. Lewis, F. Fulk & J.M. Lazorchak. 1990. *Macroinvertebrate Field and Laboratory Methods for Evaluating the Biological Integrity of Surface Waters*. EPA/600/4-90/030. Environmental Monitoring Systems Laboratory, U.S. Environmental Protection Agency, Cincinnati, OH. 256 pp.

Klemm, D.J., J.M. Lazorchak & P.A. Lewis. 1998. Benthic macroinvertebrates. Pages 147-182 in J.M. Lazorchak, D.J. Klemm, and D.V. Peck (editors). *Environmental Monitoring and Assessment Program—Surface Waters: Field Operations and Methods for Measuring the Ecological Condition of Wadeable Streams.* EPA/620/R-94/004F. U.S. Environmental Protection Agency, Washington, DC.

Lindley, D. 2007. *Uncertainty. Einstein, Heisenberg, Bohr, and the Struggle for the Soul of Science.* Anchor Books, a Division of Random House. ISBN: 978-1-4000-7996-4. New York, NY. 257 pp.

Merritt, R.W., K.W. Cummins & M.B. Berg (editors). 2008. *An Introduction to the Aquatic Insects of North America*. Fourth Edition. Kendall/Hunt Publishing Company, Dubuque, Iowa. ISBN 978-0-7575-5049-2. 1158 pp.

Milberg, P., J. Bergstedt, J. Fridman, G. Odell & L. Westerberg. 2008. Systematic and random variation in vegetation monitoring data. *Journal of Vegetation Science* 19: 633-644. http://dx.doi.org/10.3170/2008-8-18423.

Montana DEQ. 2006. *Sample collection, sorting, and taxonomic identification of benthic macroinvertebrates. Standard operation procedure. WQPBWQM-009. Revision no. 2.* Water Quality Planning Bureau, Montana Department of Environmental Quality, Helena, Montana. (Available from: http://www.deq.mt.gov/wqinfo/QAProgram/WQPBWQM-009rev2_final_web.pdf)

Nichols, S.J., W.A. Robinson & R.H. Norris. 2006. Sample variability influences on the precision of predictive bioassessment. *Hydrobiologia* 572: 215-233. doi 10.1007/s10750-005-9003-4

NWQMC. 2001. *Towards a Definition of Performance-Based Laboratory Methods*. National Water Quality Monitoring Council Technical Report 01-02, U.S. Geological Survey, Reston, Virginia, USA.

Peters, J.A. 1988. Quality control infusion into stationary source sampling. Pages 317-333 in L.H. Keith (editor). *Principles of Environmental Sampling*. ACS Professional Reference Book. American Chemical Society, Columbus, Ohio.

Shewhart, W.A. 1939. *Statistical Methods from the Viewpoint of Quality Control.* The Graduate School, U.S. Department of Agriculture, Washington, DC. 105 pp. Republished 1986, with a new Foreword by W.E. Deming. Dover Publications, Inc., 31 East 2nd Street, Mineola, NY.

Smith, R.-K. 2000. *Interpretation of Organic Data*. ISBN 1-890911-19-4. Genium Publishing Corporation. Genium Group, Inc., Amsterdam, New York.

Stribling, J.B., S.R. Moulton II & G.T. Lester. 2003. Determining the quality of taxonomic data. *Journal of the North American Benthological Society* 22(4): 621-631.

Stribling, J.B., K.L. Pavlik, S.M. Holdsworth & E.W. Leppo. 2008a. Data quality, performance, and uncertainty in taxonomic identification for biological assessments. *Journal of the North American Benthological Society* 27(4): 906-919. doi: 10.1899/07-175.1

Stribling, J.B., B.K. Jessup & D.L. Feldman. 2008b. Precision of benthic macroinvertebrate indicators of stream condition in Montana. *Journal of the North American Benthological Society* 27(1):58-67. doi: 10.1899/07-037R.1

Taylor, B.N. & C.E. Kuyatt. 1994. *Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results*. NIST Technical Note 1297. National Institute of Standards and Technology, U.S. Department of Commerce, Washington, DC. 24 pp.

Taylor, J.K. 1988. *Defining the Accuracy, Precision, and Confidence Limits of Sample Data.* Chapter 6, pages 102-107, IN Lawrence H. Keith (editor), *Principles of Environmental Sampling*. ACS Professional Reference Book. ISBN 0-8412-1173-6. American Chemical Society. Columbus, Ohio.

Taylor, J.R. 1997. *An Introduction to Error Analysis. The Study of Uncertainties in Physical Measurements*. Second edition. University Science Books, Sausalito, California, USA.

USEPA. 2004a. *Wadeable Stream Assessment: Field Operations Manual.* EPA 841-B-04-004. Office of Water and Office of Research and Development, US Environmental Protection Agency, Washington, DC.

USEPA. 2004b. *Wadeable Stream Assessment: Benthic Laboratory Methods.* EPA 841-B-04-007. Office of Water and Office of Research and Development, US Environmental Protection Agency, Washington, DC.

U.S. GPO (Government Printing Office). 1989. *Federal Water Pollution Control Act (33 U.S.C. 1251 et seq.)* as amended by P.L. 92-500. In: Compilation of selected water resources and water pollution control laws. Printed for use of the Committee on Public Works and Transportation. Washington, DC, USA.

Vinson, M.R. & C.P. Hawkins. 1996. Effects of sampling area and subsampling procedure on comparisons of taxa richness among streams. *Journal of the North American Benthological Society* 15:392-399.

Zar, J.H. 1999. *Biostatistical Analysis. 4th edition*. Prentice Hall, Upper Saddle River, New Jersey, USA.


**5** 

## **Patient Satisfaction with Primary Health Care Services in a Selected District Municipality of the Eastern Cape of South Africa**

N. Phaswana-Mafuya1,2, A. S. Davids1, I. Senekal3 and S. Munyaka3
*1Human Sciences Research Council, Port Elizabeth,*
*2Office of the Deputy Vice Chancellor: Research and Engagement, Nelson Mandela Metropolitan University, Port Elizabeth,*
*3University of Fort Hare,*
*South Africa* 

## **1. Introduction**

Traditionally, decisions about health services were made on the basis of health providers' and health authorities' views of what is in the best interest of the patient, on the premise that members of the general public lack the technical knowledge to make fully informed decisions themselves. Currently, the use of patient satisfaction surveys (PSS) in developing countries is advancing. Professionals have recognized that a systematic, consumer-oriented perspective on patient viewpoints about the level of care can yield feedback useful for promoting higher quality standards of patient care (Dağdeviren & Akturk 2004; Newman et al. 1998; Peltzer 2009).

Patient satisfaction surveys are seen as a means of determining patients' views on primary health care (PHC) (Ajayi, Olumide & Oyediran 2005; Andaleeb 2001; Campbell, Ramsay & Green 2001). These surveys are increasingly being promoted as a means of understanding health care service quality and the demand for these services in developing countries (Glick 2009), for various reasons. First, they highlight those aspects of care that need improvement in a health care setting (Ajayi, Olumide & Oyediran 2005; Muhondwa *et al.* 2008; Newman *et al.* 1998). Second, they are simple, quick and inexpensive to administer. Third, they are critical for developing measures to increase the utilization of PHC services. Fourth, they can help to educate medical staff about their achievements as well as their failures, assisting them to be more responsive to their patients' needs. Fifth, they allow managerial judgment to be exercised from a position of knowledge rather than guesswork in the important task of managing public expectations and resources (Glick 2009).

The South African government also endorses the centrality of consumers in service delivery. The White Paper on Transforming Public Services of 1997 (Department of Public Service and Administration 1997) and the Department of Health's policy on quality in health care (Department of Health 2007) state that public services need to respond to customers' needs, wants and expectations. Feedback from consumers is required in terms of experiences of health services – quality of care received. Feedback from customers will not only improve


## **Patient Satisfaction with Primary Health Care Services in a Selected District Municipality of the Eastern Cape of South Africa**

N. Phaswana-Mafuya1,2, A. S. Davids1, I. Senekal3 and S. Munyaka3 *1Human Sciences Research Council, Port Elizabeth, 2Office of the Deputy Vice Chancellor: Research and Engagement, Nelson Mandela Metropolitan University, Port Elizabeth, 3University of Fort Hare, South Africa* 

## **1. Introduction**


Traditionally, decisions about health services were made on the basis of health-provider and health authorities' views on what is in the best interest of the patient. This was based on a view that members of the general public lack the technical knowledge to make fully informed decisions themselves. Currently, the use of patient satisfaction surveys (PSS) in developing countries is advancing. Professionals have recognized that a systematic and consumer oriented perspective toward patient viewpoints about the level of care can result in feedback useful for promoting higher quality standards of patient care (Dağdeviren & Akturk 2004; Newman et al. 1998; Peltzer 2009).

Patient satisfaction surveys are seen as a means of determining patients' views on primary health care (PHC) (Ajayi, Olumide & Oyediran 2005; Andaleeb 2001; Campbell, Ramsay & Green 2001). These surveys are increasingly being promoted as a means of understanding health care service quality and the demand for these services in developing countries (Glick 2009) for various reasons. First, they highlight those aspects of care that need improvement in a health care setting (Ajayi, Olumide & Oyediran 2005; Muhondwa *et al.* 2008; Newman *et al.* 1998). Second, they are simple, quick and inexpensive to administer. Third, they are critical for developing measures to increase the utilization of PHC services. Fourth, they can help to educate medical staff about their achievements as well as their failures, assisting them to be more responsive to their patients' needs. Fifth, they allow managerial judgment to be exercised from a position of knowledge rather than guesswork in the important task of managing public expectations and resources (Glick 2009).

The South African government also endorses the centrality of consumers in service delivery. The White Paper on Transforming Public Services of 1997 (Department of Public Service and Administration 1997) and the Department of Health's policy on quality in health care (Department of Health 2007) state that public services need to respond to customers' needs, wants and expectations. Feedback from consumers is required on their experiences of health services, that is, on the quality of care received. Feedback from customers will not only improve the knowledge of decision makers, but will also facilitate improved prioritization, strategic resource allocation and value for money. It will also serve as a platform for providing better services to citizens.

Against this background, a patient satisfaction survey with PHC services was conducted in a selected district of the Eastern Cape.

## **2. Methods**

## **2.1 Design and setting**

A cross-sectional descriptive design was employed to collect data among patients visiting 12 clinics in a selected health district of the Eastern Cape of South Africa in 2009. The majority of South Africans are dependent on the public health sector, with only 15% of the citizenry belonging to a private medical aid scheme (McIntyre, 2010). In the Eastern Cape, private medical aid covers only 10.9% of the province's population, and less than 7% of South Africa's private and independent hospitals are located in the Eastern Cape (Hospital Association of South Africa, 2011). The current study focused on public health services (the main provider of health care) in a selected district in the Eastern Cape. We did not ask about private health care utilization.

The public health system of the Eastern Cape consists of 817 clinics, 81 hospitals and 18 community health care centres. The core norms, set by the National Department of Health (NDOH) in South Africa, for primary health care services are indicated in Table 1. Statistics South Africa estimated that the mid-year population of the Eastern Cape in 2010 was 6 743 800, about 13.5% of the estimated total population of South Africa. Persons under the age of 15 years constitute 32.8% of the total population, and the economically active population (15-64 years) is 61.2% of the total population of the Eastern Cape. For the period 2006 to 2011 it is estimated that approximately 211 600 people will migrate from the Eastern Cape to other provinces of the country (Statistics South Africa, 2010). The burden of disease study for the year 2000 estimated that South Africans suffer from poverty-related diseases and conditions, emerging chronic diseases, injuries and HIV/AIDS, with differences in morbidity and mortality between socioeconomic groups (Bradshaw et al., 2003). The Eastern Cape had an estimated 23.6% unemployment rate at the end of the second quarter of 2009. This drives levels of poverty in the province: the number of people deemed to be living in poverty was 3 564 504, nearly 53% of the 2010 mid-year population. The corresponding figure for South Africa is 38%, making the province one of the poorest in the country (ECSECC, 2011).

## **2.2 Sample and procedure**

A purposive sample of 836 out of 939 patients (89% response rate) visiting 12 primary care facilities in a selected district of the Eastern Cape of South Africa was interviewed while exiting the clinic. Patients aged 18 years or above were considered eligible, provided that they were able to understand and respond to the interview questions. Patients were interviewed face to face by trained interviewers in their preferred language over five consecutive days per clinic. Two fieldworkers and one fieldwork coordinator (with at least a high school certificate) were trained per clinic. Two of the four local fieldworkers conducted the interviews in the respective clinics, while the other two served as reserves. A clinic nurse supervised and coordinated the fieldwork process in the clinic where s/he was based. Ethics approval for the study protocol was obtained from the University of Fort Hare's Research Ethics Committee, and permission to conduct the study was received from the Eastern Cape Department of Health.

**Core Norms**

- The clinic renders comprehensive integrated PHC services using a one-stop approach for at least 8 hours a day, five days a week. Access, as measured by the proportion of people living within 5 km of a clinic, is improved.
- The clinic receives a supportive monitoring visit at least once a month to support personnel, monitor the quality of service and identify needs and priorities.
- The clinic has at least one member of staff who has completed a recognised PHC course. Doctors and other specialised professionals are accessible for consultation, support and referral, and provide periodic visits.
- Clinic managers receive training in facilitation skills and primary health care management.
- There is an annual evaluation of the provision of the PHC services to reduce the gap between needs and service provision, using a situation analysis of the community's health needs and the regular health information data collected at the clinic. There is an annual plan based on this evaluation.
- The clinic has a mechanism for monitoring services and quality assurance, and at least one annual service audit.
- Community perception of services is tested at least twice a year through patient interviews or anonymous patient questionnaires.

**Core Services**

- Women's Reproductive Health
- Integrated Management of Childhood Illness
- Diseases prevented by Immunisation
- Adolescent and Youth Health
- Management of Communicable Disease
- Control of Cholera, diarrhoeal disease and dysentery
- Sexually Transmitted Diseases (STDs) and HIV/AIDS
- Prevention of Hearing Impairment due to Otitis Media
- Rheumatic Fever and Haemolytic Streptococcal Infection
- Treatment and support of victims of Sexual Offenses, Domestic Violence and Gender Violence
- Chronic Diseases, Diabetes, Hypertension
- Trauma and Emergency
- Oral and Mental Health
- Substance Abuse
- Rehabilitation Services
- Geriatric care
- Helminths
- Malaria
- Rabies
- Tuberculosis
- Leprosy

Source: Department of Health, South Africa

Table 1. The core norms and services for primary health care (PHC) set by the NDOH.

## **2.3 Data collection method**

A patient satisfaction questionnaire adapted from the one developed by the Health Systems Trust in 2004 was used. Only slight changes were made to the questionnaire, in collaboration with the Eastern Cape Department of Health, to allow for cross-comparisons with earlier patient surveys that have been undertaken within the Eastern Cape Province using the same questionnaire. Further, some questions were asked on demographics, health status, reason for the health visit, and health care utilization. The questionnaire was translated from English into Afrikaans and Xhosa. The Xhosa and Afrikaans versions were developed using back-translation methods (Brislin 1970). The procedure entailed having two native speakers of the target languages independently do a back-translation. Discrepancies were arbitrated by a third consultant, and solutions were reached by consensus. The translated questionnaire underwent pilot-testing.

## **2.4 Measures**

The questionnaire included demographics and eight domains, each having several items on a 5-point Likert scale: Strongly Agree=5; Agree=4; Unsure=3; Disagree=2; and Strongly Disagree=1.

## **2.5 Data analysis**

Data were captured in SPSS version 17.0 and analysed. Frequency distributions of domain items were made, and positive responses (Agree and Strongly Agree) were grouped and are presented. Cross-tabulations of domain items by gender were made. Chi-square tests were performed to determine the relationship between each domain item and gender.

## **2.6 Limitations**

Response biases introduced through the methodology of using exit interviews might act as filters and influence patient satisfaction ratings. For example, exit interviews automatically select out those who do not have access to public health facilities but would otherwise have used services. In addition, using exit interviews in health facilities identified by the sub-district officials means that respondents were purposively selected. Non-randomisation in the selection of respondents means that results are more difficult to generalise to a feeder population around a health facility. The study compensated for this limitation by collecting data from each facility over a week during a period of normal use and through achieving a high number of respondents. A further limitation is that the existing PSS methodology does not enable the relationship between aggregate satisfaction scores and changes in health status of populations to be explored.

## **3. Results**

## **3.1 Sample characteristics**

The majority of the respondents were African (50.9%), female (72.9%) and unemployed (56.4%), with a mean age of 39.4 years. Only 5.5% of the respondents indicated that they had enough money to meet their basic needs most of the time. Almost 85% had some form of formal education.

| **Characteristic** | **N (%)** |
|---|---|
| Mean age: M (SD) | 39 years (14.91) |
| **Gender** | |
| Male | 230 (29.9) |
| Female | 674 (72.9) |
| **Race** | |
| African | 469 (50.9) |
| White | 26 (2.8) |
| Indian | 10 (1.1) |
| Coloured | 375 (40.7) |
| Other | 39 (4.2) |
| **Occupation** | |
| Employed | 335 (37.1) |
| Not Employed | 509 (56.4) |
| Other | 59 (6.4) |
| **Highest Level of Education** | |
| None | 139 (15.4) |
| Finished primary | 325 (36.0) |
| Finished Grade 10 | 266 (29.4) |
| Finished Grade 12 | 108 (11.9) |
| Degree/Diploma | 12 (1.3) |
| Other | 54 (6.0) |
| **Enough Money to meet own needs** | |
| None | 354 (39.3) |
| A little | 318 (35.3) |
| Moderately | 112 (12.4) |
| Mostly | 40 (4.4) |
| Completely | 10 (1.1) |
| Other | 67 (7.4) |

Table 2. Demographic Characteristics.

## **3.2 Utilization of health services**

Respondents visited clinics most frequently (a mean of 8.48 visits in 12 months), followed by private doctors (1.58 visits), hospitals (1.14 visits) and, least visited, traditional healers (0.23 visits). The main reason for visiting the health facility was to get treatment (41.8%), followed by non-communicable diseases (NCDs) (11.5%) and family planning (10.1%).

| **Items** | **M (SD)** |
|---|---|
| Mean no. of clinic visits in 12 months | 8.48 (6.798) |
| Mean no. of hospital visits in 12 months | 1.14 (1.905) |
| Mean no. of private doctor visits in 12 months | 1.58 (2.520) |
| Mean no. of traditional healer visits in 12 months | 0.23 (1.121) |
| **Main reason for visiting health facility** | **N (%)** |
| Non-communicable diseases | 108 (11.5) |
| Communicable diseases | 40 (4.3) |
| Treatment | 391 (41.8) |
| Treatment (for baby or child) | 57 (6.1) |
| Bodily aches | 47 (5.0) |
| Family Planning | 95 (10.1) |
| Other/Unidentified | 198 (21.2) |

Table 3. Health Care Utilization.
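The response coding described in section 2.5 (grouping "Agree" and "Strongly Agree" as a positive response on the 5-point scale) can be sketched in a few lines. This is a minimal illustration with made-up responses, not the study's data or its SPSS syntax.

```python
# Collapse a 5-point Likert item into a binary "positive response"
# (Agree = 4 or Strongly Agree = 5) and report the share of positive
# responses, as in section 2.5. Responses below are illustrative only.

LIKERT = {"Strongly Disagree": 1, "Disagree": 2, "Unsure": 3,
          "Agree": 4, "Strongly Agree": 5}

def percent_positive(responses):
    """Percentage of responses scored 4 or 5 on the 5-point scale."""
    scores = [LIKERT[r] for r in responses]
    positive = sum(1 for s in scores if s >= 4)
    return 100.0 * positive / len(scores)

sample = ["Agree"] * 3 + ["Strongly Agree"] + ["Unsure"] + ["Disagree"]
print(round(percent_positive(sample), 1))  # 4 of 6 responses are positive -> 66.7
```

Applied per domain item, the same grouping yields the per-item counts and percentages of positive responses reported in Tables 6 to 10.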

Patient Satisfaction with Primary Health Care Services

It takes longer than an hour to go to the

I don't think healthworkers/nurses come

It cost more than R10.00 to get to the

The nurse who treated me spoke in a

When I come to this clinic I'm always treated & never told to return on another

The clinic is user friendly to disabled

Getting through to the clinic on the

Table 6. Perceived Access to PHC Services.

made them feel they had time during consultations.

Being able to speak to the nurse

**4.2 Perceived empathy** 

**4.3 General satisfaction** 

staff were helpful.

day

**Item Men: N** 

in a Selected District Municipality of the Eastern Cape of South Africa 91

clinic 70 (27.5) 184 (72.2) 255 (27.2) 0.067

clinic 50 (29.2) 120 (70.2) 171 (18.2) 0.218 The clinic has convenient opening hours 161 (25.8) 461 (74.0) 623 (66.3) 0.982

often enough to the place where I stay 76 (27.4) 201 (72.6) 277 (29.5) 0.689 I paid money to be treated in this clinic 28 (24.6) 86 (75.4) 144 (12.1) 0.088

language I understood 191 (25.5) 556 (74.3) 748 (79.7) 0.07

persons 153 (25.4) 449 (74.6) 602 (64.1) 0.265 Getting an appointment to suit you 114 (23.7) 367 (76.3) 481 (51.2) 0.429

phone 101 (25.0) 303 (75.0) 404 (43.1) 0.834

practitioner on the telephone 93 (26.6) 256 (73.4) 349 (37.2) 0.588

Women, when compared to men, were also more positive in their responses to items of the empathy domain. More than three quarters of women respondents agreed that their privacy was respected by all the staff involved in their treatment, that the nurse/doctor who treated them was polite and that they could answer all questions about their illness. The same number felt that this made it easy to tell the doctor/nurse about their problems. Just under three quarters of women respondents agreed that the nurse/doctor who treated them introduced themselves, that they gave their permission to be examined and treated and

Larger proportions of women when compared to men had positive responses on items of this domain. Almost eight in ten women respondents positively agreed that patients do not usually appreciate all that the clinic staff does for them. More than three in four also agreed that staff do inform clients of changes in service, as well as any delays in services, on occasion. The same proportion of women agreed that their treatment is always better when an injection is administered and that they are pleased with the way they were treated at the clinic. Nearly three quarters agreed that they always get treatment when attending the clinic where they were interviewed and that they would attend the same facility again on another occasion. The same number will also recommend the clinic to friends and family when should they be sick. Despite these figures, just more than six in ten women agreed that the

**(%)** 

**Women: N (%)** 

155 (25.8) 445 (74.0) 601 (64.0) 0.478

**Total: N** 

**(%) <sup>P</sup>**

## **3.3 Symptom reporting**

More than two-thirds reported that coughing, headache, fever, and body/limb aches were the symptoms suffered in descending order.


Table 4. Symptom Reporting.

## **3.4 Prior diagnosis**

More than 60% of the respondents indicated that they had prior diagnosis of other STIs (95%), other illnesses (88.9%), TB (78.4%), Diabetes (72.2%), High Blood Pressure (69.3%) and HIV (65.2%).


Table 5: Prior diagnosis.

## **4. Descriptions of patients evaluations: percentage of patients who used the most positive answering category by sex (N=836, percentages)**

## **4.1 Access to PHC services**

A larger portion of women respondents positively agreed with the items from the access domain, than did men. More than three quarters of women agreed that it was possible to get an appointment that suited them and about the same number indicated that no payment was required for treatment at that clinic. About three quarters of women also agreed that it was possible to get through to the clinic by telephone and that the clinic was disabilityfriendly. The same number of women also agreed that they were treated by nurses who spoke a language they could understand and that the clinic's opening hours were convenient. Nearly 75% of women further agreed that they are always treated and not asked to return on another day and that is was possible to speak to the nurse on the phone. More than seven in ten women did not think that nurses did not visit their places of residence often enough. In terms of time and financial costs, about seven in ten women respondents agreed that the journey there took longer than one hour and that it costs more than R10-00 (US\$ 1.46) to get to the clinic.


Table 6. Perceived Access to PHC Services.

## **4.2 Perceived empathy**

90 Modern Approaches To Quality Control

More than two-thirds reported that coughing, headache, fever, and body/limb aches were

More than 60% of the respondents indicated that they had prior diagnosis of other STIs (95%), other illnesses (88.9%), TB (78.4%), Diabetes (72.2%), High Blood Pressure (69.3%)

**4. Descriptions of patients evaluations: percentage of patients who used the** 

A larger portion of women respondents positively agreed with the items from the access domain, than did men. More than three quarters of women agreed that it was possible to get an appointment that suited them and about the same number indicated that no payment was required for treatment at that clinic. About three quarters of women also agreed that it was possible to get through to the clinic by telephone and that the clinic was disabilityfriendly. The same number of women also agreed that they were treated by nurses who spoke a language they could understand and that the clinic's opening hours were convenient. Nearly 75% of women further agreed that they are always treated and not asked to return on another day and that is was possible to speak to the nurse on the phone. More than seven in ten women did not think that nurses did not visit their places of residence often enough. In terms of time and financial costs, about seven in ten women respondents agreed that the journey there took longer than one hour and that it costs more

**most positive answering category by sex (N=836, percentages)** 

**3.3 Symptom reporting** 

Table 4. Symptom Reporting.

**3.4 Prior diagnosis** 

and HIV (65.2%).

Table 5: Prior diagnosis.

**4.1 Access to PHC services** 

than R10-00 (US\$ 1.46) to get to the clinic.

the symptoms suffered in descending order.

**Symptoms N (%)**  Coughing 183 (87.1) Body/limb aches 110 (80.3) Fever 153 (85.0) Rash 46 (63.0) Headache 178 (86.8) Diarrhoea 23 (46.0)

**Prior Diagnosis N (%)**  TB 105 (78.4) HIV 56 (65.1) Diabetes 78 (72.2) Other STI 891 (94.9) High Blood Pressure 651 (69.3) Pregnancy 38 (55.1) Other illness 835 (88.9)

Women, when compared to men, were also more positive in their responses to items of the empathy domain. More than three quarters of women respondents agreed that their privacy was respected by all the staff involved in their treatment, that the nurse/doctor who treated them was polite and that they could answer all questions about their illness. The same number felt that this made it easy to tell the doctor/nurse about their problems. Just under three quarters of women respondents agreed that the nurse/doctor who treated them introduced themselves, that they gave their permission to be examined and treated and made them feel they had time during consultations.

### **4.3 General satisfaction**

Larger proportions of women when compared to men had positive responses on items of this domain. Almost eight in ten women respondents positively agreed that patients do not usually appreciate all that the clinic staff does for them. More than three in four also agreed that staff do inform clients of changes in service, as well as any delays in services, on occasion. The same proportion of women agreed that their treatment is always better when an injection is administered and that they are pleased with the way they were treated at the clinic. Nearly three quarters agreed that they always get treatment when attending the clinic where they were interviewed and that they would attend the same facility again on another occasion. The same number will also recommend the clinic to friends and family when should they be sick. Despite these figures, just more than six in ten women agreed that the staff were helpful.

Patient Satisfaction with Primary Health Care Services

If I can't be helped here I will be referred

Nurses in this facility call an ambulance if

When I'm sick I usually visit a traditional

Nurses in this facility ask patients to

Table 9. Referral.

are satisfactory

complaints

| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| The nurse/Doctor who treated me introduced him/herself | 121 (25.9) | 348 (73.9) | 468 (49.8) | 0.49 |
| The nurse/Doctor who treated me answered all questions about my illness | 162 (24.8) | 490 (75.0) | 653 (69.5) | 0.474 |
| I gave permission to be examined and treated | 185 (25.4) | 543 (74.5) | 729 (77.6) | 0.908 |
| My privacy was respected by all the staff | 172 (24.2) | 538 (75.7) | 711 (75.7) | 0.022 |
| The nurse/doctor who treated me was polite | 162 (24.3) | 503 (75.5) | 666 (70.9) | 0.251 |
| The nurses in this clinic are very interested in their clients | 165 (25.2) | 490 (74.7) | 656 (69.9) | 0.883 |
| Making you feel you had time during consultations | 168 (26.4) | 469 (73.6) | 638 (67.9) | 0.481 |
| Interest in your personal situation | 163 (26.1) | 462 (74.0) | 626 (66.6) | 0.425 |
| Making it easy for you to tell him or her about your problems | 164 (24.8) | 497 (75.2) | 662 (70.5) | 0.721 |

Table 7. Empathy.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| I was pleased with the way I was treated at this clinic | 172 (24.0) | 544 (75.9) | 717 (76.4) | 0.034 |
| Next time I am ill I will come back here | 184 (25.0) | 551 (74.9) | 736 (78.4) | 0.034 |
| If my friends/family are sick I will tell them to come to this facility | 172 (25.0) | 514 (74.8) | 687 (73.2) | 0.426 |
| My treatment is always better if I have an injection | 117 (23.6) | 379 (76.4) | 496 (52.8) | 0.147 |
| Patients don't usually appreciate all that staff in this clinic do for them | 97 (21.6) | 353 (78.4) | 450 (47.9) | 0.008 |
| I always get treatment when I come here | 172 (24.6) | 526 (75.3) | 699 (74.4) | 0.397 |
| Staff informs clients of delays in service from time to time | 136 (23.7) | 438 (76.3) | 574 (61.1) | 0.042 |
| Staff informs clients of changes in service from time to time | 128 (23.3) | 421 (76.7) | 549 (58.5) | 0.019 |
| The helpfulness of staff | 277 (37.3) | 465 (62.7) | 624 (66.4) | 0.715 |

Table 8. General Satisfaction.

## **4.4 Referral**

The items in this domain also received a majority of positive responses from women respondents. For example, more than three quarters of women agreed that if they cannot be helped at the clinic they will be referred to the nearest hospital or doctor. The same number were also sure that nurses in this facility will call an ambulance if a client is very sick and that nurses in that facility ask patients to return to see how they are doing. The role of traditional healers is still an important aspect of health care, as more than three quarters of women responded that they usually visit a traditional healer before coming to the clinic.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| If I cannot be helped in this clinic I will be referred to the nearest hospital/Doctor | 163 (24.4) | 503 (75.4) | 667 (71.0) | 0.216 |
| Nurses in this facility will call an ambulance if a client is very sick | 167 (24.4) | 517 (75.5) | 685 (72.9) | 0.358 |
| Nurses in this facility ask patients to return to see how they are doing | 152 (24.2) | 475 (75.6) | 628 (66.9) | 0.321 |
| I usually visit a traditional healer before I come to clinic | 42 (23.3) | 137 (76.1) | 180 (19.2) | 0.395 |

Table 9. Referral.


## **4.5 Service standards**

Items in the service standards domain elicited more positive responses from women than from men. More than three in four women responded that they knew either the chairperson or a member of the clinic committee of that clinic, that the health worker who assisted them had a name tag on him/her, that they knew where and to whom to raise complaints, and that they knew of the availability of a suggestion box at the clinic. The same number also agreed that the registration procedures in the clinic were satisfactory, that waiting time before examination was reasonable, and that there were fast queues in the clinic for certain services. Just fewer than three in four women agreed that when they had reason to complain, they received feedback and that such action improved service delivery.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| The registration procedures in this clinic | 148 (24.7) | 451 (75.2) | 600 (63.9) | 0.005 |
| In this clinic the time I had to wait before I was examined was reasonable | 127 (25.5) | 372 (74.5) | 499 (53.1) | 0.449 |
| There are fast queues in this clinic (e.g. under 5 Immunisation, TB clients, etc) | 114 (24.9) | 342 (74.8) | 457 (48.7) | 0.849 |
| The health worker that assisted me had a name tag on him/her | 150 (23.9) | 476 (75.9) | 627 (66.8) | 0.248 |
| I know where and to whom to raise my complaints | 89 (26.0) | 253 (74.0) | 342 (36.4) | 0.602 |
| When I complain I write it and put it in the suggestion box provided | 85 (24.7) | 259 (75.3) | 344 (36.6) | 0.606 |
| Raising complaints/suggestions improve service delivery | 87 (27.0) | 235 (73.0) | 322 (34.3) | 0.042 |
| I know the chairperson/member of the clinic committee of this facility | 48 (23.4) | 157 (76.6) | 205 (21.8) | 0.515 |
| When I complained I received feedback | 61 (24.6) | 187 (75.4) | 248 (26.4) | 0.723 |

Table 10. Service Standards.

Patient Satisfaction with Primary Health Care Services in a Selected District Municipality of the Eastern Cape of South Africa

## **4.6 Reliability**

More than three quarters of women judged services as reliable, as they did not wait long before receiving medication and the clinic provided quick services for urgent health problems. Just under this figure regarded general waiting time in waiting rooms as positive.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| If I received medicines or pills I did not have to wait long for them | 140 (24.2) | 439 (75.8) | 579 (61.7) | 0.003 |
| Waiting time in the waiting room | 106 (25.2) | 314 (74.8) | 420 (44.7) | 0.062 |
| Providing quick services for urgent health problems | 136 (24.2) | 425 (75.8) | 562 (59.8) | 0.161 |

Table 11. Reliability.

## **4.7 Health promotion**

A majority of women respondents were positive on items referring to health promotion at the clinic. For example, more than three quarters agreed that as patients are waiting to be seen, health workers in the clinic sometimes give talks on health related issues affecting the community. Also, nearly three in four replied that when they had to wait at the clinic, very useful things can be learnt from the posters and other IEC materials. The reason for this was that the posters and other IEC materials, the 'Batho Pele' (people first) principles and the patients' rights charter, were all in a language they could understand.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| I saw on the walls of this clinic a Patients Rights Charter in a language I could understand | 148 (26.3) | 413 (73.5) | 562 (59.9) | 0.887 |
| I saw on the walls of this clinic Batho Pele Principles in a language I could understand | 141 (26.3) | 395 (73.6) | 537 (57.2) | 0.719 |
| When I had to wait in this clinic I sometimes learn very useful things from the posters and other IEC (Information, Education & Communication) materials | 138 (24.9) | 415 (74.9) | 554 (59.0) | 0.417 |
| The posters and other IEC material are in a language I understand | 145 (25.0) | 435 (74.9) | 581 (61.9) | 0.927 |
| As patients are waiting to be seen, health workers in this facility sometimes talk to us about health related issues that affect our community | 113 (22.9) | 379 (76.9) | 493 (52.5) | 0.177 |

Table 12. Health Promotion.

## **4.8 Tangibles**


Items under the tangibles domain also yielded positive responses from the majority of women respondents. More than three in four women agreed that the toilets were clean and in a good condition, that there were indeed toilets for patients in the clinic, that the clinic had enough consultation rooms, and that there were enough benches for patients to sit on while waiting to be seen by health workers. Just under three quarters agreed that there was clean drinking water for patients, that the building was in a good condition and the clinic and its surroundings were clean, and that the services and hours of service displayed on the board outside the clinic were clear and in a language that could be understood.


| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| The clinic building is in a good condition | 174 (25.7) | 501 (74.1) | 676 (72.0) | 0.976 |
| The clinic and its surroundings are clean | 182 (25.5) | 532 (74.4) | 715 (76.1) | 0.903 |
| There are toilets for patients in this clinic | 167 (24.1) | 526 (75.8) | 694 (73.9) | 0.026 |
| The toilets are in a good condition | 150 (23.7) | 483 (76.2) | 634 (67.5) | 0 |
| The toilets are clean | 146 (23.7) | 468 (76.1) | 615 (65.5) | 0.009 |
| The clinic has enough consultation rooms | 137 (24.5) | 421 (75.3) | 559 (59.5) | 0.282 |
| There are benches for patients to sit while waiting to be seen by health worker | 165 (24.5) | 507 (75.3) | 673 (71.7) | 0.615 |
| There is clean water for patients in this clinic | 171 (25.1) | 509 (74.7) | 681 (72.5) | 0.432 |
| The services rendered and hours of service are clearly displayed on a board outside the facility | 149 (26.0) | 424 (73.9) | 574 (61.1) | 0.463 |
| The services and hours of service displayed on the board outside are in a language I can understand | 142 (25.2) | 421 (74.6) | 564 (60.1) | 0.378 |

Table 13. Tangibles.

## **4.9 Assurance**

A greater percentage of women than men also responded positively to items of the assurance domain. More than three quarters of women agreed that the staff at the clinic had given preferential treatment to patients who looked more ill, that the nurses were able to tell them more about their illness and symptoms, and that they were told how to store and self-administer their medication. The same proportion also agreed that health workers gave them help in dealing with the emotional problems related to their health status, that they felt comfortable bringing their partners to the facility when requested, and that they felt assured that their treatment records remained confidential. Exactly three quarters agreed with their physical examination by health workers, with the help given in making patients understand the importance of following medical advice, and with the preparation of patients as to what to expect from specialist or hospital care.

Just under three quarters of women respondents agreed that attending the health service meant quick relief of one's symptoms, that the explanations of the purpose of tests and treatments were clear, and that they felt compelled to complete their treatment as instructed. Slightly fewer than three in four women agreed that health workers at the facility listened to patients, could get them to return when asked to do so, and involved their patients in decisions affecting their medical care. The same proportion of women felt that health workers also helped patients to feel well enough to perform normal daily activities, were thorough, knew what advice had been given to patients previously, and were competent in offering advice on the prevention of diseases.

| Item | Men: N (%) | Women: N (%) | Total: N (%) | P |
|---|---|---|---|---|
| At the time I was waiting to be seen by a Health Worker there was a patient that looked more ill | 89 (23.8) | 285 (76.2) | 374 (39.8) | 0.172 |
| I always return when asked by the nurse to come back | 183 (25.4) | 536 (74.4) | 720 (76.7) | 0.381 |
| I finish all my treatment as instructed | 190 (25.3) | 559 (74.5) | 750 (79.9) | 0.885 |
| I bring my partner(s) when requested to | 157 (24.7) | 478 (75.2) | 636 (67.7) | 0.852 |
| I was told how to take my pills/medication | 187 (24.6) | 572 (75.3) | 799 (80.9) | 0.433 |
| I was told how to store my pills/medication | 171 (24.5) | 527 (75.4) | 699 (74.4) | 0.136 |
| Involving you in decisions about your medical care | 168 (25.8) | 483 (74.2) | 652 (69.4) | 0.608 |
| Listening to you | 182 (25.5) | 533 (74.5) | 716 (76.3) | 0.934 |
| Keeping your records and data confidential | 175 (24.6) | 529 (75.1) | 705 (75.0) | 0.956 |
| Quick relief of your symptoms | 167 (25.4) | 490 (74.6) | 658 (70.1) | 0.917 |
| Helping you to feel well so that you can perform your normal daily activities | 172 (25.9) | 493 (74.1) | 665 (70.9) | 0.213 |
| Thoroughness | 150 (26.5) | 416 (73.5) | 567 (60.4) | 0.499 |
| Physical examination of you | 155 (25.0) | 466 (75.0) | 622 (66.3) | 0.202 |
| Offering you services for preventing diseases | 168 (26.2) | 473 (73.8) | 642 (68.4) | 0.228 |
| Explaining the purpose of tests and treatments | 173 (25.4) | 509 (74.6) | 683 (72.7) | 0.356 |
| Telling you what you wanted to know about your symptoms and/or illness | 162 (24.0) | 514 (76.0) | 678 (72.2) | 0.604 |
| Help in dealing with emotional problems related to your health status | 161 (24.7) | 490 (75.3) | 651 (69.3) | 0.237 |
| Helping you understand the importance of following his or her advice | 168 (25.1) | 502 (75.0) | 671 (71.4) | 0.488 |
| Knowing what s/he had done or told you during previous contacts | 162 (26.0) | 461 (74.0) | 624 (66.5) | 0.406 |
| Preparing you for what to expect from specialist or hospital care | 155 (25.1) | 463 (75.0) | 619 (66.0) | 0.914 |

Table 14. Assurance.

## **5. Discussion**

Seeking to understand patient perspectives is an important step in the efforts to improve the quality of health care. Research examining patient satisfaction with health care provision in South Africa and, more specifically, the perceived quality of care given by the health care providers is limited (Myburgh et al., 2005). In this study, there were consistently significant differences regarding patient satisfaction between male and female patients across selected items in the various domains.
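The gender comparisons in the tables above report a P value per item for a 2×2 (gender × response) breakdown. As an illustrative sketch only — the chapter does not state which test statistic was used, so the Pearson chi-square test below is an assumption, and the counts are hypothetical — such a P value could be computed as:

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df, no continuity
    correction) for a 2x2 table: rows = gender, cols = agree/disagree."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, r, col in ((a, row1, col1), (b, row1, col2),
                        (c, row2, col1), (d, row2, col2)):
        exp = r * col / n          # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    # survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: men vs. women agreeing/disagreeing with one item
chi2, p = chi_square_2x2([[120, 40], [480, 110]])
print(round(chi2, 3), round(p, 3))
```

In practice a library routine (e.g. a contingency-table test with continuity correction) would typically be preferred; the hand-rolled version above only makes the arithmetic behind a single table cell's P value explicit.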

Evidence from developed countries for gender differences in mean satisfaction levels is mixed. Some authors report that women are more satisfied than men with medical care received (Weiss, 1988), and some report that women are more critical of medical care than men #(Kaplan, 1996), whilst a 2005 Canadian study (Human Resources and Skills Development Canada, 2009) found almost similar satsifaction levels between male (86%) and female (84%) patients. However, a meta-analysis of 110 studies of patient satisfaction, using standard instruments, concluded that there was no average difference in satisfaction with medical care between women and men (Hall & Dornan, 1990). More recently, Sanmartin et al. (2002) suggested that user frequency might influence the descrepancies found betwee male and female patient statisfaction rates and that the type of service being assessed might be a further factor.

Wessels et al. (2010) found that amongst oncology patients women rated the care aspects of services more highly. A recent Ugandan study found some gender and age differences in patient satisfaction with TB services (Babikako et al., 2011). Past experience, and consequently patient expectations, they argue, might influence age and gender differences in patient satisfaction.

What is common across these studies is the importance of considering the influence of demographic variables on patient satisfaction. Simply controlling for demographic differences might result in the needs of important demographic groupings being overlooked. In addition, demographic differences such as gender are likely to shape patients' needs and preferences and might be a particularly important consideration in shaping specific health services to better meet needs and support treatment adherence. In the South African context, the role that gender plays in patient satisfaction and the gender differences in patient satisfaction need further exploration. We conclude that quality improvement and research in primary care could benefit from gender analysis of patient satisfaction data and from more gender-sensitive patient satisfaction measures.

## **6. Acknowledgements**

We thank the Eastern Cape Department of Health for financially supporting the study. We would like to extend our gratitude to the district manager and clinic managers of the 12 clinics for overseeing the study in their respective clinics.

Our gratitude is also due to the fieldworkers and patients who agreed to be interviewed.

## **7. References**




**6** 

## **Application of Sampling Strategies for Hot-Mix Asphalt Infrastructure: Quality Control-Quality Assurance Sampling; Specification for Performance Test Requirements**

Bor-Wen Tsai<sup>1</sup>, Jiangmiao Yu<sup>1,2</sup> and Carl L. Monismith<sup>1</sup>
*<sup>1</sup>University of California at Berkeley, U.S.A.; <sup>2</sup>South China University of Technology, P.R.C.* 

## **1. Introduction**


Ajayi, I.O., Olumide, E.A. & Oyediran, O., 2005, 'Patient satisfaction with the services provided at a general outpatients' clinic, Ibadan, Oyo State, Nigeria', *African Journal of Medicine & Medical Science* 34(2), 133-140.

Andaleeb, S.S., 2001, 'Service quality perceptions and patient satisfaction: a study of hospitals in a developing country', *Social Science & Medicine* 52, 1359–1370.

Babikako, H.M., Neuhauser, D., Katamba, A. & Mupere, E., 2011, 'Patient satisfaction, feasibility and reliability of satisfaction questionnaire among patients with pulmonary tuberculosis in urban Uganda: a cross-sectional study', *Health Research Policy and Systems* 9:6. Available from: http://www.health-policy-systems.com/content/9/1/6

Bediako, M.A., Nel, M. & Hiemstra, L.A., 2006, 'Patients' satisfaction with government health care and services in the Taung district, North West Province', *Curationis* 29(2), 12-15.

Bradshaw, D., Groenewald, P., Laubscher, R., Nannan, N., Nojilana, B., Norman, R., Pieterse, D. & Schneider, M., 2003, *Initial Burden of Disease Estimates for South Africa, 2000*, Cape Town: South African Medical Research Council.

Brislin, R.W., 1970, 'Back translation for cross-cultural research', *Journal of Cross-Cultural Psychology* 1(3), 185-216.

Campbell, J.L., Ramsay, J. & Green, J., 2001, 'Age, gender, socioeconomic and ethnic differences in patients' assessments of primary health care', *Quality in Health Care* 10, 90-95.

Dağdeviren, N. & Akturk, Z., 2004, 'An evaluation of patient satisfaction in Turkey with the EUROPEP instrument', *Yonsei Medical Journal* 45(1), 23-28.

De Jager, J. & Du Plooy, T., 2007, 'Service quality assurance and tangibility for public health care in South Africa', *Acta Commercii* 7, 96-117.

Department of Health, 2007, *A policy on quality in health care for South Africa*, Department of Health, Pretoria.

Department of Public Service and Administration, 1997, *Transforming Public Service Delivery*, Department of Public Service and Administration, Pretoria.

Eastern Cape Socio-Economic Consultative Council (ECSECC), 2011, Statistics at your fingertips, http://www.ecsecc.org/statistics-database [Accessed 28 June 2011]

Glick, P., 2009, 'How reliable are surveys of client satisfaction with healthcare services? Evidence from matched facility and household data in Madagascar', *Social Science & Medicine* 68(2), 368-379.

Hall, J.A. & Dornan, M.C., 1990, 'Patient socio-demographic characteristics as predictors of satisfaction with medical care: A meta-analysis', *Social Science & Medicine* 30(7), 811-818.

Hospital Association of South Africa, 2011, http://www.hasa.co.za/hospitals/members/ [Accessed 28 June 2011]

Kaplan, S.H., Sullivan, L.M. & Spetter, D., 1996, 'Gender and patterns of physician-patient communication', in M.M. Falik & K.S. Collins (eds.), *Women's Health: The Commonwealth Fund Survey*, Baltimore, MD: Johns Hopkins University Press.

McIntyre, D., 2010, *Private sector involvement in funding and providing health services in South Africa: implications for equity and access to health care*, EQUINET Discussion Paper Series 84, Health Economics Unit (UCT), ISER Rhodes University, EQUINET: Harare.

Muhondwa, E.P., Leshabari, M.T., Mwangu, M., Mbembati, N. & Ezekiel, M.J., 2008, 'Patient satisfaction at the Muhimbili National Hospital in Dar es Salaam, Tanzania', *East African Journal of Public Health* 5(2), 67-73.

Myburgh, N.G., Solanki, G.C., Smith, M.J. & Lalloo, R., 2005, 'Patient satisfaction with health care providers in South Africa: the influences of race and socioeconomic status', *International Journal for Quality in Health Care* 17(6), 473-477.

Newman, R.D., Gloyd, S., Nyangezi, J.M., Machobo, F. & Muiser, J., 1998, 'Satisfaction with Outpatient Health Care Services in Manica Province, Mozambique', *Health Policy & Planning* 13(2), 174-180.

Peltzer, K., 2000, 'Community perceptions of biomedical health care in a rural area in the Limpopo Province South Africa', *Health SA Gesondheid* 5(1), 55-63.

Peltzer, K., 2009, 'Patient experiences and health system responsiveness in South Africa', *BMC Health Services Research* 9, 117, DOI:10.1186/1472-6963-9-117.

Sanmartin, C., Houle, C., Berthelot, J. & White, K., 2002, *Access to Health Care Services in Canada, 2001*, Statistics Canada, Ottawa.

Weiss, G.L., 1988, 'Patient Satisfaction with Primary Medical Care: Evaluation of Sociodemographic and Predispositional Factors', *Medical Care* 26(4), 383-392.

Wessels, H., De Graff, A., Wynia, K., De Heus, M., Kruitwagen, C.L.J.J., Woltjer, G.T.G.J., Teunissen, S.C.C.M. & Voest, E., 2010, 'Gender-Related Needs and Preferences in Cancer Care Indicate the Need for an Individualized Approach to Cancer Patients', *The Oncologist* 15, 648–655, doi:10.1634/theoncologist.2009-0337.

Due to the lack of a rational, effective, and systematic quality control-quality assurance (QC/QA) methodology, the nonconformity of construction quality with design requirements for public works, especially for civil engineering infrastructure systems, can result in increased expenditures over time. Thus, development of a rational QC/QA methodology to ensure that construction quality complies with design requirements should have a high priority. A limited sample size, constrained by considerations of cost and time, may result in the misjudgement that the construction quality does not meet the design requirements.

In this chapter, the effects of sampling size, sampling strategies, and acceptance/rejection criteria for QC/QA projects using statistically based decision making in hot-mix asphalt (HMA) construction are presented. Increased interest has also developed recently in ensuring that the HMA as placed will meet certain performance requirements by measuring the actual performance parameters on test specimens prepared from in situ samples rather than relying on surrogate values such as asphalt content and aggregate gradation. Examples include direct measures of mix permanent deformation characteristics and fatigue characteristics, mix stiffness, and degree of compaction as measured by air-void content.

Determination of sample size is primarily based on an acceptable error level for a performance parameter specified by the agency. It is not uncommon for agencies to base quality assurance on three samples. Using the *t* distribution, discussion is presented as to why it is not appropriate to take only this number of samples for quality assurance. Based on only three samples in a large project, the agency will have insufficient power to reject the null hypothesis when it is false, unless the quality delivered by the contractor is so poor that the agency is confident enough to reject the project.
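The low-power argument above can be illustrated numerically. The sketch below uses a normal approximation to the power of a one-sided test (the exact calculation would use the noncentral *t* distribution), and the effect sizes are hypothetical, not taken from the chapter:

```python
import math
from statistics import NormalDist

def power_one_sided(delta_over_sd, n, alpha=0.05):
    """Approximate power of a one-sided test of H0: mu >= C_S when the
    true mean falls delta_over_sd standard deviations below C_S.
    Normal approximation to the exact noncentral-t power."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # effect expressed in standard-error units, shifted by the critical value
    return NormalDist().cdf(delta_over_sd * math.sqrt(n) - z_crit)

# With a true mean 0.5 SD below spec, three samples give little power,
# while thirty samples detect the same shortfall far more reliably:
print(round(power_one_sided(0.5, 3), 2))
print(round(power_one_sided(0.5, 30), 2))
```

Under these assumed numbers, power with *n* = 3 stays well below conventional targets, which is the point the chapter makes about three-sample acceptance plans.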

In addition to providing a general introduction to fundamental statistics and hypothesis testing, two case studies are used to clarify the relationships among sampling size, sample strategies, and performance specifications (or acceptance/rejection criterion). These include the following:

Application of Sampling Strategies for Hot-Mix Asphalt Infrastructure: Quality Control-Quality Assurance Sampling; Specification for Performance Test Requirements


(1) A QC/QA case study is used to illustrate a methodology to determine strategies for a sampling scheme and selection of sample size for QC/QA for HMA construction to ensure that the acceptable level of a mix parameter is obtained with the same risk to the contractor and the agency. A sampling scheme and sampling size based on statistical simulation of a fixed length of a one-lane-width placement of HMA are discussed. Sample size is based on the combination of the sample size of the contractor and that of the agency to balance the risk to both organizations, which will result in a mix that meets the minimum performance requirement. An example is presented for the placement of 15,000 tons of HMA according to the California Department of Transportation (Caltrans) QC/QA requirements. For this total tonnage, the contractor and agency are assumed to perform a specific number of performance tests using the California stabilometer methodology for QC and QA.

(2) A QA case study is used to illustrate the application of uniform design (UD) as a sampling strategy to ensure that the most representative sampling scheme can be achieved with a specified sample size. A sampling scheme using uniform design and sampling size through statistical simulation of a fixed length of a two-lane-width placement of HMA with several segregation data patterns is discussed. Based on the simulation, a QA guideline that inspects the accuracy of the sample mean and the precision of the sample standard deviation, combined with the application of the UD table, is proposed and verified with two full-scale pavement sections using measured air-void contents (a measure of degree of compaction).

## **2. Case I: quality control-quality assurance sampling strategies for hot-mix asphalt construction**

The effects of sampling strategies and size on statistically based decision making in hot-mix asphalt (HMA) construction are presented. For sample sizes agreed upon by the agency and the contractor, an acceptable level for an HMA mix parameter is determined with risk balanced between the two organizations. With increased emphasis on specific performance requirements, the use of performance tests on HMA specimens prepared from in situ samples is developing. Examples include direct measures of mix stiffness and permanent deformation characteristics. A measure of rutting resistance, the stabilometer S-value, is used by the California Department of Transportation (Caltrans) for quality control-quality assurance (QC/QA) projects. Although the S-value was used for this simulation because extensive tests were available, this approach is applicable to any performance measures already in use, such as HMA thickness or compacted air-void content. A sampling scheme and sampling size through statistical simulation of a fixed length of a one-lane-width placement of HMA are discussed. Sample size is based on the combination of the sample size of the contractor and that of the agency to balance the risk to both organizations and results in a mix that meets the minimum performance requirement.

#### **2.1 Hypothesis testing of inequality**

The acceptance or rejection of the null hypothesis, *H*0, is referred to as a decision. Therefore, a correct decision is made in situations where (1) *H*0 is correctly accepted if *H*0 is true and (2) *H*0 is correctly rejected if *H*0 is not true. As shown in the following table (the columns give the truth about the population), for a decision based on a sample, when the null hypothesis is valid, the probability α of erroneously rejecting it is designated as the Type I error (or seller's risk), i.e., α = *P*{Type I error} = *P*{reject *H*0 | *H*0 is true}; when the null hypothesis is not true, the probability β of erroneously accepting it is named the Type II error (or buyer's risk), i.e., β = *P*{Type II error} = *P*{fail to reject *H*0 | *H*0 is false}.

| Decision | *H*0 True | *H*0 Not True |
|---|---|---|
| Reject *H*0 | Type I Error (α) | Correct Decision |
| Accept *H*0 | Correct Decision | Type II Error (β) |


The power is defined as the probability 1 – β of correctly rejecting *H*0 if *H*0 is not true, i.e., 1 – β = *P*{reject *H*0 | *H*0 is false}. Hence, from the viewpoint of the agency (the buyer), it is necessary to have the power as high as possible; likewise, from the perspective of the contractor (the seller), the Type I error should be as small as possible.
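The seller's risk can be checked empirically. The following Monte Carlo sketch (the sample size, specification limit, and seed are illustrative assumptions; the critical value $t_{0.95,19} \approx 1.729$ is tabulated) draws repeated samples from a population that exactly meets the specification and confirms that the one-sided *t* test wrongly rejects *H*0 about α = 5% of the time:

```python
import random
from statistics import mean, stdev

# Monte Carlo check of the seller's risk alpha: sample repeatedly from a
# population that exactly meets the specification (H0 true, mu = C_S) and
# count how often the one-sided t test wrongly rejects H0.
C_S, SIGMA, N = 37.0, 6.6, 20           # illustrative values
T_CRIT = 1.729                           # tabulated t_{0.95, df=19}
random.seed(42)

trials, rejections = 20_000, 0
for _ in range(trials):
    x = [random.gauss(C_S, SIGMA) for _ in range(N)]
    t = (mean(x) - C_S) / (stdev(x) / N ** 0.5)
    if t < -T_CRIT:                      # critical region of the test
        rejections += 1

print(round(rejections / trials, 3))     # close to alpha = 0.05
```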

#### **2.1.1 Testing inequality μ ≥ Cs and size of test α**



The objective is to test the null hypothesis $H_0: \mu \ge C_S$ from the viewpoint of the contractor. The relevant *t* statistic is given by $t = (\hat{\mu} - C_S)/SE(\hat{\mu})$, where $\hat{\mu}$ is the sample mean of the stabilometer tests, $C_S$ is the minimum specification limit for the stabilometer test, and $SE(\hat{\mu})$ the standard error. The critical region for the *t* test of size α of the null hypothesis is given by $t \le -t_{1-\alpha,\,n-p}$, where $n = n_1 + n_2 + \dots + n_p$ and *p* is the number of laboratories. In other words, the *t* statistic lies in the acceptance region $t > -t_{1-\alpha,\,n-p}$ if and only if

$$\left(\hat{\mu} - C_S\right)/SE(\hat{\mu}) > -t_{1-\alpha,\,n-p} \Rightarrow \hat{\mu} > C_S - t_{1-\alpha,\,n-p}\,SE(\hat{\mu}) \tag{1}$$

Note that the critical region for the *t* test of size α = 0.05 of the null hypothesis $H_0: \mu \ge C_S$ can be given by $t \le -t_{1-\alpha,\,n-p} \to -\Phi^{-1}(0.95) = -1.64485$ as $n-p \to \infty$, where Φ is the distribution function of a standard normal distribution. The size of test α = 0.05 represents that at most a 5% chance is allowed to erroneously reject a valid null hypothesis; that is, there is a 95% chance that *H*0 is accepted if *H*0 is valid.
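The acceptance rule of Equation 1 can be sketched as follows; the helper name and the sample values are hypothetical, and the large-sample critical value $-z_{1-\alpha}$ stands in for $-t_{1-\alpha,\,n-p}$:

```python
from statistics import NormalDist, mean, stdev

def accepts_h0(samples: list[float], c_s: float, alpha: float = 0.05) -> bool:
    """Acceptance rule of Eq. 1: accept H0: mu >= c_s when the t statistic
    lies above the critical value; -z_{1-alpha} is used as the large-sample
    stand-in for -t_{1-alpha, n-p} (single laboratory)."""
    n = len(samples)
    se = stdev(samples) / n ** 0.5                # standard error of the mean
    t = (mean(samples) - c_s) / se
    return t > -NormalDist().inv_cdf(1 - alpha)   # acceptance region

# Hypothetical stabilometer S-values tested against the spec limit C_S = 37:
print(accepts_h0([41.0, 39.5, 44.2, 38.1, 40.3], 37.0))
print(accepts_h0([30.1, 31.5, 29.8, 32.0, 30.6], 37.0))
```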

#### **2.1.2 Test power, sample size, and operating-characteristic curve**

Suppose that the hypothesis is not true, that is, $\mu < C_S$ (the opposite of $\mu \ge C_S$). Then the power, as shown by Stone (Stone, 1996), is:

$$\begin{aligned} 1 - \beta &\equiv P\left(\frac{\hat{\mu} - \mathcal{C}\_S}{SD\left(\hat{\mu}\right)} \le -z\_{1-\alpha}\right) \\ &= P\left(\frac{\hat{\mu} - \mu}{SD\left(\hat{\mu}\right)} \le -z\_{1-\alpha} + \frac{\mathcal{C}\_S - \mu}{SD\left(\hat{\mu}\right)}\right) \\ &= \Phi\left(-z\_{1-\alpha} + \delta\right) \end{aligned}$$

where $\delta = \left(C_S - \mu\right)/SD(\hat{\mu})$ and $z_{1-\alpha}$ is a quantile of a standard normal distribution. For the specified α and β levels under the null hypothesis $H_0: \mu \ge C_S$, Table 1 lists the $SD(\hat{\mu})$, test power, and required sample sizes for the case with the agency and the contractor.

The test power equation shown in Table 1 indicates that the power of testing a null hypothesis is actually a standard normal distribution function in terms of the test of size α, *d* ($d = \left|\mu - C_S\right|/S_p$), and the number of tests. Figure 1 plots power versus *d* with α = 0.05 at various numbers of tests, designated as the operating-characteristic curves. Several observations can be addressed in the following:

1. With the same number of tests and power level, increasing α will decrease the value of *d*; alternatively, at the same number of tests and a fixed value of *d*, increasing α will increase the power.
2. At fixed levels of α and power, increasing the number of tests will reduce the value of *d*.
3. For *n* = 4, to ensure that the test power is greater than 0.95, $d \ge 1.645$, i.e., $d = \left|\hat{\mu} - C_S\right|/S_S > d_{0.95}$. In other words, if the sample mean of tests $\hat{\mu}$ is either $\hat{\mu} > C_S + d_{0.95}\cdot S_S$ or $\hat{\mu} < C_S - d_{0.95}\cdot S_S$, then the agency has enough power to confidently accept or reject the null hypothesis $H_0: \mu \ge C_S$. If $\hat{\mu}$ lies in the range $\left(C_S - d_{0.95}\cdot S_S,\ C_S + d_{0.95}\cdot S_S\right)$, then the agency does not have enough power with *n* = 4, and the number of tests has to be increased to reach the same level of power.
4. The test power approaches the test of size α as $d \to 0$.

Fig. 1. Operating-characteristic curves with α = 0.05.
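The *d* thresholds marked on the operating-characteristic curves can be recomputed directly from the power relation; a minimal sketch (the function name is an assumption):

```python
from statistics import NormalDist

nd = NormalDist()

def d_required(n: int, alpha: float = 0.05, power: float = 0.95) -> float:
    """Smallest d = |mu - C_S| / S_S at which n tests reach the target
    power, solved from 1 - beta = Phi(-z_{1-alpha} + d * sqrt(n))."""
    return (nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)) / n ** 0.5

# Thresholds for power 0.95 at n = 1, 2, 4, and 30 tests
# (cf. the 3.290 / 2.326 / 1.645 / 0.601 markings in Figure 1):
print([round(d_required(n), 3) for n in (1, 2, 4, 30)])
```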


**The agency and the contractor:** $n_1 = k\,n_2$, $0 < k \le 1$ ($n_1$: number of QA samples; $n_2$: number of QC samples.)

| | |
|---|---|
| $SD(\hat{\mu})$ | $\dfrac{S_p}{2}\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}} = \dfrac{S_p}{2}\sqrt{\dfrac{1+k}{k\,n_2}}$ |
| Test power | $1-\beta = \Phi\left(-z_{1-\alpha} + \delta\right)$ |
| Sample size | $n_2 \ge \dfrac{(1+k)\left(z_{1-\alpha}+z_{1-\beta}\right)^2}{4\,k\,d^2}$ |
| Minimum requirement of contractor | $\hat{\mu} \ge C_S - t_{1-\alpha,\,n-p}\cdot SE(\hat{\mu})$ |
| Upper and lower bounds of agency | $\hat{\mu} \ge C_S + \left(z_{1-\alpha}+z_{1-\beta}\right)\cdot SD(\hat{\mu})$ (upper bound); $\hat{\mu} \le C_S - \left(z_{1-\alpha}+z_{1-\beta}\right)\cdot SD(\hat{\mu})$ (lower bound) |

Note: The pooled sample variance, $S_p^2$, is defined as

$$S_p^2 = \frac{\sum_{i=1}^{n_1}\left(x_{1,i}-\bar{x}_1\right)^2 + \sum_{i=1}^{n_2}\left(x_{2,i}-\bar{x}_2\right)^2 + \dots + \sum_{i=1}^{n_p}\left(x_{p,i}-\bar{x}_p\right)^2}{n_1+n_2+\dots+n_p-p}; \text{ if } p=1, \text{ then } S_p = S_S.$$

$z_{1-\alpha}$ and $z_{1-\beta}$ are quantiles of a standard normal distribution; Φ is the distribution function of a standard normal distribution.

$$d = \frac{\left|\mu - C_S\right|}{S_p}; \quad \mu = \frac{\mu_1+\mu_2}{2}$$

Table 1. Test power, required sample size, minimum requirement of contractor, and upper and lower bounds of agency.
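The required QC sample size follows from the power requirement $\delta \ge z_{1-\alpha}+z_{1-\beta}$ with $\delta = 2d\sqrt{k\,n_2/(1+k)}$. A sketch (the function name is an assumption, and the form $SD(\hat{\mu}) = \tfrac{S_p}{2}\sqrt{(1+k)/(k\,n_2)}$ for the average of the QA and QC sample means is inferred from the demonstration example's numbers):

```python
import math
from statistics import NormalDist

nd = NormalDist()

def required_n2(d: float, k: float, alpha: float, power: float) -> int:
    """Smallest QC sample size n2 (with n1 = k * n2 QA samples) meeting the
    agency's power requirement, from 2 * d * sqrt(k * n2 / (1 + k))
    >= z_{1-alpha} + z_{1-beta}."""
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    return math.ceil((1 + k) * z * z / (4 * k * d * d))

# d = 0.902 is the threshold cited for n2 = 20, k = 0.2, alpha = 5%, power 0.95:
print(required_n2(0.902, 0.2, 0.05, 0.95))  # -> 20
```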

#### **2.1.3 Size of test α and power 1 − β**


For the contractor, under the null hypothesis $H_0: \mu \ge C_S$, the acceptance region for the *t* test of size α is given by Equation 1, that is, $\hat{\mu} > C_S - t_{1-\alpha,\,n-p}\cdot SE(\hat{\mu})$.

For the agency, as noted earlier, the power of a test under the null hypothesis is given by

$$1 - \beta \equiv \Phi(-z\_{1-\alpha} + \delta) \implies z\_{1-\beta} \equiv -z\_{1-\alpha} + \delta$$

where $\delta = \left(C_S - \mu\right)/SD(\hat{\mu}) \approx \left(C_S - \hat{\mu}\right)/SE(\hat{\mu})$; $z_{1-\alpha}$ and $z_{1-\beta}$ are quantiles of a standard normal distribution.

Therefore, to satisfy the power requirement of the agency, $\delta \ge z_{1-\alpha} + z_{1-\beta}$, i.e.,


$$\frac{\left|\hat{\mu} - C_S\right|}{SE(\hat{\mu})} \ge z_{1-\alpha} + z_{1-\beta}$$

$$\Rightarrow\quad \hat{\mu} \ge C_S + \left(z_{1-\alpha} + z_{1-\beta}\right)\cdot SE\left(\hat{\mu}\right)\text{, or } \hat{\mu} \le C_S - \left(z_{1-\alpha} + z_{1-\beta}\right)\cdot SE\left(\hat{\mu}\right) \tag{2}$$

The $C_S + \left(z_{1-\alpha}+z_{1-\beta}\right)\cdot SE(\hat{\mu})$ will be designated as the upper bound and $C_S - \left(z_{1-\alpha}+z_{1-\beta}\right)\cdot SE(\hat{\mu})$ the lower bound of $1-\beta$ power.

It should be noted that (1) if $\beta = 0.5$, then $z_{1-\beta} = 0$, and (2) $z_{1-\alpha} \approx t_{1-\alpha,\,n-p}$ as $n-p \to \infty$.

Thus, Equation 1 is equivalent to the lower bound of Equation 2. Based on Equation 1, the minimum requirement of the contractor, and Equation 2, the upper and lower bounds of power requirement of the agency, the case of the agency and the contractor is defined in Table 1.

Figure 2 illustrates plots of the upper and lower bounds at various power levels of the agency and the minimum requirement of the contractor under $H_0: \mu \ge 37$ in terms of $\hat{\mu}$ and sample size, $n_2$. The minimum requirements of the contractor in Figure 2 are plotted based on both the *t*-distribution and the standard normal distribution. It will be noted that the two curves coincide after $n_2 \ge 10$. From Table 1 and Figure 2, two observations can be made:

1. It is very important to recognize that the minimum requirement of the contractor actually matches the lower bound of 0.5 power of the agency.
2. The distance enclosed by the upper and lower bounds at a specified power level decreases with smaller $S_P$, larger α, larger *k* ($0 < k \le 1$), and, more importantly, larger sample size.


Fig. 2. Minimum stability requirements of the contractor and power requirement of the agency under the same null hypothesis.

## **2.2 QC/QA demonstration example**

In this demonstration example 15,000 tons of HMA will be placed on 20 sublots (750 tons per sublot). The contractor is required to conduct 20 tests ( $n_2$ ), i.e., one test per sublot. The number of tests conducted by the agency ( $n_1 = k\,n_2$ ) will include the minimum required by the agency according to Caltrans specifications, i.e., *k* = 0.1 (2 tests in this case); in addition, determinations will be made for four tests (*k* = 0.2), six tests (*k* = 0.3), and eight tests (*k* = 0.4). The minimum stabilometer S-value has been set at 37 (Type A HMA) (California Department of Transportation [CALTRANS], 2007*),* and a pooled standard deviation *SP* of 6.6 is used for the S-value for tests between two laboratories (Paul Benson, private communication transmitting analyses of stabilometer test results for the periods 1967–1970 and 1995–1999). The demonstration example will include sampling consistency between QC and QA, sampling stabilization of *SP*, and minimum requirements for both the agency and the contractor. To conduct the sampling size simulation, several assumptions were made:

1. Lane width: 12 ft (3.66 m),
2. Unit weight of HMA – 145 lb/ft3 (2,323 kg/m3),
3. HMA layer thickness – 8 in. (20 cm), and
4. One stability sample is represented by a 4 × 4-in (10 × 10-cm) square, with each square assigned a normalized stability value.


For these assumptions, the 15,000 tons of HMA will produce a section ~26,000 ft (7,925 m) long and 12 ft (3.66 m) wide. This results in a *N*(0,1) stability population of 12 x 3 x 26,000 x 3 = 2,808,000 data points to generate three types of data patterns as schematically shown in Figure 3: (1) random pattern, (2) transverse strip pattern with 40 vertical strips, and (3) longitudinal strip pattern with 6 horizontal strips. The *N(0,1)* distribution is separated by the points of quantiles into several intervals, e.g., 6 intervals for the transverse strip pattern or 4 intervals for the longitudinal strip pattern as shown in Figure 3. These intervals are then permuted to vary randomly across the *x*-direction or the *y*-direction of a lane of HMA paving. Those points within the interval are also randomly distributed over the transverse strip or the longitudinal strip.

The sampling scheme used is illustrated in Figure 4 with cases of *M × N* cells (*N* [*y*-direction] = 1; *M* [*x*-direction] = 10, 20, 30, 40, 50, 100, 200, and 500); that is, one random QC sample from each cell and one random QA sample from one random cell of $n_1 = k\,n_2$ random transverse strips. A total of 8 cases were simulated over the three data patterns. Each case, per data pattern, was simulated 200 times.

Verifying the minimum sampling size for an HMA paving strip requires showing (1) no apparent difference of sampling consistency between the contractor (QC) and the agency (QA) and (2) stabilization of the pooled sample estimate of the standard deviation of the stability value, *SP* (Tsai & Monismith, 2009).

In each sampling simulation, the normalized stability values form a distribution with a mean and standard deviation; hence, when repeated 200 times, the standard deviations will form another distribution. For each case, the standard deviations of the standard deviation distributions (SDSD) were calculated for QC and QA respectively. The difference of SDSD between QA and QC was used as an index to represent the sampling consistency between the agency (QA) and the contractor (QC).

Likewise, for each simulation, the *Sp* was calculated based on the equation in Table 1; hence, when repeated 200 times, the standard deviation of the *Sp* distribution will be used to inspect its stability over the *M × N* domain.
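The consistency check can be sketched as a small Monte Carlo in the spirit of the simulation described above (the cell counts, cell size, and seed are illustrative assumptions, not the chapter's 2,808,000-point population):

```python
import random
from statistics import stdev

random.seed(7)

def simulate_sdsd(m_cells: int, k: float, repeats: int = 200) -> tuple[float, float]:
    """Sketch of the sampling-consistency simulation: a strip of m_cells
    transverse cells, each holding 50 N(0,1) stability values; the contractor
    draws one QC sample per cell, the agency one QA sample from each of
    round(k * m_cells) randomly chosen cells. Returns the standard deviation
    of the repeated sample standard deviations (SDSD) for QC and QA."""
    qc_sds, qa_sds = [], []
    for _ in range(repeats):
        field = [[random.gauss(0, 1) for _ in range(50)] for _ in range(m_cells)]
        qc = [random.choice(cell) for cell in field]
        qa_cells = random.sample(range(m_cells), max(2, round(k * m_cells)))
        qa = [random.choice(field[i]) for i in qa_cells]
        qc_sds.append(stdev(qc))
        qa_sds.append(stdev(qa))
    return stdev(qc_sds), stdev(qa_sds)

qc_sdsd, qa_sdsd = simulate_sdsd(m_cells=20, k=0.2)
print(round(abs(qa_sdsd - qc_sdsd), 3))  # consistency index: smaller is better
```

With so few QA samples (k = 0.2 gives 4 per repetition) the QA standard deviations scatter far more than the QC ones, which is why the index shrinks as *k* grows.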


Fig. 5. Summary of simulation results: (a) sampling consistency; (b) sampling stabilization; and (c) relationship between *k* and $\mu_{\min}$.

Fig. 6. Examples of operating-characteristic curves and μmin required to meet the agency's power requirement and the contractor's minimum requirement: (a) α = 5% and (b) α = 10%.


Fig. 3. Schematic illustration of three data patterns: (a) random pattern, (b) transverse strip pattern, and (c) longitudinal strip pattern.

Figure 5a illustrates the simulation results for sampling consistency between QC and QA at various *k* values in terms of globally smoothed lines over the three different data patterns. As would be expected, the sampling consistency between QC and QA increases as the *k* value increases. Figure 5b indicates that sampling stabilization of *SP* depends only on the contractor's sampling size, $n_2$, rather than the *k* value.

From a series of operating-characteristic curves for the four *k* values and two α values (5% and 10%), the values in Table 2 were determined for the required minimum mean, termed μmin. With Figure 6a as an example, under the condition that α = 5%, *n*2 = 20, *k* = 0.2, and power = 0.95, *d* has to be greater than 0.902 to satisfy the agency's power requirement; that is, the estimated mean has to be greater than 42.95 so that the agency has power 0.95 to clearly accept the contractor's mix. Figure 6b shows that a smaller *d* (0.803) is obtained when the α value is increased to 10%.
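The acceptance threshold here follows directly from the relation μmin = *CS* + *d*·*SP* given in the notes to Table 2, with *CS* = 37 and *SP* = 6.6. A minimal sketch (the function name and defaults are mine):

```python
# Acceptance threshold from the Table 2 notes: mu_min = C_S + d * S_P,
# where C_S = 37 is the minimum stabilometer requirement and S_P = 6.6.
# (The function name and defaults are my own labels.)
def mu_min(d, c_s=37.0, s_p=6.6):
    """Minimum acceptable mean for a standardized offset d."""
    return c_s + d * s_p

# Values quoted in the text for n2 = 20, k = 0.2:
assert abs(mu_min(0.902) - 42.95) < 0.02  # alpha = 5%,  power = 0.95
assert abs(mu_min(0.803) - 42.29) < 0.02  # alpha = 10%, power = 0.95
```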


Fig. 4. Sampling scheme: the agency QA sample locations (*n*1 = *k* × *n*2) and the contractor QC sample locations (*n*2).


Fig. 5. Summary of simulation results: (a) sampling consistency; (b) sampling stabilization; and (c) relationship between *k* and μmin.

Fig. 6. Examples of operating-characteristic curves and μmin required to meet the agency's power requirement and the contractor's minimum requirement: (a) α = 5% and (b) α = 10%.


Figure 5c illustrates the relationship between *k* and μmin. It is apparent that an increase of the *k* value reduces μmin. It is interesting to observe that the curve for α = 5% and power = 0.90 is exactly the same as the curve for α = 10% and power = 0.95. Figure 5c also shows that a higher μmin criterion is needed if both the agency and the contractor require a high power level and a low α level, whereas if both require a low power level and a high α level, the μmin criterion can be much smaller.


Note:

Null hypothesis: *H*0: μ ≥ 37.

*n*1 = *k*·*n*2 (0 < *k* ≤ 1), where *n*1 is the number of tests of the agency and *n*2 is the number of tests of the contractor.

*d* = (μ − *CS*)/*SP*, where μ = (μ1 + μ2)/2; μ1 is the average stabilometer value from the agency; μ2 is the average stabilometer value from the contractor; *CS* = 37; *SP* = 6.6. μmin = *CS* + *d*·*SP*.

Table 2. Acceptance μmin values and target hypotheses for the contractor with combinations of various α levels, power levels, and *k* values.

## **3. Case II: HMA sampling strategies using uniform experimental design for quality assurance**

The application of using uniform design (UD) as a sampling strategy for quality assurance (QA) ensures that the most representative and unbiased sampling scheme can be achieved with the sample size based on an acceptable error level of a hot-mix asphalt (HMA) parameter specified by the agency. Through statistical simulations and demonstration of airvoid measurements of two field pavement sections, a QA guideline combined with the UD sampling scheme was developed to justify construction quality using the sample mean and sample standard deviation criteria. This approach can also be applied to any performance measure already in use.

### **3.1 Uniform experimental design**

| α (Agency) | Power (1−β) | *k* | *d* | Acceptance μmin | Target hypothesis (Contractor) |
|------------|-------------|-----|-----|-----------------|--------------------------------|
| 5% | 0.90 | 0.1 | 1.085 | 44.16 | *H*0: μ ≥ 46.41 |
| | | 0.2 | 0.802 | 42.29 | *H*0: μ ≥ 43.95 |
| | | 0.3 | 0.680 | 41.49 | *H*0: μ ≥ 42.91 |
| | | 0.4 | 0.613 | 41.05 | *H*0: μ ≥ 42.31 |
| 5% | 0.95 | 0.1 | 1.220 | 45.05 | *H*0: μ ≥ 47.30 |
| | | 0.2 | 0.902 | 42.95 | *H*0: μ ≥ 44.61 |
| | | 0.3 | 0.766 | 42.06 | *H*0: μ ≥ 43.46 |
| | | 0.4 | 0.688 | 41.54 | *H*0: μ ≥ 42.81 |
| 10% | 0.90 | 0.1 | 0.951 | 43.27 | *H*0: μ ≥ 48.19 |
| | | 0.2 | 0.703 | 41.63 | *H*0: μ ≥ 45.26 |
| | | 0.3 | 0.597 | 40.94 | *H*0: μ ≥ 44.02 |
| | | 0.4 | 0.537 | 40.54 | *H*0: μ ≥ 43.31 |
| 10% | 0.95 | 0.1 | 1.086 | 44.16 | *H*0: μ ≥ 49.07 |
| | | 0.2 | 0.803 | 42.29 | *H*0: μ ≥ 45.92 |
| | | 0.3 | 0.682 | 41.50 | *H*0: μ ≥ 44.58 |
| | | 0.4 | 0.613 | 41.05 | *H*0: μ ≥ 43.81 |


Statisticians have developed a variety of experimental design methods for different purposes, with the expectation that use of these methods will result in increased yields from experiments, quality improvements, and reduced development time or overall costs. Popular experimental design methods include full factorial designs, fractional factorial designs, block designs, orthogonal arrays, Latin squares, supersaturated designs, etc. One relatively new design method is called Uniform Design (UD). Since it was proposed by Fang and Wang in the 1980s (Fang, 1980; Fang et al., 2000; Wang & Fang, 1981), UD has been successfully used in various fields, such as chemistry and chemical engineering, quality and system engineering, computer sciences, survey design, pharmaceuticals, and natural sciences.

Generally speaking, uniform design is a space-filling experimental design that allocates experimental points uniformly scattered in the domain. The fundamental concept of UD is to choose the set of experimental points with the smallest discrepancy among all possible designs for a given number of factors and experimental runs.

Suppose that there are *s* factors in an experiment. Without loss of generality we can assume that the experimental domain is the unit cube $C^s = [0,1]^s$ (after making a suitable linear transformation). The aim is to choose a set of *n* experimental points $P = \{x_1, \ldots, x_n\} \subset C^s$ that is uniformly scattered on $C^s$. Let *M* be a measure of uniformity of *P* such that a smaller *M* corresponds to better uniformity, and let $Z(n,s)$ be the set of sets of *n* points on $C^s$. A set $P^* \in Z(n,s)$ is called a uniform design if it has the minimum *M*-value over $Z(n,s)$, i.e., $M(P^*) = \min_{P \in Z(n,s)} M(P)$.

Many different measures of uniformity have been defined. The centered *L*2 discrepancy (*CD*2) is regarded as one of the most commonly used measures in constructing the UD tables, because *CD*2 considers not only the uniformity of *P* over $C^s$ but also the projection uniformity of *P* over every $C^u$, the *u*-dimensional unit cube involving the coordinates in *u*, where $P_u$ is the projection of *P* on $C^u$. Hickernell gave an analytical expression of *CD*2 as follows (Fang & Lin, 2003):

$$CD\_2(P) = \left[ \left(\frac{13}{12}\right)^{s} - \frac{2}{n} \sum\_{k=1}^{n} \prod\_{j=1}^{s} \left(1 + \frac{1}{2}\left|x\_{kj} - 0.5\right| - \frac{1}{2}\left|x\_{kj} - 0.5\right|^{2}\right) + \frac{1}{n^{2}} \sum\_{k=1}^{n} \sum\_{j=1}^{n} \prod\_{i=1}^{s} \left(1 + \frac{1}{2}\left|x\_{ki} - 0.5\right| + \frac{1}{2}\left|x\_{ji} - 0.5\right| - \frac{1}{2}\left|x\_{ki} - x\_{ji}\right|\right) \right]^{\frac{1}{2}}$$

| *k* | Sample size<sup>1</sup> | Mean, ±*E* (two-sided)<sup>2</sup> | *s*, lower (two-sided)<sup>3</sup> | *s*, upper (two-sided)<sup>3</sup> | *s*, upper (one-sided)<sup>4</sup> |
|-----|------|--------|--------|--------|--------|
| 1.0 | 4 | 0.9800 | 0.2682 | 1.7653 | 1.6140 |
| 0.9 | 5 | 0.8765 | 0.3480 | 1.6691 | 1.5401 |
| 0.8 | 7 | 0.7408 | 0.4541 | 1.5518 | 1.4487 |
| 0.7 | 8 | 0.6930 | 0.4913 | 1.5125 | 1.4176 |
| 0.6 | 11 | 0.5910 | 0.5698 | 1.4312 | 1.3530 |
| 0.5 | 16 | 0.4900 | 0.6461 | 1.3537 | 1.2909 |
| 0.4 | 25 | 0.3920 | 0.7188 | 1.2807 | 1.2318 |
| 0.3 | 43 | 0.2989 | 0.7868 | 1.2128 | 1.1764 |
| 0.2 | 97 | 0.1990 | 0.8587 | 1.1411 | 1.1174 |
| 0.1 | 385 | 0.0999 | 0.9293 | 1.0707 | 1.0591 |
| 0.62 | 10 | 0.6198 | 0.5478 | 1.4538 | 1.3711 |
| 0.44 | 20 | 0.4383 | 0.6847 | 1.3149 | 1.2596 |
| 0.36 | 30 | 0.3578 | 0.7439 | 1.2556 | 1.2114 |
| 0.31 | 40 | 0.3099 | 0.7788 | 1.2208 | 1.1829 |
| 0.28 | 50 | 0.2772 | 0.8025 | 1.1971 | 1.1636 |
| 0.25 | 60 | 0.2530 | 0.8199 | 1.1798 | 1.1493 |

where *xk*=(*xk1*,…,*xks*) is the *k*-th experimental point, *s* is the number of factors in an experiment, *n* is the number of runs.
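The expression above is straightforward to evaluate directly. A self-contained sketch (the helper name and the two tiny test designs are mine):

```python
import math

# Direct evaluation of the centered L2-discrepancy formula above.
def cd2(points):
    """Centered L2-discrepancy CD2 of a design: list of s-dim points in [0,1]^s."""
    n, s = len(points), len(points[0])
    term1 = (13.0 / 12.0) ** s
    term2 = 0.0
    for xk in points:
        prod = 1.0
        for xkj in xk:
            a = abs(xkj - 0.5)
            prod *= 1.0 + 0.5 * a - 0.5 * a * a
        term2 += prod
    term2 *= 2.0 / n
    term3 = 0.0
    for xk in points:
        for xj in points:
            prod = 1.0
            for xki, xji in zip(xk, xj):
                prod *= (1.0 + 0.5 * abs(xki - 0.5) + 0.5 * abs(xji - 0.5)
                         - 0.5 * abs(xki - xji))
            term3 += prod
    term3 /= n * n
    return math.sqrt(term1 - term2 + term3)

# An evenly spread one-factor design has a lower discrepancy than a clustered one:
uniform = [[0.125], [0.375], [0.625], [0.875]]
clustered = [[0.10], [0.11], [0.12], [0.13]]
assert cd2(uniform) < cd2(clustered)
```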

One of the most noteworthy advantages of the uniform design is that it allows an experimental strategy to be conducted in a relatively small number of runs. It is very useful when the numbers of factor levels are large, especially in situations in which the number of runs is strictly limited and factorial designs or orthogonal arrays cannot be realized in practice.

Given that the strength of uniform design is that it provides a series of uniformly scattered experiment points over the domain, this homogeneity in two factors physically corresponds to the spatial uniformity of sampling from a pavement section in the *x* and *y* directions. The application of uniform design therefore generates a sampling scheme as a UD table consisting of pairs of (*x*, *y*) coordinates.

### **3.2 Fundamental statistics**

If $\bar{x}$ is the sample mean of a random sample of size *n* from a normal population, $X \sim N(\mu, \sigma^2)$, then $Z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ has a standard normal distribution. A 100(1 − α)% confidence interval (CI) can be defined as (Figure 7a)

$$p\left(-z\_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z\_{\alpha/2}\right) = 1 - \alpha$$

Hence, if μ and σ are specified, a 100(1 − α)% confidence interval on $\bar{x}$ can then be given by

$$
\mu - z\_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + z\_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \tag{3}
$$

It can be assumed that the error $E = \left|\bar{x} - \mu\right|$ is equivalent to $z\_{\alpha/2} \cdot \sigma/\sqrt{n}$ (Figure 7b). Then the required sample size will be

$$n = \left(\frac{z\_{\alpha/2} \cdot \sigma}{E}\right)^2 \tag{4}$$

That is to say, if $\bar{x}$ is used as an estimate of μ, we can be 100(1 − α)% confident that the error $\left|\bar{x} - \mu\right|$ will not exceed a specified amount *E* when the sample size is $n = (z\_{\alpha/2}\sigma/E)^2$ (Montgomery & Runger, 2010). If the specified error level is selected as a fraction of the standard deviation of the $N(\mu, \sigma^2)$ distribution, i.e., $E = k\sigma$ with $k > 0$, then Equation 4 simplifies to $n = (z\_{\alpha/2}/k)^2$. It should be noted that $z\_{\alpha/2} = 1.9600$ if α = 0.05 and $z\_{\alpha/2} = 1.6449$ if α = 0.10.
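Equation 4 with $E = k\sigma$ reproduces the sample sizes listed in Table 3. A minimal sketch (the function name is mine; sizes are rounded up as in the table):

```python
import math

Z_975 = 1.9600  # z_{alpha/2} for alpha = 0.05
Z_950 = 1.6449  # z_{alpha/2} for alpha = 0.10

def sample_size(k, z=Z_975):
    """Smallest n such that the error E = k*sigma: n = ceil((z/k)^2), Equation 4."""
    return math.ceil((z / k) ** 2)

# Reproduces the alpha = 0.05 sample sizes listed in Table 3:
assert sample_size(1.0) == 4
assert sample_size(0.5) == 16
assert sample_size(0.2) == 97
assert sample_size(0.1) == 385
```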

The same argument for the sample mean can also be applied to the sample standard deviation *s*. Let $X_1, X_2, \ldots, X_n$ be a random sample of size *n* from a normal distribution $N(\mu, \sigma^2)$, and let $s^2$ be the sample variance. Then the random variable $X^2 = (n-1)s^2/\sigma^2$ has a chi-square ($\chi^2$) distribution with $n - 1$ degrees of freedom. As shown in Figure 7c, we may write

$$p\left(\chi^2\_{\alpha/2,\,n-1} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2\_{1-\alpha/2,\,n-1}\right) = 1 - \alpha$$
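The chi-square property is easy to confirm by simulation: $(n-1)s^2/\sigma^2$ should have mean $n-1$ and variance $2(n-1)$. A sketch (sample size, seed, and tolerances are my illustrative choices):

```python
import random
import statistics

random.seed(12345)

# Monte-Carlo check: (n-1)s^2/sigma^2 ~ chi-square with n-1 degrees of
# freedom, so its mean is n-1 and its variance is 2(n-1).
n = 5
sigma = 1.0
draws = []
for _ in range(5000):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    draws.append((n - 1) * statistics.variance(x) / sigma**2)

m = statistics.mean(draws)
v = statistics.variance(draws)
assert abs(m - (n - 1)) < 0.3      # mean close to 4
assert abs(v - 2 * (n - 1)) < 1.5  # variance close to 8
```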



Note:

<sup>1</sup> Sample size is calculated by $n = (z\_{\alpha/2} \cdot \sigma / E)^2 = (z\_{\alpha/2}/k)^2$.

<sup>2</sup> The two-sided 100(1-α)% confidence interval of the sample mean is calculated by

$$
\mu - z\_{\alpha/2} \cdot \sigma/\sqrt{n} \le \bar{x} \le \mu + z\_{\alpha/2} \cdot \sigma/\sqrt{n}
$$

<sup>3</sup> The two-sided 100(1-α)% confidence interval of the sample standard deviation is calculated by

$$\sqrt{\chi^2\_{\alpha/2,\,n-1}/(n-1)} \cdot \sigma \le s \le \sqrt{\chi^2\_{1-\alpha/2,\,n-1}/(n-1)} \cdot \sigma$$

<sup>4</sup> The one-sided 100(1-α)% confidence interval of the sample standard deviation is calculated by

$$s \le \sqrt{\chi^2\_{1-\alpha,\,n-1}/(n-1)} \cdot \sigma$$

Table 3. Summary of 95% confidence intervals of sample mean and sample standard deviation at various error levels and sample sizes for a *N*(0, 1) distribution.

If $s^2$ is the sample variance from a random sample of *n* observations from a normal distribution with known or specified variance $\sigma^2$, then a two-sided 100(1 − α)% confidence interval on *s* is

$$\sqrt{\frac{\mathbb{X}\_{a f 2, n-1}^2}{n-1}} \cdot \sigma \le s \le \sqrt{\frac{\mathbb{X}\_{1-a f 2, n-1}^2}{n-1}} \cdot \sigma$$

As for the one-sided 100(1 − α)% upper confidence bound shown in Figure 7d, we may write

$$p\left(\frac{(n-1)s^2}{\sigma^2} \le \chi^2\_{1-\alpha, n-1}\right) = 1-\alpha$$


In this approach, it was assumed that the air-void contents on a project can be represented by a standard normal *N*(0, 1) distribution. The data from the *N*(0, 1) distribution were used to generate five data patterns: random pattern, central segregation pattern, bilateral segregation pattern, central-bilateral segregation pattern, and block segregation pattern (Figure 9). The reasons for selecting these pattern types are as follows:

1. Random pattern: non-segregation, with ideal construction quality.
2. Central segregation pattern: the gap between two augers of an asphalt paver makes coarse aggregate concentrated near the center of the paved area.
3. Bilateral segregation pattern: the gap between the auger and the lateral board of the asphalt paver makes coarse aggregate concentrated near the bilateral regions of the paved area, or provides less compaction of the side area.
4. Central-bilateral segregation pattern: a combined situation of patterns 2 and 3.
5. Block segregation pattern: as demonstrated in gradation segregation, temperature segregation, uneven compaction, etc.


then the confidence upper bound on *s* is

$$s \le \sqrt{\frac{\chi^2\_{1-\alpha,\,n-1}}{n-1}} \cdot \sigma \tag{5}$$

Table 3 summarizes the 95% confidence interval of sample mean and sample standard deviation at various error levels and sample sizes. Notice that the sample size listed in Table 3 was rounded to its ceiling value.
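The confidence limits in Table 3 can be reproduced from the formulas above. In the sketch below, the chi-square quantiles for 3 degrees of freedom are constants taken from standard tables, not from the chapter:

```python
import math

# Chi-square quantiles for n - 1 = 3 degrees of freedom; these constants
# come from standard tables (they are not given in the chapter):
CHI2_025_3 = 0.2158  # chi^2_{alpha/2, n-1},   alpha = 0.05
CHI2_975_3 = 9.3484  # chi^2_{1-alpha/2, n-1}
CHI2_950_3 = 7.8147  # chi^2_{1-alpha, n-1}

n = 4
lo = math.sqrt(CHI2_025_3 / (n - 1))         # two-sided lower limit on s
hi = math.sqrt(CHI2_975_3 / (n - 1))         # two-sided upper limit on s
one_sided = math.sqrt(CHI2_950_3 / (n - 1))  # one-sided upper bound on s

# Matches the n = 4 row of Table 3 (sigma = 1): 0.2682, 1.7653, 1.6140
assert abs(lo - 0.2682) < 0.001
assert abs(hi - 1.7653) < 0.001
assert abs(one_sided - 1.6140) < 0.001
```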

Figure 8a plots the sample size versus the specified error $E = \left|\bar{x} - \mu\right|$, expressed as a fraction *k* of the standard deviation, with a 95% confidence interval. The two-sided 95% confidence interval on the sample mean and the one-sided 95% upper confidence bound on the sample standard deviation of a *N*(0, 1) distribution, as functions of sample size, can be illustrated as shown in Figures 8b and 8c, respectively.

Fig. 7. (a) 100(1-α)% confidence interval of *N*(0, 1) distribution, (b) sample size determination with a specified error level, (c) 100(1-α)% two-sided confidence interval of *χ*2 distribution, and (d) 100(1-α)% one-sided confidence interval of *χ*2 distribution.

Fig. 8. (a) Sample size versus fraction of standard deviation, (b) 95% two-sided confidence interval of sample mean, and (c) 95% one-sided upper confidence bound of sample standard deviation.

### **3.3 Sampling scheme and size simulation**




The segregation horizontal strips as shown in Figures 9b, 9c, and 9d were randomly generated using the data in the shaded area of the *N*(0, 1) distribution, which represent higher air-void contents. In the block segregation pattern (Figure 9e), the *N*(0, 1) distribution was divided into 6 intervals and the data of each interval were randomly distributed into blocks of pavement sections.
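The block-segregation construction can be sketched as follows; the grid size, seed, and the equal-count approximation of the 6 intervals are my illustrative choices, not the chapter's:

```python
import random

random.seed(7)

# Illustrative miniature of the block-segregation pattern: N(0, 1) data are
# split into 6 equal-count intervals (approximating the 6 intervals of the
# distribution) and each interval's values fill one block of the section.
n_blocks = 6
values = sorted(random.gauss(0.0, 1.0) for _ in range(600))
block_size = len(values) // n_blocks
blocks = [values[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
for b in blocks:
    random.shuffle(b)  # cells within a block remain randomly ordered

means = [sum(b) / len(b) for b in blocks]
assert means == sorted(means)  # block means follow the interval order
```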


Fig. 9. Schematic illustration of five data patterns: (a) random pattern, (b) central segregation pattern, (c) bilateral segregation pattern, (d) central-bilateral segregation pattern, and (e) block segregation pattern.

The prospective road section was divided into *n*(*X*) (*x*-direction) × *n*(*Y*) (*y*-direction) cells. The *n*(*X*) represents the number of intervals in the *x*-direction. *N* points were then assigned to these *n*(*X*) × *n*(*Y*) cells. Hence, a sampling scheme is defined by *n*(*X*), *n*(*Y*), and *N*. For instance, *x30y6n30* represents 30 runs that were assigned to 30 cells of the 30 × 6 cells based on the UD table. The sampling schemes considered in this study were combinations of various numbers of *n*(*X*) and *n*(*Y*), that is, *n*(*X*) = 3, 5, 10, 15, 20, 25, 30, 35, 40, 45, 55, 60 and *n*(*Y*) = 1, 2, 3, 4, 6, with *N* = *n*(*X*); however, the cases with *n*(*Y*) > *n*(*X*) were excluded, resulting in a total of 62 cases. Each case was assigned a UD table with the minimum *CD*2 value. Figures 10a through 10c respectively illustrate the example sampling schemes (i.e., UD tables) *x10y6n10*, *x30y6n30*, and *x60y6n60* from the uniform design. These sampling schemes are on the same scales of a 900 ft × 24 ft (274 m × 7.32 m) pavement section. The black rectangle cell physically represents the area from which one measure should be sampled randomly.
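A scheme such as *x30y6n30* can be represented programmatically. The cyclic rule below for picking one *y*-cell per *x*-interval is a hypothetical stand-in for the chapter's minimum-*CD*2 UD table; only the grid-to-section mapping follows the description above:

```python
# Hypothetical stand-in for a UD table: spread N = n(X) runs over an
# n(X) x n(Y) grid by cycling through the y-levels (the chapter instead
# selects cells from a minimum-CD2 uniform design table), then map each
# chosen cell to coordinates on the 900 ft x 24 ft section.
def sampling_scheme(n_x, n_y, length_ft=900.0, width_ft=24.0):
    cells = [(i, i % n_y) for i in range(n_x)]  # one cell per x-interval
    dx, dy = length_ft / n_x, width_ft / n_y
    return [((i + 0.5) * dx, (j + 0.5) * dy) for i, j in cells]  # cell centres

scheme = sampling_scheme(30, 6)  # the x30y6n30 case
assert len(scheme) == 30
assert all(0 <= x <= 900 and 0 <= y <= 24 for x, y in scheme)
```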

For this sampling simulation, a total of 2700 × 72 points with a standard normal distribution of air-void contents were used to generate five data patterns with the following assumptions:

2. Time frame of construction: 1 hour with 900 ft (274 m) of HMA placed, i.e., a paver speed of 900 ft (274 m) per hour.
3. One air-void sample is represented by a 4 × 4-in. (10 × 10-cm) square, with each square corresponding to one of the 2700 × 72 points.


Each type of sampling scheme per data pattern was simulated 200 times. For each simulation, the sample mean and sample standard deviation were calculated. It should be noted that the data of each simulation were randomly drawn from the cells specified in the UD table with replacement. Consequently, the distributions of the sample mean and standard deviation were generated after 200 simulations. The boxplot was then utilized to characterize the location and dispersion of sample means and standard deviations.
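The 200-run simulation loop can be sketched as follows; i.i.d. standard-normal draws stand in for drawing from the UD-table cells with replacement, and the seed and tolerances are my own choices:

```python
import random
import statistics

random.seed(42)

# Sketch of the 200-fold simulation for one sampling scheme: draw n
# air-void values per run and record the sample mean and standard
# deviation of each run.
def simulate(n, runs=200):
    means, sds = [], []
    for _ in range(runs):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.mean(sample))
        sds.append(statistics.stdev(sample))
    return means, sds

means, sds = simulate(30)
assert len(means) == 200
assert abs(statistics.mean(means)) < 0.2  # centred near mu = 0
assert 0.8 < statistics.mean(sds) < 1.2   # near sigma = 1
```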

The boxplot illustrates a measure of location (the median [solid black dot or white strip]), a measure of dispersion (the interquartile range IQR [lower quartile: left or bottom-edge of box; upper quartile: right or top-edge of box]), and the possible outliers (data points with light circle or horizontal line outside the 1.5 IQR distance from the edges of box; the most extreme data points within 1.5 IQR distance are marked with square brackets) and also gives an indication of the symmetry or skewness of the distribution.

The Trellis graph introduced by Cleveland in 1993 (Cleveland, 1993) is a graphical way of examining high-dimensional data structure by means of conditional one-, two-, and three-dimensional graphs. As an example, we would like to determine how the sample mean distribution depends on *n*(*X*), *n*(*Y*), and the data pattern. To inspect this graphically, the simulation results can be split into groups and plotted separately, as opposed to blurring the effects in a single graph. The Trellis graph of boxplots presented in Figures 11 and 12 was arranged in such a way that each panel consists of all the *n*(*Y*) = 1, 2, 3, 4, 6 cases (i.e., 5 boxplots in each panel), each row is made up of all the *N* = *n*(*X*) = 3, 5, 10, 15, 20, 25, 30, 35, 40, 45, 55, 60 cases (i.e., 13 panels in a row) with the same data pattern, and each column has 5 panels (i.e., 5 data patterns) with the same *n*(*X*). Thus, for each individual column, the effects of data pattern and *n*(*Y*) can be examined at the specified *n*(*X*); for each individual row, the effects of *n*(*X*) and *n*(*Y*) can be inspected at the specified data pattern. The Trellis graph was categorized by *n*(*X*), *n*(*Y*), and the five data patterns.


characterize the location and dispersion of sample means and standard deviations.

an indication of the symmetry or skewness of the distribution.

graph was categorized by *n*(*X*), *n*(*Y*), and five data patterns.

randomly.

assumptions:

1. Lane width: 24 ft (7.32 m).

= 15 ft/min. (4.57 m/min.).

assigned a normalized air-void value.

Fig. 9. Schematic illustration of five data patterns: (a) random pattern, (b) central segregation pattern, (c) bilateral segregation pattern, (d) central-bilateral segregation pattern, and (e) block segregation pattern.
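The simulation loop described above can be sketched in a few lines (an illustrative stand-alone script, not the authors' original code: it draws standard-normal air-void values directly rather than from the five patterned data sets, and the function name `simulate` is ours):

```python
import random
import statistics

def simulate(n, reps=200, rng=random.Random(42)):
    """Repeat `reps` runs; each run draws `n` standard-normal air-void
    values and records the sample mean and sample standard deviation."""
    means, sds = [], []
    for _ in range(reps):
        run = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.mean(run))
        sds.append(statistics.stdev(run))
    return means, sds

# The spread of the 200 sample means shrinks as the sample size N grows,
# mirroring the narrowing boxplots in Figure 11.
for n in (3, 10, 30):
    means, _ = simulate(n)
    print(n, round(statistics.stdev(means), 3))
```

With a fixed seed the run is reproducible; the printed spread for *N* = 30 is markedly smaller than for *N* = 3, which is the exponential decay of variation noted in the observations below.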

Application of Sampling Strategies for Hot-Mix Asphalt Infrastructure: Quality Control-Quality Assurance Sampling; Specification for Performance Test Requirements

Fig. 10. Examples of UD tables (a) *x10y6n10*, (b) *x30y6n30*, and (c) *x60y6n60.*

Fig. 11. Trellis graph of boxplots of the sample mean categorized by *n(X)*, *n(Y)*, and five data patterns with *N = n(X).*

Fig. 12. Trellis graph of boxplots of the sample standard deviation categorized by *n(X)*, *n(Y)*, and five data patterns with *N = n(X).*

The Trellis graphs of the boxplots shown in Figures 11 and 12 summarize respectively the simulation results of the sample means and sample standard deviations. Several observations can be made from the Trellis graphs:

1. As *n*(*X*) increases, i.e., as *N* increases, the variations of the sample mean and standard deviation reduce exponentially regardless of the data pattern.
2. For the segregation data patterns 2, 3, and 4, increasing *n*(*Y*) does benefit the decrease of variation per *n*(*X*) and per data pattern. However, no apparent decrease of variation was perceived on the random or block segregation patterns. This implies that the UD table provides a uniform sampling strategy. From the perspective of practice, it is suggested that *n*(*Y*) be as large as possible to include all the possible data patterns.
3. It should be noted that the distributions of the sample standard deviation at small *n*(*X*) exhibit unsymmetrical, skewed shapes due to the intrinsic properties of the χ² (chi-square) distribution. From the point of view of HMA construction, a one-sided upper bound is therefore suggested for judging the sample standard deviation: the smaller the sample standard deviation, the more uniform the construction quality of the HMA. Also, from Figures 8b, 8c, 11, and 12, it is apparent that the variation decreases sharply at the beginning and its rate of change stabilizes after *N* = 20~30.

Accordingly, a proposed QA sampling guideline can be provided by the agency in the following steps:

1. Specify an error level (*E*) of the sample mean in terms of the standard deviation of the specified distribution *N*(μ, σ²), i.e., *E* = |*x̄* − μ| = *k*σ.
2. Specify the α value to construct a 100(1 − α)% confidence interval.
3. Determine the sample size based on Equation 4, *n* = (*z*<sub>α/2</sub>σ/*E*)².
4. Generate a uniform design table (UD table) as the sampling scheme; the *X* factor should have *n* (sample size) levels, i.e., *N* = *n*(*X*) = *n*. It is suggested that the *Y* factor have at least 3 levels per lane, i.e., *n*(*Y*) ≥ 3 per lane.
5. Randomly take the measurement from each (*x*, *y*) cell specified in the UD table.
6. Check the sample mean *x̄* from the *n* observations. If μ − *z*<sub>α/2</sub>σ/√*n* ≤ *x̄* ≤ μ + *z*<sub>α/2</sub>σ/√*n* (Equation 3), then the sample mean is accepted; otherwise, the sample mean is rejected and the agency has to reject the project.
7. Check the sample standard deviation *s* if step 6 has been satisfied. If *s* ≤ σ√(χ²<sub>1−α, *n*−1</sub>/(*n* − 1)) (Equation 5), then the sample standard deviation is accepted; otherwise, the project should be rejected because of non-uniformity of construction quality.
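Steps 1-3 and the acceptance limits of steps 6-7 can be checked numerically. The sketch below is illustrative only (the function name `qa_limits` is ours, and SciPy is assumed for the normal and chi-square quantiles); with μ = 5, σ = 1, *E* = 0.44σ, and α = 0.05 it reproduces the case-study values *n* = 20, CI (4.56, 5.44), and one-sided upper bound 1.26:

```python
import math
from scipy.stats import norm, chi2

def qa_limits(mu, sigma, E, alpha):
    """Sample size (Eq. 4), two-sided limits on the sample mean (Eq. 3),
    and the one-sided upper bound on s (Eq. 5)."""
    z = norm.ppf(1 - alpha / 2)                  # z_{alpha/2}
    n = math.ceil((z * sigma / E) ** 2)          # Eq. 4, rounded up
    half = z * sigma / math.sqrt(n)              # half-width of Eq. 3
    s_max = sigma * math.sqrt(chi2.ppf(1 - alpha, n - 1) / (n - 1))  # Eq. 5
    return n, (mu - half, mu + half), s_max

n, ci, s_max = qa_limits(mu=5.0, sigma=1.0, E=0.44, alpha=0.05)
print(n, [round(v, 2) for v in ci], round(s_max, 2))  # → 20 [4.56, 5.44] 1.26
```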

#### **3.4 UD demonstration example using two field sections**

In this demonstration example, the percent air-void content data of two field pavement sections, each 164 ft (50 m) in length and 36 ft (11 m) in width, were acquired by the Pavement Quality Indicator (PQI), a non-nuclear density measurement device calibrated with core samples. The percent air-void content was taken over a 3.3 × 3.3-ft (1 × 1 m) square. These two pavement sections served as the "testing sections" for which the paving operation, compaction pattern/effort, and other construction details were verified and corrected (if necessary) by the contractor. Several performance tests were comprehensively conducted by the agency afterwards to guarantee that the pavement quality of the whole project met the specifications. The material properties of the two pavement sections, AC-13 and AC-20, are as follows.

| | **Pavement Section AC-13** | **Pavement Section AC-20** |
|---|---|---|
| **Asphalt mix type** | Asphalt concrete with nominal maximum aggregate size (NMAS) 13 mm | Asphalt concrete with NMAS 20 mm |
| **Aggregate type** | Diabase (fully crushed, hard rock) | Granite (fully crushed) |
| **Binder type** | SBS modified binder (equivalent to PG76-22) | |
| **Design binder content** | 5.6% | 4.8% |
| **Target air-void content** | *N*(μ, σ²) = *N*(5, 1): mean 5%; standard deviation 1% | |
| **Acceptable air-void content range** | 5 ± 2%, i.e., *P*(3 ≤ AV ≤ 7) ≈ 0.95 of a *N*(5, 1) distribution | |

The measured percent air-void contents are illustrated in Figures 13a and 13b respectively for the AC-13 and AC-20 pavement sections. As can be seen from the figures, the AC-13 section presents high air-void content at the section edges and appears to have a wide variation of air-void content. The AC-20 section appears to have a more uniform distribution of air-void content.

To illustrate the proposed QA approach, it was decided that 20 points (20 runs) would be sampled to ensure that the agency is 95% confident that the error |*x̄* − μ| will not exceed 0.44σ, i.e., 0.44 percent (Table 3). Two UD tables (Figures 13c and 13d) were generated for both sections, which are subdivided into 10 (*x*-direction) by 11 (*y*-direction) cells, i.e., *x10y11n20*. In this case study, the sampling for each UD table was conducted only once. Figures 13e and 13f summarize the sampled, measured, and specified distributions of air-void content. Several findings can be addressed in the following:

1. The sampled distribution based on the UD table matches the measured distribution reasonably well: AC-13 sampled *N*(6.29, 1.40²) versus AC-13 measured *N*(6.18, 1.43²); AC-20 sampled *N*(5.41, 1.22²) versus AC-20 measured *N*(5.12, 1.24²).
2. The sample mean, 6.29, of the AC-13 section is outside the 95% CI (4.56, 5.44) (Table 3); therefore, it is identified as an "inaccurate" distribution. The sample standard deviation, 1.40, exceeds the 95% one-sided upper bound 1.26 (Table 3); thus, it is designated as an "imprecise" distribution. As a result, the construction quality of the AC-13 section is not acceptable because of its "inaccurate" and "imprecise" distribution.
3. On the contrary, the construction quality of the AC-20 section is not rejected because of its "accurate" and "precise" distribution: the sample mean 5.41 lies in the 95% CI, although on the high side; the sample standard deviation 1.22 is slightly less than the 95% one-sided upper bound 1.26.
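The accept/reject logic of findings 2 and 3 amounts to the following small check (an illustrative snippet; the limits are the Table 3 values quoted in the text, and the function name `judge` is ours):

```python
def judge(xbar, s, ci=(4.56, 5.44), s_max=1.26):
    """Apply the Eq. 3 criterion on the sample mean and the Eq. 5
    one-sided criterion on the sample standard deviation."""
    accurate = ci[0] <= xbar <= ci[1]   # "accurate" distribution
    precise = s <= s_max                # "precise" distribution
    return "accept" if (accurate and precise) else "reject"

print(judge(6.29, 1.40))  # AC-13 → reject (inaccurate and imprecise)
print(judge(5.41, 1.22))  # AC-20 → accept (accurate and precise)
```

Both criteria must hold simultaneously; failing either one is sufficient to reject the project.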



## **4. Why is it inappropriate to take only three samples?**

It is not uncommon for agencies to base QA on three samples. However, the following discussion using the *t* distribution shows why it is inappropriate to take only this number of samples for quality assurance. When estimating the mean of a normally distributed population with unknown mean μ and unknown variance σ², the *t* distribution should be applied, especially with a small sample size. Let *X*<sub>1</sub>, *X*<sub>2</sub>, …, *X*<sub>*n*</sub> be a random sample from a normal distribution with unknown mean μ and unknown variance σ².

The random variable

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a *t*<sub>*n*−1</sub> distribution with *n* − 1 degrees of freedom, where *X̄* is the sample mean and *S* is the sample standard deviation. Now, if the true mean is μ = μ<sub>0</sub> + δ with δ ≠ 0, then

$$T_0 = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{\bar{X} - \mu_0 - \delta + \delta}{S/\sqrt{n}} = \frac{\bar{X} - (\mu_0 + \delta)}{S/\sqrt{n}} + \frac{\delta\sqrt{n}}{S}$$

If δ = 0, then *T*<sub>0</sub> follows the central *t*<sub>*n*−1</sub> distribution. When the true value of the mean is μ<sub>0</sub> + δ, the distribution of *T*<sub>0</sub> is termed the noncentral *t*<sub>*n*−1</sub> distribution with noncentrality parameter δ√*n*/*S*. Based on the definition of the Type II error, β = *P*{fail to reject *H*<sub>0</sub> | *H*<sub>0</sub> is false}, under the hypothesis testing *H*<sub>0</sub>: μ = μ<sub>0</sub>, *H*<sub>1</sub>: μ < μ<sub>0</sub>, the Type II error is


Fig. 13. Image plots of air-void measures for sections (a) AC-13 and (b) AC-20; UD tables *x10y11n20* for sections (c) AC-13 and (d) AC-20; the specification, measured, and sampled distributions for sections (e) AC-13 and (f) AC-20


made only if *T*<sub>0</sub> > −*t*<sub>α, *n*−1</sub> [shown in Figures 14a and 14b respectively for the situations (a) δ > 0 and (b) δ < 0], where *T*<sub>0</sub> follows the noncentral *t*<sub>*n*−1</sub> distribution. Hence, we have the probability of Type II error

$$\beta = 1 - \mathcal{T}\_{n-1} \left( -t\_{\alpha, n-1} - \frac{\delta \sqrt{n}}{\sigma} \right), \text{ i.e., } \text{ power} = 1 - \beta = \mathcal{T}\_{n-1} \left( -t\_{\alpha, n-1} - \frac{\delta \sqrt{n}}{\sigma} \right), \tag{6}$$

where 𝒯<sub>*n*−1</sub> is the cumulative distribution function of the *t* distribution with *n* − 1 degrees of freedom. From Figure 14, it is apparent that the more positive the δ value, the larger the β value, i.e., the smaller the power; on the contrary, the more negative the δ value, the smaller the β value, i.e., the larger the power.

Fig. 14. Definition of the Type II error (β) of a *t* distribution under the hypothesis testing *H*<sub>0</sub>: μ = μ<sub>0</sub>, *H*<sub>1</sub>: μ < μ<sub>0</sub>, for the situations (a) δ > 0 and (b) δ < 0.
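The shift of *T*<sub>0</sub> induced by the noncentrality parameter can be illustrated with a small simulation (our own sketch, assuming σ = 1 and *n* = 5; it is not part of the original derivation):

```python
import math
import random
import statistics

rng = random.Random(7)

def t0_values(delta, n=5, reps=4000, mu0=0.0):
    """Simulate T0 = (Xbar - mu0) / (S / sqrt(n)) when the true mean is mu0 + delta."""
    vals = []
    for _ in range(reps):
        x = [rng.gauss(mu0 + delta, 1.0) for _ in range(n)]
        xbar, s = statistics.mean(x), statistics.stdev(x)
        vals.append((xbar - mu0) / (s / math.sqrt(n)))
    return vals

# delta = 0 gives the central t distribution (centered near 0); a nonzero
# delta shifts T0 by roughly delta*sqrt(n)/S, as in the decomposition above.
for d in (0.0, 1.0, -1.0):
    print(d, round(statistics.mean(t0_values(d)), 2))
```

A positive δ pushes the *T*<sub>0</sub> distribution to the right of the rejection region, which is exactly why β grows with δ in Figure 14a.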

Equation 6 indicates that power is a function of α, *n*, and δ/*S*. Figure 15 plots power versus δ/*S* at various sample sizes. Under the hypothesis testing *H*<sub>0</sub>: μ = μ<sub>0</sub>, *H*<sub>1</sub>: μ < μ<sub>0</sub>, with α = 0.05 and *n* = 3, the interpretation of Figure 15 is that one will have power greater than 0.8 to reject the null hypothesis if δ/*S* ≤ −2.30; on the other hand, if δ/*S* ≥ −2.30, then the agency has insufficient power to reject the null hypothesis *H*<sub>0</sub>: μ = μ<sub>0</sub>. It should be noted that, by increasing the sample size from three to five, the agency will have power greater than 0.8 if δ/*S* ≤ −1.37; that is, the agency can detect a smaller mean difference, from 2.30*S* down to 1.37*S*, by adding just two samples. In sum, by taking only three samples out of a project, the agency will have insufficient power to reject *H*<sub>0</sub>: μ = μ<sub>0</sub> given that *H*<sub>0</sub> is false, unless the quality of the project delivered by the contractor is so poor that the agency is confident enough to reject the project.

Fig. 15. Power versus δ/S curves at different sample sizes for the one-sided *t*-test at a significance level α = 0.05.
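The δ/*S* detection thresholds quoted above (−2.30 at *n* = 3 and −1.37 at *n* = 5 for power 0.8) can be reproduced directly from Equation 6. This is an illustrative check only (SciPy assumed, function names ours), using the central-*t* form of Equation 6, not the authors' code:

```python
import math
from scipy.stats import t

def power(delta_over_s, n, alpha=0.05):
    """Power of the one-sided t-test (Eq. 6), H0: mu = mu0 vs H1: mu < mu0."""
    return t.cdf(-t.ppf(1 - alpha, n - 1) - delta_over_s * math.sqrt(n), df=n - 1)

def threshold(n, target=0.8, alpha=0.05):
    """Largest delta/S still detected with at least the target power (Eq. 6 inverted)."""
    return -(t.ppf(1 - alpha, n - 1) + t.ppf(target, n - 1)) / math.sqrt(n)

print(round(threshold(3), 2), round(threshold(5), 2))  # → -2.3 -1.37
```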

#### **5. Findings and conclusions**

For the Case I study, an attempt has been made to illustrate an approach and the extent of testing required using a performance test to ensure reasonable quality in as-placed HMA. Stabilometer S-value test results were used in this example since extensive data were available. It should be emphasized that the same approach could be applied using other test parameters to control the quality of the as-constructed mix.

Based on stabilometer test results, the brief discussion of hypothesis testing, and the simulation results of sampling scheme and size, the following observations and suggestions are offered:


1. Cooperation between the agency and the contractor is essential. It is necessary to keep the testing process, test equipment, test results, and specimen preparation as consistent as possible between the two organizations.
2. The sampling simulation of the Case I demonstration example suggests that the sample size required to stabilize the sampling consistency and sampling stabilization is around 50~70 for the placement of 15,000 tons of HMA.
3. Admittedly, sampling as noted in (2) is perhaps impractical. However, increasing the sample size is actually beneficial for both the agency and the contractor, since it reduces the potential for dispute and guarantees the quality of the constructed mix. By extension, it is advisable for the agency to provide incentives to encourage the contractor to increase sampling size and testing.
4. To ensure the success of the proposed QC/QA guidelines, the contractor's minimum value of the testing null hypothesis must exceed that required by the agency.
5. From the Caltrans case study, the β<sub>min</sub> criterion depended not only on the contractor's β value and the agency's power level, as expected, but also on the *k* value that the agency would select for use. The β<sub>min</sub> criterion can be smaller if both the agency and the contractor require a low power level and a high β level and/or the agency increases the *k* value.


A concluding general observation relates to the concern for developing longer lasting pavement at this period of time because of increased costs of both pavement materials and increased traffic that must be accommodated. The added costs of testing by both the contractor and the agency are a very small proportion of the total costs associated with long lasting pavements. Accordingly an "attitude adjustment" for both parties relative to QC and QA testing would enhance long-term pavement performance.

From the above discussion of Case II for determining sample size, the simulation results of the sampling size and sampling scheme using UD tables, along with a demonstration example, the following observations and suggestions are offered:

1. It is important to recognize that the agency can be 100(1 − α)% confident that the error |*x̄* − μ| will not exceed a specified amount *E* if and only if the sample size is *n* = (*z*<sub>α/2</sub>σ/*E*)². The variations of the sample mean and sample standard deviation for the 900 ft HMA paving simulation (Figures 8, 11, and 12) suggest that the minimum sample size required to stabilize the variation is around 20~30.
2. The UD table not only provides the most representative sampling scheme with the sample size for a given specified error level by the agency but also minimizes the possible effect of the underlying data pattern. Moreover, the UD table gives the agency a more unbiased "random" sampling scheme that can be followed in the quality assurance process.
3. The sample mean and sample standard deviation criteria proposed in the QA guideline demonstrate the accurate/inaccurate and precise/imprecise concept of sampling outcomes. If the sample mean is located in the range of the 100(1 − α)% confidence interval, then it is accurate. Precision is a term to describe the degree of data dispersion; if the sample standard deviation is less than the 100(1 − α)% one-sided upper bound, then it is precise. The case study presents a very good example of an inaccurate/imprecise case (the AC-13 field section) and an accurate/precise case (the AC-20 field section). The quality of a project can be accepted if and only if these criteria have been fulfilled simultaneously.
4. The proposed QA guideline with the introduction of the UD table is relatively simple, practical, and robust. The sample mean and sample standard deviation criteria are rational enough for both the agency and the contractor to agree upon.
5. It should be emphasized that the proposed QA approach could be applied with other performance measurement parameters to control the quality of the as-constructed mix, such as thickness, stabilometer testing as used in California, performance testing of fatigue and rutting, etc. Moreover, the decision-making based on this proposed QA approach can also be a basis for pay factor determination.
6. By taking only three samples out of a project, the agency will have insufficient power to reject *H*<sub>0</sub>: μ = μ<sub>0</sub> given that *H*<sub>0</sub> is false, unless the quality of the project delivered by the contractor is so poor that the agency is confident enough to reject the project. However, by increasing the sample size from three to five, the agency can detect a smaller mean difference, from 2.30*S* down to 1.37*S*, by simply adding two samples.


**7** 

*1MRC-Holland* 

*3Berg IT solutions The Netherlands* 

*2Free University Amsterdam* 

**Analysis of MLPA Data Using Novel Software** 


contractor is so poor that the agency is confident enough to reject the project. However, by increasing the sample size from three to five, the agency can detect a smaller mean difference, from 2.30*S* down to 1.37*S*, simply by adding two samples.

7. It is likely that the proposed sampling size is impractical. In this regard, the alternative is to establish a "testing section" similar to those in the case study and follow the proposed QA approach with the minimum sampling size (at least greater than 20) to ensure that the compaction pattern/effort, paving operation, and other construction details are appropriate to guarantee that the pavement quality meets the specifications.

## **6. Acknowledgments**

The research associated with the first case study was conducted as a part of the Partnered Pavement Research Program supported by the California Department of Transportation (Caltrans) Division of Research and Innovation. Special thanks go to Mr. Kee Foo of Caltrans, who provided the stability data from Caltrans projects. The contents of this paper reflect the views of the authors, who are responsible for the facts and accuracy of the information presented, and do not reflect the official views or policies of the State of California or the Federal Highway Administration.

The field data associated with the second case study was sponsored by the Ministry of Transport of the People's Republic of China. The contents of this paper reflect the views of the authors, who are responsible for the facts and accuracy of the information presented, and do not reflect the official views or policies of the Ministry of Transport of the People's Republic of China.

## **7. References**


California Department of Transportation (Nov. 2007). *Standard Specifications,* Sacramento, Calif., USA

Cleveland, W.S. (1993). *Visualizing Data,* Hobart Press, ISBN 978-0963488404, Summit, NJ, USA

Fang, K.T. (1980). The Uniform Design: Application of Number Theoretical Methods in Experimental Design. *Acta Mathematicae Applagatae Sinica,* Vol.3, pp. 353-372, ISSN 1618-3832

Fang, K.T.; Lin, D.K.J.; Winker, P. & Zhang, Y. (2000). Uniform Design: Theory and Application. *Technometrics,* Vol.42, No.3, pp. 237-248, ISSN 1537-2723

Fang, K.T. & Lin, D.K.J. (2003). Uniform Experimental Designs and Their Applications in Industry. In: *Handbook of Statistics 22,* Khattree, R. & Rao, C.R. (Eds.), pp. 131-170, ISBN 0-444-50614-4

Montgomery, D.C. & Runger, G.C. (2010). *Applied Statistics and Probability for Engineers,* John Wiley & Sons, Inc., ISBN 978-0-470-05304-1, USA

Stone, C.J. (1996). *A Course in Probability and Statistics,* Duxbury Press, ISBN 0-534-23328-7, Pacific Grove, Calif., USA

Tsai, B.-W. & Monismith, C.L. (2009). Quality Control – Quality Assurance Sampling Strategies for Hot-Mix Asphalt Construction. *Transportation Research Record: Journal of the Transportation Research Board,* No.2098, pp. 51-62, ISSN 0361-1981

Wang, Y. & Fang, K.T. (1981). A Note on Uniform Distribution and Experimental Design. *KeXue TongBao,* Vol.26, pp. 485-489, ISSN 0250-7862

## **Analysis of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland**

Jordy Coffa1,2 and Joost van den Berg3
*1MRC-Holland, 2Free University Amsterdam, 3Berg IT solutions, The Netherlands*

## **1. Introduction**


Genetic knowledge has increased tremendously in recent years, filling gaps and giving answers that were inaccessible before. Medical genetics seeks to understand how genetic variation relates to human health and disease (National Center for Biotechnology Information, 2008). Knowledge of the genetic origins of disease has increased our understanding of illnesses caused by abnormalities in genes or chromosomes, offering the potential to improve the diagnosis and treatment of patients. Normally, every person carries two copies of every gene (with the exception of genes related to sex-linked traits), which cells can translate into a functional protein. The presence of mutant forms of genes (mutations, copy number changes, insertions/deletions and chromosomal alterations) may affect several processes in the production of these proteins, often resulting in the development of genetic disorders. Genetic disease is either caused by changes in the DNA of somatic cells in the body or is inherited, e.g. through mutations in the germ cells of the parents.

Genetic testing is "the analysis of chromosomes (DNA), proteins, and certain metabolites in order to detect heritable disease-related genotypes, mutations, phenotypes, or karyotypes for clinical purposes" (Holtzman et al., 2002). To make this suitable for routine diagnostics, dedicated, affordable, fast, easy-to-interpret and simple-to-use genetic tests are necessary. These allow scientists to easily access information that can, for instance, be used to confirm or rule out a suspected genetic condition, or to help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed (Sequeiros et al., 2008). Multiplex Ligation-dependent Probe Amplification (MLPA) is a PCR-based technique that allows the detection of copy number changes in DNA or RNA. MLPA can quantify up to 50 nucleic acid sequences or genes in one simple reaction, with a resolution down to the single nucleotide level (Schouten et al., 2002), needing only 20 ng of DNA. The MLPA procedure itself needs little hands-on work, allowing up to 96 samples to be handled simultaneously, while results can be obtained within 24 hours. These properties make it a very efficient technique for medium-throughput screening of many different diseases in both research and diagnostic settings (Ahn et al., 2007).



Over a million MLPA reactions were performed worldwide last year, but researchers are still concerned with the lack of tools to facilitate and improve MLPA data analysis on large, complex data sets. MLPA kits contain oligo-nucleotide probes that, through a biochemical reaction, produce signals proportional to the amount of the target sequences present in a sample. These signals are detected and quantified on a capillary electrophoresis device, producing a fragment profile. The signals of an unknown sample need to be compared to a reference in order to assess the copy number. Profile comparison is a matter of professional judgment and expertise. Diverse effects may furthermore systematically bias the probe measurements, such as the quality of DNA extraction, PCR efficiency, label incorporation, exposure, scanning and spot detection, making data analysis even more challenging. To make data more intelligible, the detected probe measurements of different samples need to be normalized, thereby removing the systematic effects and bringing data of different samples onto a common scale.

Although several normalization methods have been proposed, they frequently fail to take into account the variability of systematic error within and between MLPA experiments. Each MLPA study is different in design, scope, number of replicates and technical considerations. Data normalization is therefore often context dependent, and a general method that provides reliable results in all situations is hard to define. The most used normalization strategy therefore remains in-house analysis spreadsheets, which often cannot provide the reliability required for results with clinical purposes. These sheets furthermore do not allow easy handling of large amounts of data, and file retrieval, storage and archiving need to be handled by simple file management systems. We therefore set out to develop software that could tackle all of these problems and provide users with reliable results that are easy to interpret.

In this chapter we show the features and integrated analysis methods of our novel MLPA analysis software called Coffalyser.NET. Our software uses an analysis strategy that can adapt to fit the researcher's objectives while considering both the biological context and the technical limitations of the overall study. We use statistical parameters appropriate to the situation and apply the most robust normalization method based on the biology and quality of the data. Most information required for the analysis is extracted directly from the database of MRC-Holland, producer of the MLPA technology, so that only little user input about the experimental design is needed to define an optimal analysis strategy. In the next section we review the MLPA technology in more detail and explain the principles of MLPA data normalization. In section 3, we describe the main features of our software and their significance. The database behind our software is reviewed in section 4, and section 5 explains the exact workflow of our program, reviewing the importance and methodology of each analysis step in detail. In the final section, we summarize our paper and present the future directions of our research.

## **2. Background**

MLPA data is commonly used for sophisticated genomic studies and research to develop clinically validated molecular diagnostic tests, which can e.g. provide individualized information on response to certain types of therapy and the likelihood of disease recurrence. The most common application for MLPA is the detection of small genomic aberrations, often accounting for 10 to 30% of all disease-causing mutations (Redeker et al., 2008). In case of the very long DMD gene (involved in Duchenne muscular dystrophy), exon deletions and duplications even account for 65-70% of all mutations (Janssen et al., 2005). Since MLPA can detect sequences that differ by only a single nucleotide, the technique is also widely used for the analysis of complicated diseases such as congenital adrenal hyperplasia and spinal muscular atrophy, where pseudo-genes and gene conversion complicate the analysis (Huang et al., 2007). Methylation-specific MLPA (MS-MLPA) has also proven to be a very useful method for the detection of aberrant methylation patterns in imprinted regions, such as those found in the Prader-Willi/Angelman and Beckwith-Wiedemann syndromes (Scott et al., 2008). The MS-MLPA method can also be used for the analysis of aberrant methylation of CpG islands in tumour samples, using e.g. DNA derived from formalin-fixed, paraffin-embedded tissues.

MLPA kits generally contain about 40-50 oligo-nucleotide probes, targeted mainly to the exonic regions of a single gene or of multiple genes. The number of genes that each kit covers depends on the purpose of the designed kit. Each oligo-probe consists of two hemi-probes, which after denaturation of the sample DNA hybridize to adjacent sites of the target sequence during an overnight incubation. For each probe oligo-nucleotide in an MLPA kit there are about 600,000,000 copies present during the overnight incubation. An average MLPA reaction contains 60 ng of human sample DNA, which corresponds to about 20,000 haploid genomes. This abundance of probes compared to the sample DNA allows all target sequences in the sample to be covered. After the overnight hybridization, adjacently hybridized hemi-probe oligo-nucleotides are ligated using a ligase enzyme and the ligase cofactor NAD, at a slightly lower temperature than the hybridization reaction (54 °C instead of 60 °C). The ligase enzyme used, Ligase-65, is heat-inactivated after the ligation reaction. The non-ligated probe oligo-nucleotides do not have to be removed afterwards, since the ionic conditions during the ligation reaction resemble those of an ordinary 1x PCR buffer. The PCR reaction can therefore be started directly after the ligation reaction by adding the PCR primers, polymerase and dNTPs. All ligated probes have identical end sequences, permitting simultaneous PCR amplification using only one primer pair. In the PCR reaction, one of the two primers is fluorescently labeled, enabling the detection and quantification of the probe products.
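The probe excess described above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a human haploid genome mass of roughly 3.3 pg, a standard figure that is not stated in the text:

```python
# Rough check of the probe-to-target excess described above.
# Assumption (not from the text): one human haploid genome weighs ~3.3 pg.

HAPLOID_GENOME_PG = 3.3          # picograms of DNA per haploid genome
sample_ng = 60.0                 # DNA input of an average MLPA reaction
probe_copies = 600_000_000       # copies of each probe oligo in the mix

haploid_genomes = sample_ng * 1000.0 / HAPLOID_GENOME_PG  # ng -> pg
print(f"haploid genomes in sample: ~{haploid_genomes:,.0f}")  # ~18,182

excess = probe_copies / haploid_genomes
print(f"probe copies per target copy: ~{excess:,.0f}")        # ~33,000
```

This reproduces the text's "about 20,000 haploid genomes" and shows each target sequence is outnumbered by probe copies by four orders of magnitude, which is why all targets can be covered.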

The different length of every probe in the MLPA kit then allows these products to be separated and measured using standard capillary fragment electrophoresis. The unique length of every probe in the probe mix is used to associate the detected signals back to the original probe sequences. These probe product measurements are proportional to the amount of the target sequences present in a sample, but cannot simply be translated into copy numbers or methylation percentages. To make the data intelligible, data of a probe originating from an unknown sample needs to be compared with a reference sample, usually a sample that has a normal (diploid) DNA copy number for all target sequences. When the signal strengths of the probes are compared with those obtained from a reference DNA sample known to have two copies of the chromosome, the signals are expected to be 1.5 times the intensities of the respective probes from the reference if an extra copy is present. If only one copy is present, the proportion is expected to be 0.5. If the sample has two copies, the relative probe strengths are expected to be equal. In some circumstances reliable results can be obtained by comparing unknown samples to reference samples by visual assessment, simply by overlaying two fragment profiles and comparing relative intensities of fragments (figure 1).



Fig. 1. MLPA fragment profile of a patient sample with Duchenne disease (bottom) and that of a reference sample (top). Duchenne muscular dystrophy is the result of a defect in the DMD gene on chromosome Xp21. The fragment profile shows that the probe signals targeted to exons 45-50 of the DMD gene show a 100% decrease compared to the reference, which may be the result of a homozygous deletion.

It may however not be feasible to obtain reliable results from such a visual comparison if:

1. The DNA quality of the samples and references is incomparable.
2. The MLPA kit contains probes targeted to a number of different genes or different chromosomal regions, resulting in complex fragment profiles.
3. The data set is very large, making visual assessment very laborious.
4. The DNA was isolated from tumor tissue, which often shows DNA profiles with altered reference probes.

To make (complex) MLPA data easier to understand, unknown and reference samples have to be brought onto a common scale. This can be done by normalization: the division of multiple sets of data by a common variable in order to cancel out that variable's effect on the data. MLPA kits usually include so-called reference probes, which may be used in multiple ways to comprise such a common variable. Reference probes are usually targeted to chromosomal regions that are assumed to remain normal (diploid) in the DNA of applicable samples. The results of data normalization are probe ratios, which display the balance of the measured signal intensities between sample and reference. In most MLPA studies, comparing the calculated MLPA probe ratios to a set of arbitrary borders is used to recognize gains and losses (González, 2008). Probe ratios below 0.7 or above 1.3 are for instance regarded as indicative of a heterozygous deletion (copy number change from two to one) or duplication (copy number change from two to three), respectively. A delta value of 0.3 is a commonly accepted, empirically derived threshold for genetic dosage quotient analysis (Bunyan et al., 2004). To get more conclusive results, probes may be arranged according to chromosomal location, as this may reveal more subtle changes such as those observed in mosaic cases.
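The normalization idea can be sketched in a few lines. This is an illustration only, not Coffalyser.NET's actual algorithm: probe names and signal values are invented, each test probe is scaled by the mean reference-probe signal of its own sample, and the resulting sample/reference ratio is compared against the 0.7/1.3 borders mentioned above:

```python
# Minimal sketch of reference-probe normalization (illustrative only).
from statistics import mean

def probe_ratios(test_probes, ref_probes, test_sample, ref_sample):
    """Scale each test probe by its sample's mean reference signal,
    then divide the test sample's scaled value by the reference sample's."""
    ratios = {}
    for probe in test_probes:
        norm_test = test_sample[probe] / mean(test_sample[p] for p in ref_probes)
        norm_ref = ref_sample[probe] / mean(ref_sample[p] for p in ref_probes)
        ratios[probe] = norm_test / norm_ref
    return ratios

def call(ratio):
    """Apply the arbitrary 0.7 / 1.3 borders from the text."""
    if ratio < 0.7:
        return "loss"
    if ratio > 1.3:
        return "gain"
    return "normal"

# Invented signals: 'ex45' has lost one copy, 'ex46' is normal.
ref_sample  = {"ex45": 1000, "ex46": 980, "ref1": 1020, "ref2": 1000}
test_sample = {"ex45": 510,  "ex46": 990, "ref1": 1010, "ref2": 990}

for probe, r in probe_ratios(["ex45", "ex46"], ["ref1", "ref2"],
                             test_sample, ref_sample).items():
    print(probe, round(r, 2), call(r))
```

Dividing by the in-sample reference mean is what cancels sample-level effects (DNA input, PCR efficiency) before samples are compared, which is why reference probes must sit in regions assumed to stay diploid.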

## **3. Key features**


## **3.1 Support for a wide range of file formats**

Our software is compatible with binary data files produced by all major capillary electrophoresis systems, including: ABIF files (\*.FSA, \*.AB1, \*.ABI) produced by Applied Biosystems devices, SCF and RSD files produced by MegaBACE™ systems (Amersham), and SCF and ESD files produced by CEQ systems (Beckman). We can also import fragment lists in text or comma-separated format, produced by different fragment analysis software programs such as GeneScan (Applied Biosystems), GeneMapper (Applied Biosystems), CEQ fragment analysis software (Beckman) and GeneTools. Raw data files are however preferred, since they allow more troubleshooting and quality check options as compared to size-called fragment lists. In addition, raw and analyzed data are then stored in a single database, and more advanced reports can be created.
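Two of these raw formats can be told apart by their leading magic bytes: per their format specifications, ABIF files begin with the ASCII bytes "ABIF" and SCF files with ".scf". A small sketch (the function name is illustrative; RSD/ESD layouts and text fragment lists are not handled here):

```python
# Sketch: recognizing raw capillary-electrophoresis trace files by their
# leading magic bytes. Only the ABIF and SCF signatures are checked.

def sniff_trace_format(first_bytes: bytes) -> str:
    if first_bytes.startswith(b"ABIF"):
        return "ABIF (Applied Biosystems *.fsa / *.ab1)"
    if first_bytes.startswith(b".scf"):
        return "SCF (Standard Chromatogram Format)"
    return "unknown"

print(sniff_trace_format(b"ABIF\x00\x65"))  # an ABIF header
print(sniff_trace_format(b".scf3.00"))      # an SCF header
```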

## **3.2 Optimized peak detection / quantification method for MLPA**

All applied algorithms in our software are specifically designed to suit MLPA or MLPA-like applications. We designed an algorithm for peak detection and quantification specifically for MLPA peak patterns. Most peak detection algorithms simply identify peaks based on amplitude, ignoring the additional information in the shape of the peaks. In our experience, 'true' peaks have characteristic shapes, and including fluorescence of artifacts may introduce ambiguity into the analysis and interpretation process. Our algorithm has the ability to differentiate most spurious peaks and artifacts from peaks that originate from a probe product. We differentiate a number of different peak artifacts, such as shoulder peaks, printout spikes, dye artifacts, split peaks, pull-up peaks, stutter peaks and non-template additions. It is often difficult to identify the correct peaks due to the appearance of non-specific peaks in the vicinity of the main allele peak. Our algorithm is therefore optimized to discriminate the different artifacts from the probe signals by using minimum and maximum threshold values on the peak amplitude, area, width and length. Next to this, it may also recognize split and shoulder peaks by means of shape recognition, making correct identification of probe signals even more reliable. Following peak detection, quantification and size calling, our software allows one or more peaks to be linked to the original MLPA probe target sequence. This pattern matching is greatly simplified as compared to other genotyping programs and additionally provides a powerful technique for identifying and separating signal due to capillary electrophoresis artifacts. Our software may employ three different metrics to reflect the amount of probe fluorescence: peak height, peak area, and peak area including its siblings. Peak siblings are the peak artifacts that are created during the amplification of the true MLPA products but have received an alternative length. To determine which metric should be used for data normalization, our program uses an algorithm that compares the signal level of each metric to evaluate the magnitude of each probe ratio in combination with its significance in the population. The significance of each ratio can be estimated from the quality of the performed normalization, which can be assessed by two factors: the robustness of the normalization factor and the reproducibility of the samples in the experiment.

During the analysis our software estimates the reproducibility of each sample type in a performed experiment by calculating the standard deviation of each probe ratio in that sample type population. Since reference samples are assumed to be genetically equal, the effect of sample-to-sample variation on probe ratios of test probes is estimated by the reproducibility of these probes in the reference sample population. These calculations may be more accurate under circumstances where reference samples are randomly distributed across the performed experiment. Our program therefore provides an option to create a specific experimental setup following these criteria, thereby producing a worksheet for the wet analysis and a setup file for capillary electrophoresis devices. DNA sample names can be selected from the database and may be typed as a reference or test sample, positive control or negative control. This setup file replaces the need for filling in the sample names

in the capillary electrophoresis run software thereby minimizing data entry errors.

To evaluate the robustness of the normalization factor our algorithm calculates the discrepancies computed between the probe ratios of the reference probes within each sample. Our normalization makes use of each reference probe for normalization of each test probe; thereby producing as many dosage quotients (DQ) as there are references probes. The median of these DQ's will then be used as the definite ratio. The median of absolute deviations between the computed dosage quotients may reflects the introduced mathematical imprecision of the used normalization factor. Next, our software calculates the effect of both types of variation on each test sample probe ratio and determines a 95% confidence range. By comparing each sample's test probe ratio and its 95% confidence range to the available data of each sample type population in the experiment, we can conclude if the found results are significantly different from e.g. the reference sample population or equal to a positive sample population. The algorithm then completes the analysis by evaluating these results in combination with the familiar set of arbitrary borders used to recognize gains and losses. A probe signal in concluded to be aberrant to the reference samples; if a probe signal is significantly different as from that reference sample populations and if the extent of this change meets certain criteria. The results are finally translated into easy to understand bar charts (figure 2) and sample reports allowing users to make a

The database behind our software is designed in SQL and is based on a relational database management system (RDBMS). In short this means that data is stored in the form of tables and the relationship among the data is also stored in the form of tables. Our database setup contains a large number of subtraction levels, not only allowing users to efficiently store and review experimental sample data, but also allowing users to get integrative view on comprehensive data collections as well as supplying an integrated platform for comparative genomics and systems biology. While all data normalization occurs per experiment, experiments can be organized in projects, allowing advanced data-mining options enabling users to retrieve and review data in many different ways. Users can for instance review multiple MLPA sample runs from a single patient in a single report view. Results of multiple MLPA mixese may be clustered together, allowing users gain more confidence on

and the reproducibility of the sample reactions.

reliable and astute interpretation of the results.

**3.5 Advanced data mining options** 

over the reference probes in all samples, and compares this to the amount of noise over the same signals. The metric that has the largest level signal to noise is then used in the following normalization steps.

## **3.3 Performances and throughput**

After a user logs in, analysis of a complete experiment can be performed in two simple steps: the processing of raw data and the comparison of different samples. Depending on the analysis setup and type of computer, the complete analysis may be completed in less than a minute for 24 samples. Our software can also make use of extra cores running in a computer, multiplying the speed of the analysis almost by two for each core. Because of problems arising from poor sample preparations, presence of PCR artifacts, irregular stutter bands, and incomplete fragment separations, a typical MLPA project requires manual examination of almost all sample data. Our software was designed to eliminate this bottleneck by substantially minimizing the need to review data. By creating a series of quality scores to the different processes users can easily pinpoint the basis for the failed analysis. These scores include quality assessment related to: the sample DNA, MLPA reaction, capillary separation and normalization steps (figure 6). The quality of each step can fall roughly into three categories.


When the analysis is finished the results can be visualized in a range of different display and reporting options designed to meet the requirement of modern research and diagnostic facilities. Results effortlessly can be exported to all commonly used medical report formats such as: pdf, xls, txt, csv, jpg, gif, png etc.

#### **3.4 Reliable recognition of aberrant probes**

Results interpretation of clinically relevant tests can be one of the most difficult aspects of MLPA analysis and is a matter of professional judgment and expertise. In practice, most users only consider the magnitude of a sample test probe ratio, comparing the ratio against a threshold value. This criterion alone may often not provide the conclusive results required for diagnosing disease. MLPA probes all have their own characteristics and the level of increase or decrease that a probe ratio displays that was targeted to a region that contains a heterozygous gain or loss, may differ for each probe. Interpretation of normalized data may even be more complicated due to shifts in ratios caused by sample-to-sample variation such as: dissimilarities in PCR efficiency and size to signal sloping. Other reasons for fluctuations in probe ratios may be: poor amplification, misinterpretation of an artifact peak/band as a true probe signal, incorrect interpretation of stutter patterns or artifact peaks, contamination, mislabeling or data entry errors (Bonin et al., 2004). To make result interpretation more reliable our software combines effect-size statistics and statistical interference allowing users

over the reference probes in all samples, and compares this to the amount of noise over the same signals. The metric that has the largest level signal to noise is then used in the

After a user logs in, analysis of a complete experiment can be performed in two simple steps: the processing of raw data and the comparison of different samples. Depending on the analysis setup and type of computer, the complete analysis may be completed in less than a minute for 24 samples. Our software can also make use of extra cores running in a computer, multiplying the speed of the analysis almost by two for each core. Because of problems arising from poor sample preparations, presence of PCR artifacts, irregular stutter bands, and incomplete fragment separations, a typical MLPA project requires manual examination of almost all sample data. Our software was designed to eliminate this bottleneck by substantially minimizing the need to review data. By creating a series of quality scores to the different processes users can easily pinpoint the basis for the failed analysis. These scores include quality assessment related to: the sample DNA, MLPA reaction, capillary separation and normalization steps (figure 6). The quality of each step can

1. High-quality or green. The results of these analysis steps can be accepted without

2. Low-quality or red. These steps represent samples with contamination and other failures, which render the resulted data unsuitable to continue with. This data can quickly be rejected without reviewing; recommendations can be reviewed in

3. Intermediate-quality or yellow. The results of these steps fall between high- and lowquality. The related data and additional recommendations can be reviewed in

When the analysis is finished the results can be visualized in a range of different display and reporting options designed to meet the requirement of modern research and diagnostic facilities. Results effortlessly can be exported to all commonly used medical report formats

Results interpretation of clinically relevant tests can be one of the most difficult aspects of MLPA analysis and is a matter of professional judgment and expertise. In practice, most users only consider the magnitude of a sample test probe ratio, comparing the ratio against a threshold value. This criterion alone may often not provide the conclusive results required for diagnosing disease. MLPA probes all have their own characteristics and the level of increase or decrease that a probe ratio displays that was targeted to a region that contains a heterozygous gain or loss, may differ for each probe. Interpretation of normalized data may even be more complicated due to shifts in ratios caused by sample-to-sample variation such as: dissimilarities in PCR efficiency and size to signal sloping. Other reasons for fluctuations in probe ratios may be: poor amplification, misinterpretation of an artifact peak/band as a true probe signal, incorrect interpretation of stutter patterns or artifact peaks, contamination, mislabeling or data entry errors (Bonin et al., 2004). To make result interpretation more reliable our software combines effect-size statistics and statistical interference allowing users

following normalization steps.

**3.3 Performances and throughput** 

fall roughly into three categories.

Coffalyser.NET and used for troubleshooting.

such as: pdf, xls, txt, csv, jpg, gif, png etc.

**3.4 Reliable recognition of aberrant probes** 

Coffalyser.NET and used to optimize the obtained results.

reviewing.

to evaluate the magnitude of each probe ratio in combination with it's significance in the population. The significance of each ratio can be estimated by the quality of the performed normalization, which can be assessed two factors: the robustness of the normalization factor and the reproducibility of the sample reactions.

During the analysis our software estimates the reproducibility of each sample type in a performed experiment by calculating the standard deviation of each probe ratio in that sample type population. Since reference samples are assumed to be genetically equal, the effect of sample-to-sample variation on probe ratios of test probes is estimated by the reproducibility of these probes in the reference sample population. These calculations may be more accurate under circumstances where reference samples are randomly distributed across the performed experiment. Our program therefore provides an option to create a specific experimental setup following these criteria, thereby producing a worksheet for the wet analysis and a setup file for capillary electrophoresis devices. DNA sample names can be selected from the database and may be typed as a reference or test sample, positive control or negative control. This setup file replaces the need for filling in the sample names in the capillary electrophoresis run software thereby minimizing data entry errors.

To evaluate the robustness of the normalization factor our algorithm calculates the discrepancies computed between the probe ratios of the reference probes within each sample. Our normalization makes use of each reference probe for normalization of each test probe; thereby producing as many dosage quotients (DQ) as there are references probes. The median of these DQ's will then be used as the definite ratio. The median of absolute deviations between the computed dosage quotients may reflects the introduced mathematical imprecision of the used normalization factor. Next, our software calculates the effect of both types of variation on each test sample probe ratio and determines a 95% confidence range. By comparing each sample's test probe ratio and its 95% confidence range to the available data of each sample type population in the experiment, we can conclude if the found results are significantly different from e.g. the reference sample population or equal to a positive sample population. The algorithm then completes the analysis by evaluating these results in combination with the familiar set of arbitrary borders used to recognize gains and losses. A probe signal in concluded to be aberrant to the reference samples; if a probe signal is significantly different as from that reference sample populations and if the extent of this change meets certain criteria. The results are finally translated into easy to understand bar charts (figure 2) and sample reports allowing users to make a reliable and astute interpretation of the results.

#### **3.5 Advanced data mining options**

The database behind our software is designed in SQL and is based on a relational database management system (RDBMS). In short this means that data is stored in the form of tables and the relationship among the data is also stored in the form of tables. Our database setup contains a large number of subtraction levels, not only allowing users to efficiently store and review experimental sample data, but also allowing users to get integrative view on comprehensive data collections as well as supplying an integrated platform for comparative genomics and systems biology. While all data normalization occurs per experiment, experiments can be organized in projects, allowing advanced data-mining options enabling users to retrieve and review data in many different ways. Users can for instance review multiple MLPA sample runs from a single patient in a single report view. Results of multiple MLPA mixese may be clustered together, allowing users gain more confidence on

Analyzing of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland 133

3. Non-parametric tests (distribution-free) used to compare two or ore independent

Our software uses a SQL client–server database model to store all project/experimentrelated data. The client-server model has one main application (server) that deals with one or several slave applications (clients). Clients may communicate to a server over the network, allowing data sharing within and even beyond their institutions. Even though this system may provide great convenience e.g. for people who are working on a single project but are working on different locations, both client and server may also reside in the same system. Having both client and server on the same system has some advances over running both separately: the database is better protected and both client and server will always have the same version number. In case an older client will try to connect to a server that has a newer version number, the client needs to be updated first. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers that await incoming requests. When a new client is installed on a computer it will implement a discovery protocol in order to search for a server by means of broadcasting. The server application will then answer with its dynamic address

In addition to serving as a common data archive, the database provides user authentication, robust and scalable data management, and flexible archive capabilities via the utilities provided within Software. Our database model acts in accordance with a simple legal system, linking users to one or multiple organizations. Each user receives a certain role within each organization to which certain right are linked. These rights may for instance include denial of access to certain data but may also be used to deny access to certain parts of the program. These same levels may also be applied on project level. Projects will have project administrators and project members. The initial project creators will also be the

As soon as a user makes a connection with the server a session will be started with a unique identifier. Subsequent made changes by any user will be held to this identifier, in order to keep track of the made changes. This number is also used to secure experiment data when in use; this ensures no two users try to edit essential data simultaneously (data concurrency). When a user logs in on a certain system, all previously open session of that user will be closed. Every user can thus only be active on a single system. On closing a session, either by

In our software is equipped with MLPA sheet manager software, allowing users to obtain information about commercial MLPA kits and size markers directly from the MRC-Holland database. Next to this, the sheet manager also allows users to create custom MLPA mixes.

project administrators who are responsible for user management of that project.

4. Classification methods that can be used for predicting medical diagnosis.

groups of data.

**4. About the database** 

**4.2 User access** 

**4.3 Sessions** 

**4.1 Client server database model** 

that resolves any issues with dynamic IP addresses.

logout or by double login all old user locks will disappear.

**4.4 Data retrieval and updates** 

any found results. The database can further handle an almost unlimited number of specimens for each patient, and each specimen can additionally handle an almost unlimited number of MLPA sample runs. To each specimen additional information can be related such as sample type, tissue type, DNA extraction method, and other clinical relevant data, which can be used for a wide range of data mining operations for discovery purposes. Some of these operations include:


Fig. 2. Ratio chart of the results of a tumor sample analyzed with the P335 MLPA kit. Red dots display the probe ratios and the error bars the 95% confidence ranges. The orange box plots in the background show the 95% confidence range of the used reference samples. Map view locations are displayed on the x-axis and ratio results on the Y-axis. The red and green lines at ratio 0.7 and 1.3 indicate the arbitrary borders for loss and gain respectively. The displayed sample contains several aberrations and extra caution with interpretation is needed due to normal cell contamination.


## **4. About the database**

132 Modern Approaches To Quality Control

any found results. The database can further handle an almost unlimited number of specimens for each patient, and each specimen can additionally handle an almost unlimited number of MLPA sample runs. To each specimen additional information can be related such as sample type, tissue type, DNA extraction method, and other clinical relevant data, which can be used for a wide range of data mining operations for discovery purposes. Some of

2. Evidence based medicine, where the information extracted from the medical literature and the corresponding medical decisions are key information to leverage the decision

Fig. 2. Ratio chart of the results of a tumor sample analyzed with the P335 MLPA kit. Red dots display the probe ratios and the error bars the 95% confidence ranges. The orange box plots in the background show the 95% confidence range of the used reference samples. Map view locations are displayed on the x-axis and ratio results on the Y-axis. The red and green lines at ratio 0.7 and 1.3 indicate the arbitrary borders for loss and gain respectively. The displayed sample contains several aberrations and extra caution with interpretation is

1. Segmenting patients accurately into groups with similar health patterns.

these operations include:

made by the professional.

needed due to normal cell contamination.

## **4.1 Client server database model**

Our software uses a SQL client–server database model to store all project/experimentrelated data. The client-server model has one main application (server) that deals with one or several slave applications (clients). Clients may communicate to a server over the network, allowing data sharing within and even beyond their institutions. Even though this system may provide great convenience e.g. for people who are working on a single project but are working on different locations, both client and server may also reside in the same system. Having both client and server on the same system has some advances over running both separately: the database is better protected and both client and server will always have the same version number. In case an older client will try to connect to a server that has a newer version number, the client needs to be updated first. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers that await incoming requests. When a new client is installed on a computer it will implement a discovery protocol in order to search for a server by means of broadcasting. The server application will then answer with its dynamic address that resolves any issues with dynamic IP addresses.

#### **4.2 User access**

In addition to serving as a common data archive, the database provides user authentication, robust and scalable data management, and flexible archive capabilities via the utilities provided within Software. Our database model acts in accordance with a simple legal system, linking users to one or multiple organizations. Each user receives a certain role within each organization to which certain right are linked. These rights may for instance include denial of access to certain data but may also be used to deny access to certain parts of the program. These same levels may also be applied on project level. Projects will have project administrators and project members. The initial project creators will also be the project administrators who are responsible for user management of that project.

#### **4.3 Sessions**

As soon as a user makes a connection with the server a session will be started with a unique identifier. Subsequent made changes by any user will be held to this identifier, in order to keep track of the made changes. This number is also used to secure experiment data when in use; this ensures no two users try to edit essential data simultaneously (data concurrency). When a user logs in on a certain system, all previously open session of that user will be closed. Every user can thus only be active on a single system. On closing a session, either by logout or by double login all old user locks will disappear.

## **4.4 Data retrieval and updates**

In our software is equipped with MLPA sheet manager software, allowing users to obtain information about commercial MLPA kits and size markers directly from the MRC-Holland database. Next to this, the sheet manager also allows users to create custom MLPA mixes.

Analyzing of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland 135

Fig. 3. Schematic overview of the Coffalyser.NET software workflow.

data but one should use these options with caution.

colors.

manipulate the data.

files are made up as a sequence of bytes, which our program decodes back into lists of the different measurements. The most important measurement being the laser induced fluorescence of the covalently bound fluorescent tags on the probe products and the size marker. The frequency at which these measurements occur depends on the type of system. A complete scan will always check all filters (or channels) and result in one data point. Almost all capillary systems are able to detect multicolor dyes permitting the usage of an internal size marker providing a more accurate size call than the usage of external size marker. Multicolor dyes may also permit the analysis of loci with overlapping size ranges, thus allowing multiple MLPA mixes to be run simultaneously in different dye

After data has been imported into your SQL Server database, users can start the analysis. Users can choose to analyze the currently imported data or data that was imported in the past or a combination of both. Due to the relative nature of all MLPA data, it is recommended to analyze data within the confinements of each experiment. There do exist circumstances in which better results may be obtained by applying older collected reference

Exporting data is usually a less frequent occurrence. Coffalyser.NET therefore does not have standard tools to export raw capillary data but rather depends on the provided tools and features of the SQL server. The data may be exported to a text file and then be read by third party applications such as Access or Microsoft Excel, which can then be used to view or

The sheet manager software can be used to check if updates to any of the MLPA mixes are available. The sheet manager can further carry out automatic checks for updates at the frequency you choose, or it can be used to make manual checks whenever you wish. It can display scheduled update checks and can work completely in the background if you choose. With just one click, you can check to see if there are new versions of the program, or updated MLPA mix sheets. If updates are available, you can download them quickly and easily. In case some MLPA mixes are already in use, users may choose to hold on to both the older version and updated versions of the mix or replace the older version.

## **5. Coffalyser.NET workflow**

Figure 3 shows the graphical representation of the workflow of our software. After creating an empty solution, users can add new or existing items to the empty solution by using the "add new project" or "add new experiment" command from the client software context menu. By creating projects, users can collect data of different experiments in one collection. Next, data files can then be imported to the database and linked to an experiment. Users then need to define for each used channel or dye stream of each capillary (sample run) what the contents are. Each detectable dye channel can be set as a sample (MLPA kit) or a size marker. Samples may further be typed as: MLPA test sample, MLPA reference sample, MLPA positive control, or MLPA digested sample. The complete analysis of each MLPA experiment can be divided in 2 steps: raw data analysis and comparative analysis. Raw data analysis includes all independent sample processes such as: the recognition and signal determination of peaks in the raw data streams of imported data files, the determination of the sizes of these peaks in nucleotides and the process of linking these peaks to their original probe target sequences. After raw data analysis is finished, users can evaluate a number of quality scores (figure 6), allowing users to easily assess the quality of the produced fragment data for each sample. Users may now reject, accept and adjust sample types before starting the comparative analysis. During the comparative part of the analysis several normalization and regression analysis methods are applied in order to isolate and correct the amount of variation that was introduced over the repeated measured data. Found variation that could not be normalized out of the equation is measured and used to define confidence ranges. The software finally calculates the variation of the probes over samples of the same types, allowing subsequent by classification of unknown samples. 
After the comparative analysis is finished, users may again evaluate a number of quality scores this time concerning the quality of different properties related to the normalization. The users can finally evaluate the results by means of reporting and visualization methods.

## **5.1 Import / export of capillary data**

Importing data is the process of retrieving data from files to the SQL Server™ (for example, an ABIF file) and inserting it into SQL Server tables. Importing data from an external data source is likely to be the first step you perform after setting up your database. Our software contains several algorithms to decode binary files from the most commonly used capillary electrophoresis devices (see paragraph 2.1). Capillary devices usually store measurements of relative fluorescent units (RFU) and other related data that is collected during fragment separation in computer files encoded in binary form. Binary

The sheet manager software can be used to check if updates to any of the MLPA mixes are available. The sheet manager can further carry out automatic checks for updates at the frequency you choose, or it can be used to make manual checks whenever you wish. It can display scheduled update checks and can work completely in the background if you choose. With just one click, you can check to see if there are new versions of the program, or updated MLPA mix sheets. If updates are available, you can download them quickly and easily. In case some MLPA mixes are already in use, users may choose to hold on to both the

older version and updated versions of the mix or replace the older version.

**5. Coffalyser.NET workflow**

Figure 3 shows a graphical representation of the workflow of our software. After creating an empty solution, users can add new or existing items to it using the "add new project" or "add new experiment" command in the client software context menu. Projects let users collect the data of different experiments in one collection. Data files can then be imported into the database and linked to an experiment. For each used channel or dye stream of each capillary (sample run), users define the contents: each detectable dye channel can be set as a sample (MLPA kit) or a size marker, and samples may further be typed as MLPA test sample, MLPA reference sample, MLPA positive control, or MLPA digested sample.

The complete analysis of each MLPA experiment is divided into two steps: raw data analysis and comparative analysis. Raw data analysis includes all processes carried out on each sample independently, such as the recognition and signal determination of peaks in the raw data streams of imported data files, the determination of the sizes of these peaks in nucleotides, and the linking of these peaks to their original probe target sequences. After raw data analysis is finished, users can evaluate a number of quality scores (figure 6), allowing them to easily assess the quality of the produced fragment data for each sample. Users may then reject, accept and adjust sample types before starting the comparative analysis. During the comparative part of the analysis, several normalization and regression analysis methods are applied in order to isolate and correct the variation that was introduced over the repeatedly measured data. Variation that could not be normalized out is measured and used to define confidence ranges. The software finally calculates the variation of the probes over samples of the same type, allowing subsequent classification of unknown samples. After the comparative analysis is finished, users may again evaluate a number of quality scores, this time concerning the quality of different properties related to the normalization. The users can finally evaluate the results by means of reporting and visualization methods.

Fig. 3. Schematic overview of the Coffalyser.NET software workflow.

**5.1 Import / export of capillary data**

Importing data is the process of retrieving data from files (for example, an ABIF file) and inserting it into SQL Server™ tables. Importing data from an external data source is likely the first step you perform after setting up your database. Our software contains several algorithms to decode binary files from the most commonly used capillary electrophoresis devices (see paragraph 2.1). Capillary devices usually store measurements of relative fluorescent units (RFU), and other related data collected during fragment separation, in binary computer files. Binary files are made up of a sequence of bytes, which our program decodes back into lists of the different measurements, the most important being the laser-induced fluorescence of the covalently bound fluorescent tags on the probe products and the size marker. The frequency at which these measurements occur depends on the type of system. A complete scan always checks all filters (or channels) and results in one data point. Almost all capillary systems are able to detect multicolor dyes, permitting the use of an internal size marker, which provides a more accurate size call than an external size marker. Multicolor dyes may also permit the analysis of loci with overlapping size ranges, thus allowing multiple MLPA mixes to be run simultaneously in different dye colors.

After data has been imported into your SQL Server database, users can start the analysis, choosing to analyze the currently imported data, data imported in the past, or a combination of both. Due to the relative nature of all MLPA data, it is recommended to analyze data within the confinements of each experiment. There are circumstances in which better results may be obtained by applying older collected reference data, but these options should be used with caution.

Exporting data is usually less frequent. Coffalyser.NET therefore does not have standard tools to export raw capillary data but rather depends on the provided tools and features of SQL Server. The data may be exported to a text file and then read by third-party applications such as Microsoft Access or Excel, which can be used to view or manipulate the data.
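To give an impression of the decoding step, the fixed header of an ABIF file (the Applied Biosystems format mentioned above) can be read as in the sketch below. This is a minimal illustration following the published ABIF field layout, not the actual Coffalyser.NET importer; the function name and returned keys are our own.

```python
import struct

def read_abif_header(data: bytes) -> dict:
    """Parse the fixed ABIF header: the ASCII magic 'ABIF', a big-endian
    16-bit version number, and a 28-byte directory entry that points to the
    index of all stored items (raw RFU traces, dye names, run metadata)."""
    if data[:4] != b"ABIF":
        raise ValueError("not an ABIF file")
    (version,) = struct.unpack(">H", data[4:6])
    # Directory entry: tag name, tag number, element type, element size,
    # number of elements, total data size, data offset, data handle.
    name, num, etype, esize, nelem, dsize, offset, handle = struct.unpack(
        ">4slhhllll", data[6:34])
    return {"version": version, "dir_name": name.decode("ascii"),
            "dir_entries": nelem, "dir_offset": offset}
```

A real importer would next seek to the directory offset, read the directory records, and extract the per-channel data items containing the RFU traces.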

Analyzing of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland 137

#### **5.2 Raw data analysis**

#### **5.2.1 Baseline correction**

When performing detection of fluorescence in capillary electrophoresis devices, the spectra are sometimes contaminated by background fluorescence. Baseline curvature and offset are generally caused by the sample itself, and little can be designed into an instrument to avoid these interferences (Nancy T. Kawai, 2000). Non-specific fluorescence or background autofluorescence should be subtracted from the fluorescence obtained from the probe products to obtain the relative fluorescence resulting from the incorporation of the fluorophore. Baseline wander of the fluorescence signals may cause problems in peak detection and should be removed before peak detection starts.

Our software corrects for this baseline by applying a median signal filter to the raw signals twice. First, the signals of the first 200 data points of each dye channel are extracted and their median is calculated; the same procedure is then carried out for every subsequent 200 data points until the end of the data stream. These median values are subtracted from the signal of the original data stream to remove the baseline wander, resulting in baseline 1. Baseline 1 is then fed into a filter that calculates the median signal over every 50 subsequent data points. These median values are subtracted from all signals on baseline 1 that are below 300 RFU (for ABI devices), resulting in baseline 2. This second baseline is often necessary because of the relatively short distance between peaks that derive from probe products differing by only a few nucleotides. By applying the second baseline correction solely to signals in the lower range of detection, even peaks that reside close to each other may return to zero signal, without subtracting too much fluorescence originating from the probe products. Program administrators can modulate the default baseline correction settings, and may also store different defaults for each used capillary system.

#### **5.2.2 Peak detection**

In capillary-based MLPA data analysis, peak detection is an essential step for all subsequent analysis. Even though various peak detection algorithms for capillary electrophoresis data exist, most of them are designed for detecting peaks in sequencing profiles. While peak detection and peak size calling are very important for sequencing applications, peak quantification is not. Due to the relative nature of MLPA data, however, peak quantification is particularly important and has a large influence on the final results. Our peak detection algorithm consists of two separate steps: the first detects peaks by comparing fluorescence intensities to set arbitrary thresholds and by shape recognition; the second filters the generated peak list by relative comparison. Program administrators can modulate the peak detection thresholds, which make use of the following criteria:

1. Detection/intensity threshold: This threshold is used to filter out small peaks in flat regions. The minimal and maximal peak amplitudes are arbitrary units, and default values are provided for each different capillary system.
2. Peak area ratio percentage: Peak area is computed as the area under the curve within the distance of a peak candidate. The peak area ratio percentage is the peak area divided by the total amount of fluorescence, times one hundred. The peak area ratio percentage of a peak must be larger than the minimum threshold and lower than the maximum set threshold.
3. Model-based criterion: The application of this criterion consists of 3-4 steps:
   - Locate the start point of each peak: a candidate peak is recognized as soon as the signal increases above zero fluorescence.
   - Check whether the candidate peak meets minimal requirements: the peak signal intensity is first expected to increase; if the top of the peak is reached and the candidate peak meets the set thresholds for peak intensity and peak area ratio percentage, the peak is recognized as a true peak.
   - Discard peak candidates: a candidate is discarded if the median signal of the previous 20 data points is smaller than the current peak intensity, or if the current peak intensity returns to zero.
   - Detect the peak end: the signal is usually expected to drop back to zero, designating the peak end. In some cases the signal does not return to zero; a peak end is therefore also designated if the signal drops at least below half the intensity of the peak top and the median signal of the last 14 data points is lower than the current signal.
4. Median signal peak filter: The median peak signal is calculated as the percentage of intensity of each peak relative to the median peak signal intensity of all detected peaks. Since the minimum and maximum thresholds depend on the detected peaks, this filter is applied after an initial peak detection based on criteria 1-3.
5. Peak width filter: After peak end points have been identified, the peak width is computed as the difference between the right and left end points. The peak width should be within a given range. This filter is also applied after an initial peak detection.
6. Peak pattern recognition: This method is applied only to the size marker channel and involves calculating the correlation between the data points of the peak tops in the detected peak list (based on criteria 1-5) and the expected lengths of the set size marker. If the correlation is less than 0.999, the previous thresholds, mainly the minimal and maximal threshold values, are automatically adapted and peak detection is restarted.
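As an illustration, the block-wise median baseline filter and the intensity/area thresholds of criteria 1-2 might be sketched as below. The window size of 200 points follows the text; the intensity and area thresholds and the function names are illustrative assumptions, not Coffalyser.NET defaults.

```python
from statistics import median

def correct_baseline(signal, window=200):
    """Pass 1 of the baseline correction: subtract the median of every
    block of `window` data points from that block to remove baseline wander."""
    corrected = []
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        med = median(block)
        corrected.extend(s - med for s in block)
    return corrected

def detect_peaks(signal, min_intensity=50.0, min_area_pct=0.05):
    """Criteria 1-2: accept a candidate peak only if its top exceeds the
    intensity threshold and its area ratio percentage is large enough.
    A candidate starts where the signal rises above zero and ends where
    it returns to zero (a simplified form of the model-based criterion)."""
    total = sum(max(s, 0.0) for s in signal) or 1.0
    peaks, start = [], None
    for i, s in enumerate(signal + [0.0]):      # sentinel closes the last peak
        if s > 0 and start is None:
            start = i
        elif s <= 0 and start is not None:
            segment = signal[start:i]
            area_pct = 100.0 * sum(segment) / total
            if max(segment) >= min_intensity and area_pct >= min_area_pct:
                # store (data point of the peak top, peak height)
                peaks.append((start + segment.index(max(segment)),
                              max(segment)))
            start = None
    return peaks
```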

#### **5.2.3 Peak size calling**

Size calling is a method that compares the detected peaks of an MLPA sample channel against a selected size standard. The lengths of unknown (probe) peaks can then be predicted using a regression curve between the data points and the expected fragment lengths of the used size standard, resulting in a fragment profile (figure 4). Coffalyser.NET allows the use of 2 different size-calling algorithms: 1. the local least squares method; 2. 1st, 2nd or 3rd order least squares.


The local least squares method is the default size calling method for our software. It determines the sizes of fragments (nucleotides) by using the local linear relationship

between fragment length and mobility (data points). Local linearity is a property of functions whose graphs appear smooth, though they need not be smooth in a mathematical sense. The local least squares method makes use of a function that is only once differentiable at a point where it is locally linear; unlike the other methods, the function as a whole is not differentiable, because the slope of the tangent line is undefined. To solve the local linear function, our algorithm first calculates the intercept and coefficient for each size marker point of the curve by use of a moving predictor. A local linear size of 3 points provides three predictions for each point along the curve that is surrounded by at least 2 points; the average intercept and coefficient are then stored for that point. Points at the beginning and the end of the curve receive a single prediction, since they do not have known values on both sides. The coefficient (β) and intercept (α) are calculated by solving equations 1 and 2.

Fig. 4. MLPA fragment length profile displaying the lengths of all detected peaks from a sample. Peak lengths were determined by comparison of the data against a GS500-ABI size marker and determination of the length using the local least squares method.

$$\beta = \left(\frac{\sum X\_i Y\_i - \frac{1}{n} \sum X\_i \sum Y\_i}{\sum X\_i^2 - \frac{1}{n} (\sum X\_i)^2}\right) \tag{1}$$

$$\alpha = \left(\overline{Y} - \left(\beta \ast \overline{X}\right)\right) \tag{2}$$

E.g. if we use a size marker that has 15 known points and a local linear size of 3 points, the coefficient and intercept of point 5 will be calculated by equation 3 and 4.

$$\beta\_5 = \frac{1}{3} \sum \frac{\sum \mathbf{X}\_{3-5} \mathbf{Y}\_{3-5} - \frac{1}{3} \sum \mathbf{X}\_{3-5} \sum \mathbf{Y}\_{3-5}}{\sum \mathbf{X}\_{3-5}^2 - \frac{1}{3} (\sum \mathbf{X}\_{3-5})^2}; \frac{\sum \mathbf{X}\_{4-6} \mathbf{Y}\_{4-6} - \frac{1}{3} \sum \mathbf{X}\_{4-6} \sum \mathbf{Y}\_{4-6}}{\sum \mathbf{X}\_{4-6}^2 - \frac{1}{3} (\sum \mathbf{X}\_{4-6})^2}; \frac{\sum \mathbf{X}\_{5-7} \mathbf{Y}\_{5-7} - \frac{1}{3} \sum \mathbf{X}\_{5-7} \sum \mathbf{Y}\_{5-7}}{\sum \mathbf{X}\_{5-7}^2 - \frac{1}{3} (\sum \mathbf{X}\_{5-7})^2} \tag{3}$$

$$\alpha\_5 = \frac{1}{3} \sum \left(\overline{Y}\_{3-5} - \left(\beta\_{3-5} \ast \overline{X}\_{3-5}\right)\right); \left(\overline{Y}\_{4-6} - \left(\beta\_{4-6} \ast \overline{X}\_{4-6}\right)\right); \left(\overline{Y}\_{5-7} - \left(\beta\_{5-7} \ast \overline{X}\_{5-7}\right)\right) \tag{4}$$

To calculate the length of an unknown fragment, our algorithm uses the coefficients and intercepts calculated for the surrounding size marker peaks, one above and one below the unknown peak. Each unknown point is thus predicted twice, after which the average value is stored for that peak. If we wish to predict the length (Y) of an unknown fragment (X) whose peak-top data point lies between the data points of known fragments 5 and 6, we need to solve equation 5.

$$\mathbf{Y} = \frac{1}{2} \sum \left(\alpha\_5 + \beta\_5 \ast \mathbf{X}\right); \left(\alpha\_6 + \beta\_6 \ast \mathbf{X}\right) \tag{5}$$
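The local least squares procedure of equations 1-5 can be transcribed into a short sketch. This is an illustrative Python rendering under the assumptions stated in the text (window of 3 marker points, two flanking predictions averaged); the function names are ours, not the production code's.

```python
def local_linear_fit(xs, ys):
    """Ordinary least squares over one local window (equations 1 and 2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    beta = (sxy - sx * sy / n) / (sxx - sx * sx / n)   # eq. (1)
    alpha = sy / n - beta * sx / n                      # eq. (2)
    return alpha, beta

def size_call(marker_points, marker_lengths, x, window=3):
    """Predict the length of an unknown fragment at data point x by
    averaging the local fits of its two flanking size marker peaks (eq. 5)."""
    def point_fit(i):
        # average alpha/beta over every window containing point i (eq. 3-4);
        # end points naturally receive fewer predictions
        fits = [local_linear_fit(marker_points[s:s + window],
                                 marker_lengths[s:s + window])
                for s in range(max(0, i - window + 1),
                               min(i, len(marker_points) - window) + 1)]
        return (sum(a for a, _ in fits) / len(fits),
                sum(b for _, b in fits) / len(fits))
    # flanking marker peaks: first marker at or above x, and the one below it
    hi = next(i for i, p in enumerate(marker_points) if p >= x)
    lo = hi - 1
    (a1, b1), (a2, b2) = point_fit(lo), point_fit(hi)
    return ((a1 + b1 * x) + (a2 + b2 * x)) / 2          # eq. (5)
```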

#### **5.2.4 Peak identification**


Once all peaks have been size called, the profiles must be aligned to compare the fluorescence of the different targets across samples, perhaps the single most difficult task in raw data analysis. Peaks corresponding to similar lengths of nucleotides may still be reported with slight differences or drifts due to secondary structures or bound dye compounds. These shifts in length make a direct numerical alignment based on the original probe lengths all but impossible. Our software uses an algorithm that automatically determines which peaks correspond to each other between different samples, allowing easy peak-to-probe linkage. This procedure follows a window-based peak binning approach, whereby all peaks within a given window across different samples are considered to be the same peak (figure 5). Our algorithm follows four steps: reference profile analysis, application and prediction of new probe lengths, reiteration of the profile analysis, and data filtering of all samples.

Fig. 5. Visualization of the collection of bins for a MLPA mix (x-axis) and the signal intensities in relative fluorescent units for detected peaks of a sample (y-axis).
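The window-based binning idea can be sketched as follows. The 0.5-nucleotide window and the function name are assumptions for illustration only; the actual window is configured per MLPA mix.

```python
def assign_peaks_to_bins(peaks, bins, window=0.5):
    """Window-based binning: every detected peak whose called length falls
    within +/- `window` nucleotides of a bin's reference length is linked to
    that bin (probe), so slight run-to-run drifts in size calling do not
    break the peak-to-probe linkage.

    peaks: list of (length, height) tuples; bins: dict probe name -> length.
    """
    linked = {name: [] for name in bins}
    for length, height in peaks:
        for name, center in bins.items():
            if abs(length - center) <= window:
                # more than one peak may be linked to the same probe
                linked[name].append((length, height))
    return linked
```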


The crucial task in data binning is to create a common probe length reference vector (or bin). In the first step, our algorithm applies a bin set that searches for all peaks with a length closely resembling the design length of each probe. Next, the largest peak in each temporary bin is assumed to be the real peak descending from the related probe product. To create a stable bin, we calculate the average length over all real peaks of all used reference samples; if no reference samples exist, the median length over all collected real peaks from all samples is used.

Since some probes may show a large difference between their original and detected lengths, the previously created results may not suffice. We therefore check whether the length that we have related to each probe is applicable in our sample set, by calculating how much variation exists over the collected peak lengths in each of the previous bins. If the variation is too large (standard deviation > 0.2), or no peak at all was found in a bin, the expected peak length for that probe is estimated by prediction, using a second-order polynomial regression on the available data of the probes for which reproducible data was found. Even though a full collection of bins is now available, the predicted probe product lengths may not be very accurate. The set of bins for each probe in the selected MLPA mix is therefore improved by iterating the previous steps: the lengths provided for the bins are now based on the previously detected or predicted probe product lengths, allowing a more accurate detection of the real probe peaks. Probes that were not found are again predicted, and a final length reference vector or bin is constructed for each probe. This final bin set can be used directly for data filtering, but may also be edited manually in case the automatically created bin set does not suffice.
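The first-pass bin construction described above might look like the following sketch. The 1.0-nucleotide search tolerance and the function name are illustrative assumptions; the sd > 0.2 cut-off is the value quoted in the text, and the polynomial prediction fall-back is only flagged, not implemented.

```python
from statistics import mean, stdev

def build_bins(design_lengths, samples, tolerance=1.0, max_sd=0.2):
    """For each probe's design length, collect the largest peak within
    +/- tolerance nucleotides in every sample and average the observed
    lengths into a stable bin. Bins whose collected lengths vary too much
    (sd > max_sd) or that caught no peak at all are flagged so their length
    can be predicted by regression instead.

    samples: list of peak lists, each peak a (length, height) tuple.
    """
    bins, unstable = {}, []
    for design in design_lengths:
        hits = []
        for peaks in samples:
            in_window = [p for p in peaks if abs(p[0] - design) <= tolerance]
            if in_window:
                # the largest peak in the window is assumed to be the probe peak
                hits.append(max(in_window, key=lambda p: p[1])[0])
        if not hits or (len(hits) > 1 and stdev(hits) > max_sd):
            unstable.append(design)   # length to be predicted by regression
        else:
            bins[design] = mean(hits)
    return bins, unstable
```

The real algorithm then reiterates this procedure with the refined bin lengths before producing the final bin set.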

Data filtering is the actual process whereby the detected fragments of each sample are linked, with gene information, to a probe target or control fragment. Our algorithm assumes that peaks within each sample that fall within the same provided window or bin and have sufficient fluorescence intensity are the same probe (figure 4). Our algorithm is also able to link more than one peak to a probe within one sample. The amount of fluorescence of each probe product may then be expressed as the peak height, the peak area of the main peak, or the summarized peak area of all peaks in a bin. An algorithm can then be used to compare these metrics and decide which should optimally be used, as described in 3.2; alternatively, users may set a default metric. The summarized peak area may reflect the amount of fluorescence best if peaks are observed that show multiple tops which all originate from the amplification of the same ligation product. Such peaks may be observed if:

1. too much input DNA was added to the amplification reaction and the polymerase was unable to complete the extension of all amplicons (Clark, J. M. 1988);
2. peaks were discovered that are one base pair longer than the actual target due to non-template addition;
3. the polymerase was unable to complete the adenine addition on all products, resulting in the presence of shoulder peaks or +A/-A peaks (Applied Biosystems, 1988).

#### **5.2.5 Raw data quality control**

In the final step of the raw data analysis, the software performs several quality checks and translates them into simple scores (figure 6). These quality checks result from a comparison of sample-specific properties, such as baseline height, peak signal intensity, signal-to-size drop and incorporated percentage of primer, to expected standards specific for each capillary system. Several quality checks are furthermore performed using the control fragments, providing information about the used DNA itself, as described before (Coffa, 2008). The quality scores allow users to easily find problems due to the fragment separation process, the MLPA reaction, the DNA concentration or DNA denaturation. Users may then reject, accept and adjust sample types before starting the comparative analysis.

Fig. 6. Coffalyser.NET screenshot. FRSS means fragment run separation score; FMRS means fragment MLPA reaction score. The Probes column displays the number of found signals against the number of expected signals. The last columns display the quality of the DNA concentration and denaturation and the presence of the X and Y fragments.

#### **5.3 Comparative analysis**

During the comparative part of the analysis we aim to isolate the amount of variation that was introduced over the repeatedly measured data and to provide the user with meaningful data by means of reporting and visualization methods. The program is equipped with several normalization strategies that allow the underlying characteristics of the different types of data sets to be compared. During normalization we bring the MLPA data (probe peak signals) of unknown and reference samples to a common scale, so that more easily interpretable data can be generated. In MLPA, normalization refers to the division of multiple sets of data by a common variable or normalization constant in order to cancel out that variable's effect on the data. MLPA kits usually contain so-called reference probes, which are targeted to chromosomal regions that are assumed to remain normal (diploid) in the DNA of all used samples.

Analyzing of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland 143

Our algorithm is able to make use of the reference probes in multiple ways in order to comprise a common variable. In case a MLPA kit does not contain any reference probes, the common variable can be made out of probes selected by the user, or the program will make an auto-selection. After normalization the relative amount of fluorescence related to each probe can be expressed in dosage quotients, which is the usual method of interpreting MLPA data (Yau SC, 1996). This dosage quotient or ratio is a measure for the ratio in which the target sequence is present in the sample DNA as compared to the reference DNA, or relative ploidy. To make the normalization more robust our algorithm uses every MLPA probe signal, set as a reference probe for normalization, to produce an independent ratio (*DQi, h, j, z*). The median of all produced ratios is then taken as the final probe ratio (*DQi, h, j*). This allows for the presence of aberrant reference signals without profoundly changing the outcome. If we want to calculate the dosage quotient for test Probe J of unknown Sample I as compared to reference Sample H, by making use of reference Probes Z (1-n), we need to solve equation 6.

$$DQ_{i,h,j} = med\left( \frac{S_i P_j \,/\, S_i P_{z=1}}{S_h P_j \,/\, S_h P_{z=1}},\; \frac{S_i P_j \,/\, S_i P_{z=2}}{S_h P_j \,/\, S_h P_{z=2}},\; \ldots,\; \frac{S_i P_j \,/\, S_i P_{z=n}}{S_h P_j \,/\, S_h P_{z=n}} \right) \tag{6}$$

The data for each test probe of each sample (*DQi, h, j*) will be compared to each available reference sample (Sh=n), producing as many dosage quotients as there are reference samples. The final ratio (*DQi, j*) will then be estimated by calculating the average over these dosage quotients. In case no reference samples are set, each sample will be used as reference and the median over the ratios will be calculated.
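The dosage-quotient calculation of equation 6, followed by the averaging over reference samples, can be sketched in a few lines. This is an illustrative reconstruction with made-up peak signals; the names `dosage_quotient`, `final_ratio` and the sample dictionaries are our own, not part of Coffalyser.NET:

```python
from statistics import median, mean

def dosage_quotient(sample, reference, j, ref_probes):
    """Eq. 6: median over reference probes z of (S_i P_j / S_i P_z) / (S_h P_j / S_h P_z)."""
    return median(
        (sample[j] / sample[z]) / (reference[j] / reference[z])
        for z in ref_probes
    )

def final_ratio(sample, references, j, ref_probes):
    """Average the per-reference-sample dosage quotients into the final probe ratio."""
    return mean(dosage_quotient(sample, ref, j, ref_probes) for ref in references)

# Hypothetical peak signals: probe "P3" carries a duplication in the test sample.
refs = [{"P1": 100.0, "P2": 110.0, "P3": 95.0},
        {"P1": 105.0, "P2": 100.0, "P3": 90.0}]
test = {"P1": 200.0, "P2": 215.0, "P3": 285.0}

print(round(final_ratio(test, refs, "P3", ["P1", "P2"]), 2))  # ratio near 1.5
```

Taking the median over reference probes, rather than a single normalization constant, is what tolerates one aberrant reference signal without shifting the final ratio much.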

#### **5.3.1 Dealing with sample to sample variation**

Each MLPA probe is multiplied during the amplification reaction with a probe specific efficiency that is mainly determined by the sequence of the probe, resulting in a probe specific bias. Even though the relative difference in signal intensity of these probes between different samples can be determined by normalization or visual assessment (figure 1), the calculated ratio results may not always be easy to interpret by employing arbitrary thresholds only. This is mainly due to sample-to-sample variation or, more specifically, a difference in the amplification efficiency of probe targets between reference and sample targets. Chemical remnants from the DNA extraction procedure and other treatments the sample tissue was subjected to may leave impurities that influence the *Taq* DNA polymerase fidelity. Alternatively, target DNA sequences may have been modified by external factors, e.g. by aggressive chemical reactants and/or UV irradiation, which may result in differences in amplification rate, or in extensive secondary structures of the template DNA that may prevent access to regions of the target DNA by the polymerase enzyme (Elizabeth van Pelt-Verkuil, 2008). An effect that is commonly seen with MLPA data is a drop of signal intensity that is proportional to the length of the MLPA product fragments (figure 7). This signal to size drop is caused by a decreasing efficiency of amplification of the larger MLPA probes and may be intensified by sample contaminants or evaporation during the hybridization reaction. Signal to size drop may further be influenced by injection bias of the capillary system and diffusion of the MLPA products within the capillaries.

In order to minimize the amount of variation in and between reference and sample data and create a robust normalization strategy our algorithm follows 7 steps. By automatic interpretation of results after each step our algorithm can adjust the parameters used for the next step thereby minimizing the amount of error that may be introduced by the use of


Fig. 7. MLPA fragment profile of a sample with a large drop in signal with size. This effect may have a similar effect on the dosage quotients if not corrected for.

aberrant reference signals. The following 7 steps are performed in a single comparative analysis round:

1. Normalization of all data in population mode. Each sample will be applied as a reference sample and each probe will be applied as a reference probe.
2. Determination of the significance of the found results by automatic evaluation using effect-size statistics and comparison of samples to the available sample type populations.
3. Measurement of the relative amount of signal to size drop. If the relative drop is less than 12% a direct normalization will suffice; any larger drop will automatically be corrected by means of regression analysis (steps 4-5).
4. Before correction of the actual amount of signal to size drop, samples are corrected for the MLPA mix specific probe signal bias. This can be done by calculating the extent of this bias in each reference run by regressing the probe signals and probe lengths using a local median least squares method. Correction factors for these probe specific biases are then computed by dividing the actual probe signal by its predicted signal. The final probe-wise correction factor is then determined by taking the median of the calculated values over all reference runs. This correction factor is then applied to all runs to reduce the effect of probe bias due to particular probe properties on the forthcoming regression normalization.
5. Next we calculate the amount of signal to size drop for every sample by using a function where the log-transformed, probe bias corrected signals are regressed against the probe lengths using a 2nd order least squares method. Signals from aberrant targets are left out of this function by applying an outlier detection method that makes use of the results found at step 2, as well as correlation measurements of the predicted line. The signal to size corrected values can then be obtained by calculating the distance of each log-transformed pre-normalized signal to its predicted signal.
6. Normalization of the signal to size corrected data in the user selected mode and determination of the significance of the found results.
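The signal-to-size regression described in step 5 can be sketched as a 2nd-order least-squares fit of log-transformed signals against probe length. This is a minimal illustration under our own assumptions: made-up probe lengths and signals, numpy's `polyfit` standing in for the described least-squares method, and no outlier exclusion:

```python
import numpy as np

# Made-up probe lengths (nt) and peak signals showing a typical signal-to-size drop.
lengths = np.array([130.0, 160.0, 190.0, 220.0, 250.0, 280.0, 310.0, 340.0])
signals = np.array([1000.0, 900.0, 800.0, 700.0, 610.0, 530.0, 460.0, 400.0])

log_s = np.log(signals)
coef = np.polyfit(lengths, log_s, deg=2)   # 2nd order least-squares fit
predicted = np.polyval(coef, lengths)      # predicted log signal per probe length
corrected = log_s - predicted              # distance of each log signal to its prediction
print(np.round(corrected, 4))
```

After correction the size-dependent trend is removed and the residuals scatter around zero, so probes of different lengths can be compared on an equal footing.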




Our algorithm then measures the amount of variation that could not be resolved in the final normalization, to aid in results interpretation and automatic sample classification. To measure the imprecision of the normalization constant, each time a sample is normalized against a reference, the median of absolute deviations (MAD *i, h, j*) is calculated between the final probe ratio (*DQi, h, j*) and the independent dosage quotients obtained using each reference probe (*DQi, h, j, z*). The collected MAD *i, h, j* values are then averaged over the reference samples to estimate the final amount of variation introduced by the imprecision of the reference probes. Our algorithm estimates the final MAD *i, j* for each probe J in sample I by equation 7.

$$MAD_{i,j} = \frac{1}{N} \sum_{h=1}^{N} \mathop{med}_{z=1 \ldots m} \left( \left| DQ_{i,h,j,z} - DQ_{i,h,j} \right| \right) \tag{7}$$

Since the final probe ratio (*DQi, j*) for each probe in each sample is estimated by the average over the dosage quotients (*DQi, h, j*) that were calculated using each reference sample (equation 8), the amount of variation that was introduced over the different samples is estimated by calculating the standard deviation over these probe ratios (equation 9).

$$DQ_{i,j} = \frac{1}{N} \sum_{h=1}^{N} \mathop{med}_{z=1 \ldots m} \left( \frac{S_i P_j \,/\, S_i P_z}{S_h P_j \,/\, S_h P_z} \right) \tag{8}$$

$$\sigma_{i,j} = \sqrt{ \frac{1}{N} \sum_{h=1}^{N} \left( DQ_{i,h,j} - DQ_{i,j} \right)^2 } \tag{9}$$
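Equations 7-9 can be sketched for a single probe as follows. The per-reference-probe quotients are hypothetical and the variable names are ours:

```python
from statistics import mean, median

def mad(dq_z, dq_final):
    """Eq. 7, inner term: median absolute deviation of the per-reference-probe
    quotients DQ_{i,h,j,z} from that reference's final probe ratio DQ_{i,h,j}."""
    return median(abs(q - dq_final) for q in dq_z)

# Hypothetical quotients DQ_{i,h,j,z} for one probe, against two reference samples:
per_ref = {"ref1": [0.98, 1.02, 1.05], "ref2": [0.95, 1.00, 1.04]}

dq_h = {h: median(qs) for h, qs in per_ref.items()}              # DQ_{i,h,j} per reference
dq_final = mean(dq_h.values())                                   # eq. 8: average over references
mad_final = mean(mad(qs, dq_h[h]) for h, qs in per_ref.items())  # eq. 7
sigma = mean((d - dq_final) ** 2 for d in dq_h.values()) ** 0.5  # eq. 9
print(round(dq_final, 3), round(mad_final, 3), round(sigma, 3))
```

The two spread measures capture different error sources: `mad_final` reflects disagreement between reference probes within one comparison, `sigma` reflects disagreement between reference samples.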

Our algorithm then estimates the 95% confidence range of each probe ratio (*DQi, j*) of each sample by following 3 steps:

1. Conversion of the MAD values to standard deviations by multiplying with 1.4826.
2. Calculation of a single standard deviation for each probe ratio by combining the calculated value of step 1 with the standard deviation calculated over the reference samples by equation 9. This can be done by first converting both standard deviations to variances by raising the values to the power of two; then we sum up the outcome of both and take the square root.
3. Defining the limits of the confidence range by adding and subtracting a number of standard deviations of the final probe ratio (*DQi, j*) from equation 8.



Discrepancies in the dosage quotients estimated by the used reference probes and/or reference samples may lead to an increase of the width of this confidence range, indicating a poor normalization. Since 95% is commonly taken as a threshold indicating virtual certainty (Zar, J.H., 1984), our algorithm by default uses 1.96 standard deviations (equation 10) to calculate the confidence ranges for probe ratios.

$$DQ_{i,j}^{95\%} = DQ_{i,j} \pm 1.96 \cdot \sqrt{\left(1.4826 \cdot MAD_{i,j}\right)^2 + \left(\sigma_{i,j}\right)^2} \tag{10}$$
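The combination of both error sources into the confidence range of equation 10 can be sketched as follows (a minimal illustration; the input values are made up):

```python
import math

def confidence_range(dq, mad, sigma, z=1.96):
    """Eq. 10: convert the MAD to a standard deviation (x 1.4826), combine it with
    the between-reference-sample deviation by summing the variances, take the
    square root, and add/subtract z standard deviations around the probe ratio."""
    half_width = z * math.sqrt((1.4826 * mad) ** 2 + sigma ** 2)
    return dq - half_width, dq + half_width

# Made-up inputs: probe ratio 1.01 with MAD 0.035 and sigma 0.01.
lo, hi = confidence_range(dq=1.01, mad=0.035, sigma=0.01)
print(round(lo, 3), round(hi, 3))
```

Summing variances rather than standard deviations treats the two error sources as independent, which is the usual assumption when combining uncertainty estimates.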

#### **5.3.2 Interpretation of the calculated dosage quotients**

The previous sections explained how probe ratios are calculated and how our algorithm estimates the amount of introduced variation. In this section, we reflect on what those results mean for the empirical comparisons made by users. To make data interpretation easier our program allows the use of advanced visualization methods, but also contains an algorithm allowing automatic data interpretation. Our algorithm compares the ratio and standard deviation of a test probe from a single sample to the behavior of that probe within a subcollection of samples. This allows the program, for instance, to recognize if a result from an unknown sample is significantly different from the results found in the reference sample population. Alternatively, it may find that a sample is equal to a sample population, for instance a group of positive control samples. To make an estimation of the behavior of a probe ratio within a sample population, we calculate the average value and standard deviation for each probe over samples with the same sample type. In order to calculate the confidence range of probe J in, for instance, the reference sample population, we need to solve equation 11. N in this case refers to all probe ratio results (*DQi, j*) from samples that were defined in the normalization setup with the sample type: reference sample (h).

$$DQ_{ref,j}^{95\%} = \overline{DQ_{i,j}} \pm 1.96 \cdot \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( DQ_{i,j} - \overline{DQ_{i,j}} \right)^2} \tag{11}$$

Probe results of each sample are then classified into three categories by comparison to the confidence ranges of the available sample types. A probe result of a sample is either significantly different from a sample population, equal to a sample population, or ambiguous. To define if a probe result of an unknown sample is significantly different (>>\*) from a sample population, our algorithm employs 2 criteria:

1. The difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples with the same sample type, needs to exceed a delta value of 0.3. In case an unknown sample is compared to the reference sample population, the average ratio for each probe always approaches 1.
2. The confidence range of the probe of the unknown sample (equation 10) cannot overlap with the confidence range of that probe in a sample population (equation 11).


An unknown sample is classified to be equal (=) to the population of a certain sample type if:

1. The difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples with the same sample type, is less than 0.3.
2. The probe ratio of the unknown sample falls within the confidence range of that probe in a sample population (equation 11).
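The two-criteria classification can be sketched as a small decision function. The delta threshold of 0.3 and the 95% ranges follow the text; the function name and example values are our own illustrative assumptions:

```python
def classify(ratio, ci, pop_mean, pop_ci, delta=0.3):
    """Classify one probe result against a sample population.
    'different': ratio shift > delta AND non-overlapping confidence ranges.
    'equal':     ratio shift < delta AND ratio inside the population range.
    Anything meeting only one criterion is 'ambiguous'."""
    shift = abs(ratio - pop_mean)
    overlap = ci[0] <= pop_ci[1] and pop_ci[0] <= ci[1]
    inside = pop_ci[0] <= ratio <= pop_ci[1]
    if shift > delta and not overlap:
        return "different"
    if shift < delta and inside:
        return "equal"
    return "ambiguous"

# Hypothetical probe ratios against a reference population centred on 1.0:
print(classify(1.55, (1.40, 1.70), 1.0, (0.90, 1.10)))  # prints "different"
print(classify(1.02, (0.95, 1.09), 1.0, (0.90, 1.10)))  # prints "equal"
```

Requiring both criteria at once is what keeps borderline results out of the "different" and "equal" bins; they fall through to the ambiguous markers described below.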



Probe results that are ambiguous consequently meet only one of the two criteria required to characterize the result as different or equal. Ambiguous probe results that do show a difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples with the same sample type, but have overlapping 95% confidence ranges, will be marked with an asterisk (\*). In case the overlap of the confidence ranges is less than 50%, the probe results will be marked with a smaller or greater than symbol plus asterisk (<\* or >\*). Ambiguous probe results that do not show a difference in the magnitude of the probe ratio, but do show a difference in confidence ranges, may be displayed with single or double smaller or greater than symbols, depending on the size of the difference.


Fig. 8. Part of a pdf report from a tumor sample analyzed with the P335 MLPA kit. The report shows clear aberrations at 9p21.3, 9p13.2 and 12q21.33. Less clear is the ratio of RB1, which displays a slight decrease in signal as opposed to the reference population, but doesn't surpass the threshold value, due to sample mosaicism.

#### **5.3.3 Reporting and visualization**


Automatic data interpretation cannot replace the specialist judgment of a researcher. Knowledge about the expected genetic defect of the target DNA and other sample information may be crucial. To assist the user with data interpretation, our software automatically sorts all probe results based on the last updated map view locations of the probes. Chromosomal aberrations often span larger regions (M. Hermsen, 2002), which allows probes targeted to that region to cluster together by sorting. Our software can then generate single-page PDF reports, containing a summary of all relevant data: probe ratios (figure 8), statistics, quality controls and charts (figure 2 & 4) of a single sample.


Fig. 9. Screen shot from a tumor sample analyzed with the P335 MLPA kit. Probe ratio results of targets estimated as significantly increased as opposed to the reference population are marked green; those estimated as significantly decreased are marked red.

Our software further enables users to display MLPA sample results in a large array of different chart types (figure 2 & 4). Charts may all be exported to different formats such as jpg, gif, tiff, png and bmp. The results of a complete experiment may be plotted together in grids, and heat map algorithms may be applied to provide users with a simple overview (figure 9). These grids may be exported to file formats (XML, txt, csv) that may be opened in Microsoft Excel. Alternatively these grids may also be exported to PDF files or several imaging formats.

## **6. Conclusions and future research**

In this chapter we showed the options and applied algorithms of our MLPA analysis software, called Coffalyser.NET. Our software integrates new technologies enhancing the speed, accuracy and ease of MLPA analysis. Recognition of aberrations is improved by combining effect-size statistics with statistical inference, allowing users to interpret units of measurement that are meaningful on a practical level (L. Wilkinson, 1999), while also being able to draw conclusions from data that are subject to random variation, for example sampling variation (Bickel, Peter J.; Doksum, Kjell A., 2001). Our software contains extensive methods for results reporting and interpretation. It may also provide an alternative to software such as Applied Biosystems Genotyper® and GeneScan® or GeneMapper® software, LiCor's SAGA, MegaBACE® Genetic Profiler and Fragment Profiler, and is compatible with outputs from all major sequencing systems, i.e. ABI Prism®, Beckman CEQ and MegaBACE® platforms. Coffalyser.NET is public freeware and can be downloaded from the MRC-Holland website.

Using data-mining techniques such as support vector machines on the large volumes of data obtained by large-scale MLPA experiments may serve as a powerful and promising mechanism for recognizing result patterns, which can be used for classification. Our future directions therefore concentrate on developing novel methods and algorithms that can improve recognition of disease related probe ratio patterns, optimizing results in terms of validity, integrity and verification.

## **7. References**

Ahn, J.W. (2007). Detection of subtelomere imbalance using MLPA: validation, development of an analysis protocol, and application in a diagnostic centre. *BMC Medical Genetics*, 8:9

Albert, J. (2007). Bayesian Computation with R. Springer, New York

Applied Biosystems. (1988). AmpFℓSTR® Profiler Plus™ PCR Amplification Kit user's manual.

Bickel, Peter J.; Doksum, Kjell A. (2001). Mathematical statistics: Basic and selected topics. 1

Clark, J.M. (1988). Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. *Nucleic Acids Res* 16 (20): 9677-86.

Coffa, J. (2008). MLPAnalyzer: data analysis tool for reliable automated normalization of MLPA fragment data. *Cellular Oncology*, 30(4): 323-35

Elizabeth van Pelt-Verkuil, Alex Van Belkum, John P. Hays (2008). Principles and technical aspects of PCR amplification.

Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of Research Results. United Kingdom: Cambridge University Press.

González, J. (2008). Probe-specific mixed model approach to detect copy number differences using multiplex ligation dependent probe amplification (MLPA). *BMC Bioinformatics*, 9:261

Hermsen, M., Postma, C. (2002). Colorectal adenoma to carcinoma progression follows multiple pathways of chromosomal instability. *Gastroenterology*, 123, 1109-1119

Holtzman, N.A., Murphy, P.D., Watson, M.S., Barr, P.A. (1997). Predictive genetic testing: from basic research to clinical practice. *Science* 278 (5338): 602-5.

Huang, C.H., Chang, Y.Y., Chen, C.H., Kuo, Y.S., Hwu, W.L., Gerdes, T. and Ko, T.M. (2007). Copy number analysis of survival motor neuron genes by multiplex ligation-dependent probe amplification. *Genet Med*. 4, 241-248.

Janssen, B., Hartmann, C., Scholz, V., Jauch, A. and Zschocke, J. (2005). MLPA analysis for the detection of deletions, duplications and complex rearrangements in the dystrophin gene: potential and pitfalls. *Neurogenetics*. 1, 29-35.

Kluwe, L., Nygren, A.O., Errami, A., Heinrich, B., Matthies, C., Tatagiba, M. and Mautner, V. (2005). Screening for large mutations of the NF2 gene. *Genes Chromosomes Cancer*. 42, 384-391.

Michils, G., Tejpar, S., Thoelen, R., van Cutsem, E., Vermeesch, J.R., Fryns, J.P., Legius, E. and Matthijs, G. (2005). Large deletions of the APC gene in 15% of mutation-negative patients with classical polyposis (FAP): a Belgian study. *Hum Mutat.* 2, 125-34.

Nakagawa, Shinichi; Cuthill, Innes C. (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. *Biological Reviews Cambridge Philosophical Society* 82 (4): 591-605

"NCBI: Genes and Disease". NIH: National Center for Biotechnology Information (2008).

Redeker, E.J., de Visser, A.S., Bergen, A.A. and Mannens, M.M. (2008). Multiplex ligation-dependent probe amplification (MLPA) enhances the molecular diagnosis of aniridia and related disorders. *Mol Vis.* 14, 836-840.

Schouten, J.P. (2002). Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification. *Nucleic Acids Research*, 20 (12): e57

Scott, R.H., Douglas, J., Baskcomb, L., Nygren, A.O., Birch, J.M., Cole, T.R., Cormier-Daire, V., Eastwood, D.M., Garcia-Minaur, S., Lupunzina, P., Tatton-Brown, K., Bliek, J., Maher, E.R. and Rahman, N. (2008). Methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) robustly detects and distinguishes 11p15 abnormalities associated with overgrowth and growth retardation. *J Med Genet.* 45, 106-13.

Sequeiros, Jorge; Guimarães, Bárbara (2008). Definitions of Genetic Testing. EuroGentest Network of Excellence Project.

Taylor, C.F., Charlton, R.S., Burn, J., Sheridan, E. and Taylor, G.R. (2003). Genomic deletions in MSH2 or MLH1 are a frequent cause of hereditary non-polyposis colorectal cancer: identification of novel and recurrent deletions by MLPA. *Hum Mutat.* 6, 428-33.



**Part 3** 

**Quality Control for Biotechnology** 

