providing the readings reliably show the changes. This division into two roles may at first seem a little pedantic, but a monitor that does not measure cardiac output accurately may still be useful clinically if it detects trends reliably. As most bedside cardiac output monitors in use today measure cardiac output continuously, although many are not particularly accurate, the issue of being a reliable trend monitor becomes very relevant. Unfortunately, the majority of published validation studies have only addressed accuracy [37].

### **6.2. Understanding errors**

The error that arises when measuring cardiac output has two basic components:

**i.** Random error that arises from the act of measuring, and

**ii.** Systematic error that arises from the measurement system.

If I use a measuring tape to measure the heights of patients attending a clinic, my readings may vary by a few millimeters from the true height of each patient. This is random error. But if the measuring tape is stretched by 2 to 3 centimeters, then every reading I take will consistently under-read the height of each patient by a few centimeters. This is a systematic error. The division of measurement error into random and systematic components plays an important role in the choice of statistical techniques used for validation.

One of the main sources of systematic error is imprecise calibration. Calibration is performed by (a) measuring cardiac output using a second method such as thermodilution, or (b) using population data to derive cardiac output from the patient's demographics (i.e. age, height and weight). Unfortunately, cardiac output and related parameters vary between individuals. In the Nidorf nomogram used to predict aortic valve size when using suprasternal Doppler cardiac output, the range of possible values about the mean for valve size at each height is ±16% [23]. This gives rise to a significant systematic error between patients, and this error impacts upon accuracy when Bland-Altman comparisons are made against a reference method [38]. However, reliability during trending may still be preserved, because trending involves a series of readings from one single patient. Providing the systematic error remains constant, and the random measurement errors between the series of readings are acceptably low, the monitor can still detect changes in cardiac output reliably.

The accepted method of presenting errors in validation statistics is to use (a) percentages of mean cardiac output and (b) 95% confidence intervals, which approximate to two standard deviations. The term precision error is used, and should not be confused with the percentage error, which is one of the outcomes of Bland-Altman analysis.

64 Artery Bypass

**7. Addressing statistical issues**

### **7.1. Simple comparisons against a reference method**

Validation in the clinical setting is usually performed by comparing readings from the method being tested against a reference method. Traditionally, single-bolus thermodilution cardiac output performed using a PAC has been used. The average of three thermodilution readings is taken, and aberrant readings that differ by more than 10% are rejected, in order to improve the precision. However, thermodilution is not a gold standard method, and significant measurement errors, both random and systematic, arise when it is used. It is generally accepted that thermodilution has a precision error of ±20%. True gold standard methods, such as aortic flow probes, have precision errors of less than ±5%. Thus, thermodilution is an imprecise reference method, and its use greatly influences the statistical analysis. Most of the benchmarks against which the outcomes of validation studies are judged are based on this precision of ±20%.

More precise, gold standard reference methods could be used, such as the Fick method or a flow probe surgically placed on the aorta. However, their use in the clinical setting is inappropriate, and thus the current clinical standard for cardiac output measurement, thermodilution via a PAC, is used. The current decline in the clinical use of PACs has left a void. Consequently, some recently published validation studies have used transpulmonary thermodilution using the PiCCO system, or oesophageal Doppler monitoring using the CardioQ, as alternative reference methods.

#### **7.2. The precision error of thermodilution**

Recently, the precision of ±20% for thermodilution has come under scrutiny. The reason that thermodilution is said to have a precision error of ±20% can be attributed to our 1999 publication on bias and precision statistics, which first proposed percentage error [39]. In the 1990s the consensus of opinion was that for a monitor to be accepted into clinical use it should be able to detect a change in cardiac output of at least 1 L/min when the mean cardiac output was 5 L/min, which is a 20% change [40,41]. Furthermore, Stetz and colleagues' meta-analysis of studies from the 1970s validating the thermodilution method suggested that it had a precision of 13-22% [42]. The 30% benchmark percentage error that is widely quoted today was based on a precision error of ±20% for thermodilution. However, it now seems that the precision of thermodilution is very variable, depending on the type of patient and the measurement system used [43]. Recently, Peyton and Chong have suggested that the precision of thermodilution may be as large as ±30% [44].

#### **7.3. Study design**

Study design becomes significant when the ability to detect trends, in addition to accuracy, is investigated. To determine accuracy one needs only a single pair of cardiac output readings, test and reference, from each patient. Test refers to the new method being validated, and reference to the clinical standard, thermodilution, though ideally a gold standard method should be used. Test and reference readings should ideally be performed simultaneously, because cardiac output is not a static parameter and fluctuates between cardiac cycles. The study usually includes twenty or more pairs of readings.

Study design becomes more complicated if the ability to detect trends is being investigated. A series of paired readings from the same patient is now needed that shows changes in cardiac output. A wide range of cardiac output values is also needed. A new parameter called delta cardiac output (∆CO) is calculated for both test and reference data, using the difference between consecutive readings. Trend analysis is performed on the ∆COs. The data can be collected (a) at random or (b) at predetermined time points. Readings collected at random can lead to uneven data distribution. Thus, a more rigid protocol with data collected at predetermined time points tends to be used, commonly 6 to 10 time points. A typical protocol for a patient having cardiac surgery might be: (T1) before anaesthesia, (T2) after induction, (T3) after sternotomy, (T4) after bypass, (T5) after closure of the chest and (T6-8) at set times on the intensive care unit.
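The ∆CO calculation described above can be sketched as follows. The readings are purely illustrative, assuming six time points (T1-T6) from one patient; the sketch also shows the point made in section 6.2, that a constant systematic offset between the two methods cancels out when consecutive differences are taken:

```python
# Sketch: deriving delta cardiac output (∆CO) from readings taken at
# predetermined time points. All values are hypothetical, in L/min.

def delta_co(readings):
    """Differences between consecutive cardiac output readings."""
    return [round(b - a, 2) for a, b in zip(readings, readings[1:])]

# Hypothetical readings at time points T1..T6 for a single patient.
reference = [5.0, 4.2, 4.8, 5.6, 5.1, 4.7]   # e.g. thermodilution
test      = [5.5, 4.7, 5.3, 6.1, 5.6, 5.2]   # same trend, +0.5 L/min systematic offset

d_ref = delta_co(reference)
d_test = delta_co(test)

print(d_ref)   # [-0.8, 0.6, 0.8, -0.5, -0.4]
print(d_test)  # [-0.8, 0.6, 0.8, -0.5, -0.4]  (constant offset cancels in ∆CO)
```

Trend analysis is then performed on these paired ∆CO values; provided the systematic error stays constant across the series, it does not appear in the ∆COs at all.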

**8.2. The Bland-Altman plot**

The agreement between two measurement techniques, test and reference, is evaluated by calculating the bias, which is the difference between each pair of readings, test minus reference. In the Bland-Altman plot the bias of each pair of readings (y-axis) is plotted against the average of the two readings (x-axis) (Figure 13). Three horizontal lines are then added to the plot: (a) the mean bias for all the data points and (b) the two 95% confidence interval lines for the bias (mean bias ± 1.96 × the standard deviation of the bias), known as the "limits of agreement". Sufficient data should also be provided to allow the calculation of percentage error.

Minimally Invasive Cardiac Output Monitoring in the Year 2012

http://dx.doi.org/10.5772/54413

**Figure 13.** Bland and Altman plot showing test and reference cardiac output (CO) data points. The mean bias and limits of agreement lines (dashed) have been added to the plot. 95% of the data points fall between these limits. The percentage error has been calculated from the mean CO and limits of agreement. Note the slightly skewed distribution of the data shown by the sloping regression line (dotted).

**8.3. Modifications to the B-A plot**

**i.** Some investigators argue that the best estimate of cardiac output (x-axis), or the reference value, should be used instead of the average.

**ii.** When the study protocol collects more than one set of data from each patient, the limits of agreement should be adjusted for repeated measures. The effect of having multiple readings from the same subject is to reduce the influence of systematic errors, thus decreasing the standard deviation of the bias and narrowing the limits of agreement. As a consequence the limits become falsely small. Two recent articles describe how to perform a correction for repeated measures [46,47]. The models used in the two corrective methods are slightly different.

**iii.** The Bland-Altman plot assumes that both the test and reference methods have the same calibrated scales for measuring cardiac output. Otherwise, the distribution of data will be sloping and the limits of agreement falsely wide. Bland and Altman described a logarithmic transformation to deal with this scenario [45].
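The basic Bland-Altman calculations in section 8.2 can be sketched as follows. This is a minimal sketch assuming a single test/reference pair per patient (so no repeated-measures correction applies), and the readings are invented for illustration only:

```python
# Sketch of the Bland-Altman calculations: mean bias, limits of
# agreement and percentage error. Hypothetical data, in L/min.
from statistics import mean, stdev

test = [4.8, 5.6, 3.9, 6.2, 5.1, 4.4, 5.9, 4.1, 5.3, 4.7]  # method under validation
ref  = [5.1, 5.2, 4.3, 5.8, 5.4, 4.1, 5.5, 4.6, 5.0, 5.0]  # reference, e.g. thermodilution

bias = [t - r for t, r in zip(test, ref)]   # test minus reference
mean_bias = mean(bias)
half_width = 1.96 * stdev(bias)             # 95% interval about the mean bias
upper_loa = mean_bias + half_width          # upper limit of agreement
lower_loa = mean_bias - half_width          # lower limit of agreement

# Percentage error: limits of agreement as a percentage of mean CO.
mean_co = mean((t + r) / 2 for t, r in zip(test, ref))
percentage_error = 100 * half_width / mean_co

print(f"mean bias: {mean_bias:+.2f} L/min")
print(f"limits of agreement: {lower_loa:+.2f} to {upper_loa:+.2f} L/min")
print(f"percentage error: {percentage_error:.0f}%")
```

By the convention discussed in section 7.2, a percentage error below about 30% would conventionally be judged acceptable when thermodilution is the reference; the skew and scale issues in points i-iii above would still need to be checked before drawing that conclusion.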
