#### **4.1 Descriptive statistics and inferential statistics**

Statistics is a branch of mathematics. The word "statistics" derives from the Latin word *status*, meaning "manner of standing" or "position". Statistics were first used by tax assessors to collect information for determining assets and assessing taxes. Nowadays, the applications of statistics are broad and include business, marketing, economics, agriculture, education, medicine and other fields. Statistics applied to medicine and other health disciplines is called biostatistics or biometrics. Readers who would like to review or study this subject may consult the book by Zar (2010).

Statistics is divided into two branches: descriptive and inferential. The goal of descriptive statistics is to organize and summarize data; the goal of inferential statistics is to draw inferences and reach conclusions about a population when only a sample from that population has been studied. A population is a complete set of observations, patients, measurements, and so forth. A sample is a subset of a population.

To organize data and summarize their main characteristics we can use *tables*, *graphs* and *quantitative indices*. Tables are often used to present qualitative and quantitative data. Graphs are widely used to provide a visual display of data; the bar diagram, the histogram and the frequency polygon are three graphic formats commonly used to present medical data. A table or a graph in which all values of a variable of interest are displayed together with their corresponding frequencies is called a *frequency distribution*, or simply a *distribution*. We shall see in the next subsection that in inferential statistics we are interested in a special type of distribution: a probability distribution.
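As a simple illustration, the sketch below builds a frequency distribution for a qualitative variable using Python's standard library; the variable and its values are hypothetical, not data from this chapter.

```python
from collections import Counter

# A minimal frequency distribution for a qualitative variable
# (hypothetical risk categories recorded for 12 patients).
observations = ["good", "intermediate", "good", "poor", "good",
                "intermediate", "good", "poor", "good",
                "intermediate", "good", "good"]

distribution = Counter(observations)
for value, frequency in distribution.most_common():
    print(f"{value:<12} {frequency}")
```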

Quantitative indices are numbers that describe the center and the variation of a distribution. Quantitative indices that describe the center of a distribution are referred to as *measures of central tendency*. The mean (also known as the arithmetic mean), the median and the mode are three common measures of central tendency. Quantitative indices that describe the variation or dispersion of a distribution are referred to as *measures of dispersion*. The range, the variance and the standard deviation are three common measures of dispersion. In medicine we also work with other quantitative indices, such as the risk difference, the relative risk and the odds ratio. To draw inferences and reach conclusions about a population when only a sample from that population has been studied, we need probability, because we use *statistical hypothesis testing* and *estimates*.
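For concreteness, the following sketch computes these measures for a small, made-up sample and a hypothetical 2×2 table, using only Python's standard `statistics` module; all values are illustrative, not real data.

```python
import statistics

# Hypothetical sample of a clinical measurement (e.g., hemoglobin in g/dL).
values = [11.2, 12.5, 10.8, 13.1, 12.5, 11.9, 12.0, 10.5]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)              # most frequent value

# Measures of dispersion.
data_range = max(values) - min(values)
variance = statistics.variance(values)      # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)

print(f"mean={mean:.2f} median={median:.2f} mode={mode}")
print(f"range={data_range:.2f} variance={variance:.2f} sd={std_dev:.2f}")

# Risk-based indices from a hypothetical 2x2 table
# (rows: exposed/unexposed; columns: event/no event).
a, b = 30, 70    # exposed group: events, non-events
c, d = 10, 90    # unexposed group: events, non-events
risk_difference = a / (a + b) - c / (c + d)
relative_risk = (a / (a + b)) / (c / (c + d))
odds_ratio = (a * d) / (b * c)
print(f"RD={risk_difference:.2f} RR={relative_risk:.2f} OR={odds_ratio:.2f}")
```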

Statistical hypothesis tests can be *parametric* or *nonparametric*. Surveys of the statistical methods used in journals indicate that the *t test* is one of the most commonly used statistical tests; the percentage of articles that use the t test ranges from 10% to more than 60%. Williams and colleagues (1997) noted a number of problems in the use of the t test, and Welch and Gabbe (1996) found a number of errors in which the t test was used when a nonparametric procedure was called for. Thus, using these techniques properly requires some skill in choosing the correct statistical test.

#### **4.2 Parametric and nonparametric tests**

Combining the notion of a frequency distribution (descriptive statistics) with probability, we can explain what a probability distribution means. A *probability distribution* is a table or a graph that describes the probability that an event occurs. It describes what will probably happen instead of describing what really happened.

A probability distribution can be discrete or continuous. An important example of a discrete probability distribution is the *binomial distribution*, and an important example of a continuous distribution is the *normal distribution*.

The binomial distribution is generated from a series of Bernoulli trials, named in honor of James Bernoulli (1654-1705). The binomial distribution is used when there are only two possible outcomes, which are mutually exclusive: for example, survived/died, male/female, or adequate/inadequate.
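As an illustration of how binomial probabilities arise from Bernoulli trials, here is a minimal sketch; the response rate and the number of patients are assumed values, not data from this chapter.

```python
from math import comb

# Binomial probability P(X = k) for n independent Bernoulli trials,
# each with success probability p.
def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability that exactly 3 of 10 patients respond,
# assuming a 20% response rate per patient.
print(binomial_pmf(3, 10, 0.2))                          # ~0.2013
# Probability that at most 3 patients respond.
print(sum(binomial_pmf(k, 10, 0.2) for k in range(4)))   # ~0.879
```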

The normal distribution is also known as the Gaussian distribution, in honor of Carl F. Gauss (1777-1855), who made significant contributions to its development at the beginning of the 19th century. The geometric representation of such a distribution is a symmetric, bell-shaped curve known as the normal curve. The most important characteristic of the normal curve is the following: if perpendiculars are erected at a distance of one standard deviation above and one standard deviation below the mean, approximately 68% of the total area lies between these perpendiculars, the x-axis and the curve. If perpendiculars are constructed at a distance of two standard deviations above and below the mean, approximately 95% of the total area is enclosed. If perpendiculars are set at a distance of three standard deviations to the left and right of the mean, approximately 99.7% of the total area is included, as shown in Figure 5. Since there is a correspondence between area and probability, this gives us the probability that data lie within k standard deviations of the mean, for k = 1, 2 or 3.
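The 68%, 95% and 99.7% figures can be checked numerically from the normal cumulative distribution function, as in the short sketch below (Python standard library only).

```python
from math import erf, sqrt

# Area of the standard normal curve within k standard deviations of the mean:
# P(|Z| <= k) = erf(k / sqrt(2)).
def normal_within_k_sd(k: float) -> float:
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"k = {k}: {normal_within_k_sd(k):.4f}")
# k = 1: 0.6827   (about 68%)
# k = 2: 0.9545   (about 95%)
# k = 3: 0.9973   (about 99.7%)
```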

Comparing the above information with the information given by the Chebyshev theorem, we note that we obtain more precise statements when the data follow a normal distribution. The Chebyshev theorem states that, for any probability distribution, the probability that data are located within k standard deviations of the mean is at least 1 − 1/k², where k is a number greater than 1. Table 4 compares Chebyshev's proportions with the proportions of a normal distribution for k = 1, 2, 3 and 4.

Fig. 5. Gaussian distribution.

| k | Any distribution | Normal distribution |
|---|------------------|---------------------|
| 1 | no information | 68% |
| 2 | ≥ 75% | 95% |
| 3 | ≥ 88% | 99.7% |
| 4 | ≥ 93.75% | 99.9% |

Table 4. Chebyshev's proportions compared with the proportions of a normal distribution.
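The comparison in Table 4 can be reproduced with a few lines of Python, computing Chebyshev's lower bound 1 − 1/k² alongside the exact normal proportion (the normal values are printed with two decimals, so they are slightly more precise than the rounded figures in the table).

```python
from math import erf, sqrt

# Chebyshev's lower bound versus the exact normal proportion (cf. Table 4).
for k in (1, 2, 3, 4):
    chebyshev = max(0.0, 1 - 1 / k**2)   # k = 1 gives 0: no information
    normal = erf(k / sqrt(2))
    print(f"k = {k}: any distribution >= {chebyshev:.2%}, normal = {normal:.2%}")
```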

If we decide to approximate clinical measurements by a normal curve, we are deciding to use a parametric hypothesis test. A *hypothesis test* asks whether an effect (difference) exists or not, using statistical tests to verify the *hypothesis that there is no difference*. This is the *null hypothesis*, designated H0. The hypothesis that contradicts H0 is the *alternative hypothesis*, written HA. In the null hypothesis we use the words *no difference* or *equal to*, and in the alternative hypothesis we use the words *different from*, *less than* or *greater than*. Strictly speaking, we should say *no statistical difference*, *statistically equal to*, *statistically different from*, *statistically less than* or *statistically greater than*, because we are dealing with the probability that an event happens or not. When we retain HA (equivalently, reject H0), we say the results are significant; when we retain H0 (equivalently, reject HA), we say the results are not significant. Because we are dealing with probabilities, two possible errors arise from the four possible relations between statistical conclusions and real situations, as shown in Table 5.

| Conclusion of the statistical test | Real difference: presence | Real difference: absence |
|---|---|---|
| Results are significant | True | Type I error |
| Results are not significant | Type II error | True |

Table 5. Relations between statistical conclusions and real situations.

The two errors mentioned in the previous paragraph are known as the Type I error and the Type II error. A Type I error leads to a *false positive* conclusion. The probability of such an error is denoted by **α**; mathematically, **α** is a conditional probability: the probability of rejecting H0 when there is no real difference. A Type II error leads to a *false negative* conclusion. The probability of such an error is denoted by **β**; mathematically, **β** is a conditional probability: the probability of retaining H0 when there is a real difference.

Statistical tests are used to estimate the probability of a Type I error. In the literature, we usually use **α** = 0.05. This means we accept a probability of 0.05 of rejecting H0 when there is no real difference between treatments, drugs or procedures. In other words, if a study in which H0 is true were repeated one hundred times, we would *probably* find about five outcomes in which H0 is (wrongly) rejected.
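This interpretation of **α** can be illustrated by simulation: if both groups come from the same distribution, a test carried out at **α** = 0.05 should reject H0 in roughly 5% of repeated studies. The sketch below assumes SciPy is available and uses invented distribution parameters.

```python
import random
from scipy import stats

# Simulating the Type I error rate: both groups are drawn from the SAME normal
# distribution, so every "significant" result is a false positive. With
# alpha = 0.05 we expect roughly 5% of the repetitions to reject H0.
random.seed(1)
alpha, repetitions, false_positives = 0.05, 2000, 0

for _ in range(repetitions):
    group_a = [random.gauss(12.0, 1.5) for _ in range(20)]
    group_b = [random.gauss(12.0, 1.5) for _ in range(20)]
    if stats.ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

print(f"observed Type I error rate: {false_positives / repetitions:.3f}")
```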

There are several tests commonly used in the medical literature; they are summarized in Table 6. Investigators must decide whether to use a parametric or a nonparametric test. This choice depends, for instance, on the purpose of the study, the size of the sample and the type of the variables involved in the study. To use a parametric test we need to guarantee that the sampling distribution is normal or approximately normal. Because the normal distribution has nice mathematical properties (bell-shaped, symmetric, and so on), using a parametric test leads to better statistical results than a nonparametric test. In other words, we say that nonparametric tests are less powerful, in the sense that they give a smaller probability of rejecting H0 when H0 is false.

| To test the statistical significance of the difference between ... | Test | Type |
|---|---|---|
| Two or more proportions | Chi-square | nonparametric |
| Two proportions | Fisher's exact | parametric |
| Two medians | Mann-Whitney | nonparametric |
| Two means | t-Student | parametric |
| More than two means | Kruskal-Wallis (one factor) | nonparametric |
| More than two means | ANOVA (one factor) | parametric |
| More than two means | ANOVA (more factors) | parametric |

Table 6. Statistical tests usually used in the medical literature.

When we use a statistical test we compute a *p-value*. The *p-value* is the probability of obtaining a result as extreme as, or more extreme than, the sample value, assuming that the null hypothesis is true. Depending on the test we use, there is a specific formula for calculating the sample value; appropriate computer software can do such a calculation.

An increasing number of journals require that investigators include p-values in their manuscripts. When p-values are given, we are able to compare this probability with our own decision rule, which is the value of **α**. If the p-value is less than **α**, we say that the results are statistically significant. If the p-value is greater than **α**, we say that the results are not statistically significant.
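As a minimal sketch of this decision rule, assuming SciPy is available, the example below runs a parametric test (Student's t) and its nonparametric counterpart (Mann-Whitney) on two invented groups and compares each p-value with **α**.

```python
from scipy import stats

# Two hypothetical independent groups (invented measurements).
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 13.2, 12.0]
group_b = [10.9, 11.2, 10.5, 11.8, 11.0, 10.7, 11.5, 11.1]
alpha = 0.05

# Parametric choice: Student's t test (assumes approximately normal samples).
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Nonparametric alternative: Mann-Whitney U test (based on ranks).
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

for name, p in (("t test", p_t), ("Mann-Whitney", p_u)):
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: p = {p:.4f} -> statistically {verdict} at alpha = {alpha}")
```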

#### **4.3 Statistical methods for evaluation of biomarkers in myelodysplastic syndrome**

Biostatistics gives us important tools to evaluate biomarkers in myelodysplastic syndrome and other diseases. Quantitative indices, estimates, hypothesis tests and survival tables are useful for pointing out biomarkers. We have already discussed quantitative indices and hypothesis tests; let us make a few comments about estimates and survival tables.

In many situations, populations are so large that it is impossible to describe their central tendency and dispersion by studying 100% of their members, or by studying a portion of the population large enough to justify treating sample statistics as population parameters. In other situations, clinicians may study a new phenomenon with little basis for determining a population parameter. In these cases, we use estimates. Two types of estimates of a population parameter can be used: a point estimate and an interval estimate. A point estimate is a single numerical value of a sample statistic used to estimate the corresponding population parameter. Point estimates are not widely used, because the value of a statistic, such as the sample mean, varies from sample to sample. So, an interval estimate is typically used instead. An interval estimate is a range of values within which the parameter is likely to occur. Interval estimates are also called confidence intervals.
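A minimal sketch of an interval estimate is given below, assuming SciPy is available for the t critical value; the sample values are invented.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

# 95% confidence interval for a population mean from a small sample
# (illustrative values only), based on the t distribution.
sample = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 13.2, 12.0]
n = len(sample)
point_estimate = mean(sample)
t_crit = t.ppf(0.975, df=n - 1)             # two-sided 95% critical value
margin = t_crit * stdev(sample) / sqrt(n)   # stdev() divides by n - 1

print(f"point estimate: {point_estimate:.2f}")
print(f"95% interval estimate: ({point_estimate - margin:.2f}, "
      f"{point_estimate + margin:.2f})")
```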

Survival tables are used to describe prognosis. Prognosis is a prediction of the future course of a disease following its onset (Fletcher et al., 1988). We can describe the prognosis of a disease considering a fixed period of time (measures or rates) or considering varying periods of time (survival tables). Table 7 shows the common measures used to describe prognosis when we consider a fixed period of time.

| Measure | Definition |
|---|---|
| Five-year survival | Percentage of patients who survive for five years from a certain time in the course of the disease |
| Response | Percentage of patients who show some evidence of improvement after a procedure or an intervention |
| Remission | Percentage of patients who start a period in which the disease is not detectable |
| Recurrence | Percentage of patients who present reappearance of the disease after a disease-free period |

Table 7. Common measures that describe prognosis.

Survival tables can handle situations in which patients enter a trial at different times and are followed for varying periods. We usually measure the length of time in a trial in days, weeks or months, and the end point may be, in the MDS case, death or the reappearance of the disease. The usual method for constructing a survival table is the Kaplan-Meier method. The curve obtained from the data presented in a survival table is called a survival curve.
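The product-limit idea behind the Kaplan-Meier method can be sketched in a few lines of Python; the follow-up times and event indicators below are invented, and in practice a dedicated statistical package would be used.

```python
# Follow-up times (months) and event indicators (1 = death/relapse, 0 = censored);
# all values are invented.
times  = [3, 5, 5, 8, 10, 12, 15, 15, 20, 24]
events = [1, 1, 0, 1,  0,  1,  1,  0,  0,  1]

at_risk = len(times)
survival = 1.0
curve = [(0, survival)]
for t in sorted(set(times)):
    deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
    if deaths:
        survival *= (at_risk - deaths) / at_risk   # product-limit step
        curve.append((t, survival))
    at_risk -= sum(1 for ti in times if ti == t)   # drop events and censored at t

for t, s in curve:
    print(f"month {t:>2}: estimated survival = {s:.3f}")
```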

