**1. Introduction**

Anybody examining sudden changes in data needs to ask, "Does this mean what I think it means? Are there other explanations?" Further, evidence offered to overturn a generally held position must have high probative value: it should support the proposed position, address the accepted one, and be convincing both to the investigator and to others.

This paper addresses this issue by presenting a systematic approach that combines the development of probative criteria with error statistical testing, illustrating it with a specific investigation of climate. The approach is developed from previous work on the philosophy of statistics, which is relatively new to climate work [1–6].

Climate, like many areas of natural science, depends heavily on statistical induction for the interpretation of physically based behavior. Many popular statistical tools are generalized tests, framed against broad statistical assumptions that may be challenged by complex physical processes. The implicit assumptions of tests must be considered, as must the linkage between those processes, the accessible data, and statistical models. Where competing alternatives cannot be correctly distinguished by the tests and the specific data chosen for that purpose, the statistical models, or the model selection process, are misspecified with respect to the data.

The particular aspect addressed here is model specification with respect to the data. Probative criteria drawing from theory and interpretations of physical behavior cannot be applied correctly if the tests do not adequately represent those criteria, or distinguish between them.

*Severe Testing and Characterization of Change Points in Climate Time Series*

*DOI: http://dx.doi.org/10.5772/intechopen.98364*

**3. The theoretical mechanistic/statistical inductive (TM/SI) framework**

The TM/SI framework borrows from a strong body of earlier work (e.g. [4, 14, 15, 22, 23]) and was outlined in Section 2 of JR2017 to provide support for reasoning about climate where the scientific debate had been muddied by competing claims from outside the science community.

The theoretical-mechanistic part consists of the physical aspects, components, relationships and measurable quantities.

The statistical-inductive part consists of the process of drawing conclusions about specified hypotheses concerning the system given the physical model, real-world data and statistical tests.

The approach follows Haig [15], and employs the concept of severe testing [3] and, in keeping with it, error-statistical methods [4, 23]. It requires a carefully reasoned matching between scientific hypotheses about the physical world and statistical hypotheses about the observed data.

The goal is to construct a chain of reasoning that ties physical hypotheses, *H1 .. Hn*, to statistical hypotheses, *h1 .. hn*. That is, features of the world map to defined outcomes of statistical tests (preferably one to one). One-to-one mapping meets a requirement of severe testing. Misspecification testing assists this mapping.

**3.1 The TM/SI structure**

Suppes [14] suggested that science employs a hierarchy of models ranging from experimental experience to theory, claiming that theoretical models, high on the hierarchy, are not compared directly with empirical data, which are low on the hierarchy. Rather, he said, they are compared with models of the data, which sit higher than the data on the hierarchy. Following on, Haig describes an egalitarian framework in which three different types of models are interconnected and serve to structure error-statistical inquiry [15].

He describes Primary models, which break a research question into a set of local hypotheses; Experimental models, which "structure the particular models" and link Primary models to Data models; and Data models, which in turn generate and model raw data, and check that the data satisfy the assumptions of experimental models. Although Haig does not fully explain how experimental models "structure the particular models", it seems implicit that they map hypotheses to model components and processes. He leaves to his data models the role of checking that the data meet the assumptions of experimental models.
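The chain from physical to statistical hypotheses described above can be sketched schematically. This is our own illustration, not code from the chapter; all names and decision rules are hypothetical placeholders.

```python
# Schematic sketch (our illustration, not code from the chapter) of the
# TM/SI chain: a Primary model breaks the research question into local
# physical hypotheses H_i; an Experimental model maps each H_i one-to-one
# onto a statistical hypothesis h_i; a Data model prepares raw data and
# checks it against the experimental model's assumptions before testing.

primary_model = {
    "H1": "forced warming is trend-like",
    "H2": "forced warming produces abrupt, step-like shifts",
}

experimental_model = {  # one-to-one map H_i -> h_i (names hypothetical)
    "H1": "h1: step tests on the series find no significant shift",
    "H2": "h2: step tests on the series find a significant shift",
}

def data_model(raw, min_length=30):
    """Screen raw data against an experimental-model assumption
    (here, just a minimum sample size) before any test is run."""
    if len(raw) < min_length:
        raise ValueError("too few observations for a severe test")
    return [float(v) for v in raw]

# The mapping must be one-to-one so that a test outcome licenses a claim
# about exactly one physical hypothesis.
assert set(primary_model) == set(experimental_model)
assert len(set(experimental_model.values())) == len(experimental_model)
```

The one-to-one check at the end is the point of the sketch: if two physical hypotheses mapped onto the same test outcome, passing that test could not severely discriminate between them.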

#### **1.1 Illustrative example: abrupt shifts in climate signals**

A number of publications now address an area of some controversy – the hypothesis that under greenhouse gas-induced radiative forcing, climate changes in a step-like manner [7–12]. The controversy arises because it is almost universally accepted that the forced response of climate, especially global mean surface temperature (GMST), follows forcing rapidly and hence is trend-like, albeit embedded in a very complex "error" process which yields highly structured residuals.

Our paper from 2017 (JR2017) [12] and the PhD thesis of Ricketts (R2019) [13], in addressing this controversy, required the development of automated, reliable and unbiased detection of shifts and, importantly, various means of ensuring that presumptive shifts were not artefacts of the detection method or of the structured residuals.

We built on the concepts of severe testing [3] and misspecification testing [2], and we adapted a framework of models to connect theory and data [14, 15]. Thus we could severely test two propositions: (*H1*) forced warming and natural variability proceed gradually and independently, with the response to forced warming best represented as trend-like; and (*H2*) forced warming and natural variability interact so that patterns of response may project onto modes of climate variability – either one-way as proposed by Corti et al. [16] or two-way as proposed by Branstator [17] – in either case giving rise to abrupt state-like transitions in the signal.

JR2017 showed that *H2* was preferred to *H1* in all six tests of a severe testing regime; R2019 further showed that abrupt shifts relate directly to warming in their extent, frequency and intensity, and more so at finer scales.
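The detection problem can be illustrated with a deliberately minimal sketch. This is our own toy example, not the detection method of JR2017/R2019 (which uses more robust statistics): it scans every admissible candidate change point and keeps the split that maximizes Welch's t-statistic between the two segments.

```python
# Minimal single-shift detector (illustrative only; not the JR2017/R2019
# method). Scans candidate change points and returns the split that
# maximizes Welch's t-statistic between the two segments.
import math
import random

def best_shift(x, min_seg=5):
    """Return (index, t) for the most likely single mean shift in x."""
    best_k, best_t = None, 0.0
    for k in range(min_seg, len(x) - min_seg):
        a, b = x[:k], x[k:]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        se = math.sqrt(va / len(a) + vb / len(b))
        t = abs(ma - mb) / se if se > 0 else 0.0
        if t > best_t:
            best_k, best_t = k, t
    return best_k, best_t

# Synthetic series: 40 points around 0.0, then a 1.0 upward step.
random.seed(42)
series = ([random.gauss(0.0, 0.3) for _ in range(40)]
          + [random.gauss(1.0, 0.3) for _ in range(40)])
k, t = best_shift(series)   # k lands at or very near the true shift (40)
```

A structured "error" process (e.g. strong autocorrelation) can produce spurious maxima in such a scan, which is exactly why the text stresses pairing shift detection with checks that presumptive shifts are not artefacts.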

#### **1.2 Structure of the rest of the paper**

First, we briefly introduce severe testing.

Then we introduce our version of a framework that connects hypotheses about the physical world to statistically based tests that license inductions about models of the world.

Next we spend more time on misspecification testing (M-S), which was proposed as an approach to determining whether the assumptions needed to reliably model the statistical variables are met [2].
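One minimal example of such an assumption check (our own sketch, not a test prescribed in [2]) is a Wald–Wolfowitz runs test on the signs of model residuals: residuals left structured by an unmodelled feature, such as an undetected step, cluster by sign and produce far fewer runs than independent residuals would.

```python
# Sketch of one misspecification (M-S) check: a Wald-Wolfowitz runs test
# on residual signs. Structured residuals (e.g. from fitting a trend to a
# series that actually contains a step) yield too few sign runs, so the
# z-statistic is strongly negative and the iid-noise assumption fails.
import math

def runs_test_z(residuals):
    """Z-statistic of the runs test on the signs of the residuals."""
    signs = [r >= 0 for r in residuals]
    n1 = sum(signs)              # non-negative residuals
    n2 = len(signs) - n1         # negative residuals
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mean) / math.sqrt(var)

# Idealized residuals from a trend fitted across an unmodelled step:
# persistently negative, then persistently positive (only 2 runs).
step_resid = [-0.5] * 30 + [0.5] * 30
z = runs_test_z(step_resid)     # strongly negative: assumption violated
```

A |z| well beyond ~2 signals that the residuals are not exchangeable, so inferences drawn from the fitted model would not be severely tested against this source of error.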

### **2. Severe testing**

Severe testing, proposed by Mayo and Spanos, is based on the intuition that "Data x0 in test T provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false." [3]. They propose that a severity criterion supplies a meta-statistical principle for evaluating statistical inferences (their page 328), where the severity of testing is not assigned to hypothesis H, but to the testing procedure.

In the preface of [6] we read the following, "If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. … A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present."

Severe testing is beginning to be picked up by the climate community (e.g. [18–20]). It was applied to an analysis of optimal fingerprint methods in climatology [18] and to address issues of model tuning in climate projections [20]. Severe testing forms a core methodology of JR2017, R2019, and a conference paper [21] (RJ2017).
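Severity can be made concrete with the textbook normal-mean case (Mayo and Spanos work examples of this form; the specific numbers below are our own). For a test of H0: μ ≤ 0 against H1: μ > 0 with known σ and sample size n, the severity with which the claim "μ > μ1" passes, given observed mean x̄, is SEV = P(X̄ ≤ x̄; μ = μ1): the probability the test would have produced a less impressive result were μ only μ1.

```python
# Worked severity calculation for a one-sided normal-mean test
# (illustrative numbers, not taken from the chapter).
# Test T: H0: mu <= 0 vs H1: mu > 0, sigma = 1 known, n = 25 (se = 0.2).
from statistics import NormalDist

def severity(xbar_obs, mu1, sigma=1.0, n=25):
    """SEV for the claim 'mu > mu1' given the observed sample mean."""
    se = sigma / n ** 0.5
    return NormalDist().cdf((xbar_obs - mu1) / se)

xbar = 0.4                        # observed sample mean, 2 se above 0
sev_weak = severity(xbar, 0.0)    # claim "mu > 0": about 0.977
sev_strong = severity(xbar, 0.4)  # claim "mu > 0.4": 0.5, poorly warranted
```

The same data thus severely warrant the modest claim "μ > 0" while giving essentially no warrant for "μ > 0.4", illustrating why severity attaches to the claim-plus-procedure rather than to the data alone.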
