**2. Data and methodology**

#### **2.1 Study area**

The Yangtze River is the longest and largest river in China, which originates in the Tanggula Mountains, flows through 19 provinces from west to east, and finally flows into the East China Sea. The Yangtze River Basin (YRB; 24–35°N, 90–122°E), with a drainage area of about 1.8 million km<sup>2</sup> , is located in the subtropical and temperate climate zones dominated by the southeast monsoon. In this study, the MLRYRB (25–34°N, 108–122°E; **Figure 1**) is selected as the study area, which is one of the important water sources and an important economic center in East China. The average annual temperature is between 14 and 18°C, and the average annual precipitation is between 1000 and 1400 mm. Affected by factors such as subtropical monsoons and typhoons, the MLRYRB is one of China's heavy rain-prone areas and areas with the highest flood intensity.

#### **2.2 Data**

The data used in this study are obtained from the China Meteorological Data Service Center (http://data.cma.cn/), and there are 62 meteorological stations with the complete sequence of daily rainfall from 1960 to 2020 in the MLRYRB (**Figure 1**). The meteorological stations selected in this study include all types of topography and climate regions in the MLRYRB (**Figure 1**).

To find out the impacts of climate change, the correlations between the extreme rainfall indices and ENSO/WPSH are studied. Here, the Nino3.4 index, which can be obtained from National Center for Atmospheric Research (NCAR) (https://climateda taguide.ucar.edu/climate-data), is used to express the ENSO phenomenon and its changes. The WPSH index, which can be obtained from the National Climate Center of China Meteorological Administration (https://cmdp.ncc-cma.net/Monitoring/cn\_ stp\_wpshp.php), is used to analyze the strong effects of extreme rainfall in the MLRYRB [31, 32].

**Figure 1.** *Location and topography of the MLRYRB in China.*

### **2.3 Methodology**

#### *2.3.1 K-mean clustering*

The topography of the MLRYRB is complex and the Spatiotemporal distributions of rainfall are uneven due to the influence of the monsoon. Therefore, to reveal the steady changing process and regional differences of extreme rainfall more reasonably, the K-mean clustering method is used to identify similar rainfall characteristics and divide the whole study area into sub-regions. The K-mean clustering analysis is suitable for processing large data sets, which is a commonly used non-hierarchical clustering algorithm with high computational efficiency [11, 34]. This method can reduce the impacts of regional differences on extreme rainfall to a certain extent. For the specific calculation process, please refer to the study of Yang et al. [35].

The results of the K-means clustering analysis vary with the initial cluster centers and the number of selected clusters. Therefore, silhouette coefficient (SC) is usually selected to evaluate whether the number of clusters is reasonable. The value range of SC is from -1 to 1. Generally, the larger the SC value, the better the clustering effect. When the SC value is negative, the clustering result is incorrect. When the average SC value reaches the maximum and the number of negative values is the least, it is the optimal number of clustering. For the specific calculation process, please refer to the study of Liu et al. [19].

#### *2.3.2 Bernaola-Galvan segmentation algorithm*

Bernaola-Galvan Segmentation Algorithm (BGSA), a sequence segmentation method proposed by Bernaola-Galvan et al. based on the sliding T-test, is used to detect *Threshold Recognition Based on Non-Stationarity of Extreme Rainfall in the Middle… DOI: http://dx.doi.org/10.5772/intechopen.109866*

the mutation points of the extreme rainfall time series. Compared with traditional methods such as the sliding T/F test, the Mann-Kendall (MK) test, and the rank sum test, this method performs better in dealing with highly nonlinear time series [36].

*X* is a time series with a length of *N*. One of the segmentation points, *i*, slides from the left of the sequence to the right in turn. The average values of the left and right parts of the segmentation point are *μ*1(*i*) and *μ*2(*i*), and their standard deviations are *s*1(*i*) and *s*2(*i*), respectively. Then, the merged deviation *SD*(*i*) at the split point *i* can be expressed as Eq. (1).

$$\mathbf{S}\_{D} = \left(\frac{(N\_1 - \mathbf{1})\mathbf{s}\_1^2 + (N\_2 - \mathbf{1})\mathbf{s}\_2^2}{N\_1 + N\_2 - 2}\right)^{1/2} \mathbf{\dot{\mathbf{s}}} \left(\frac{\mathbf{1}}{N\_1} + \frac{\mathbf{1}}{N\_2}\right)^{\frac{1}{2}} \tag{1}$$

In Eq. (1), *N*<sup>1</sup> and *N*<sup>2</sup> represent the sequence length on the left and right sides of the dividing point *i*, respectively. The difference between the subsequences on both sides of the split point can be represented by the T test statistic *T*(*i*) (Eq. (2)).

$$T(i) = \left| \frac{\mu\_1(i) - \mu\_2(i)}{\mathbf{S}\_D} \right| \tag{2}$$

In Eq. (2), the *T* value represents the difference between the subsequences on both sides of the split point. The statistical significance *P*(*tmax*) corresponding to the maximum *T* is calculated as Eq. (3).

$$P(t\_{\max}) \approx \left\{ \mathbf{1} - I\_{\left[\frac{r}{\left(v + \frac{r^2}{\left(v + \frac{r^2}{\max}\right)}\right)}\right]^\eta} \right\}^\eta \tag{3}$$

In Eq. (3), *Ix*(*a,b*) is an incomplete beta function, where *x* corresponds to *v*/(*v+t*2 *max*), *a* corresponds to *δv*, and *b* corresponds to *δ*. According to the Monte Carlo simulation, *η*=4.19*lnN*-11.54, *δ*=0.40, *v*=*N*-2.

If *P*(*tmax*) is greater than or equal to the threshold of *P*0(0.95)*,* the difference between the average values is considered to be statistically significant. Then, the time series will be split, and the iteration of the above process will continue until the effective value obtained is less than the threshold or the length of the acquired subsequence is less than the minimum length *l0* (*l0* ≥ 25). On the contrary, if *P*(*tmax*) is less than *P*0(0.95), the sequence will not be split.

#### *2.3.3 Distribution fitting*

In the identification of regional extreme rainfall, the POT method is currently used to screen the rainfall sequence, and then the parametric method is used to fit the filtered extreme rainfall sequence. The statistical distribution models used in this study for extreme rainfall research are the GPD and GEV. The distribution functions of GPD and GEV are as follows.

For the GPD in Eq. (4), *k* is the shape parameter, *β* is the threshold, and *α* is the scale parameter.

$$F(\mathbf{x}) = \frac{1}{a} \left[ \mathbf{1} + k \left( \frac{\mathbf{x} - \boldsymbol{\beta}}{a} \right) \right]^{-1 - \frac{1}{k}} \\ k \neq \mathbf{0}, \boldsymbol{\beta} \le \mathbf{x} \le \frac{a}{k} \tag{4}$$

For the GEV in Eq. (5), *μ* is location parameter, *α* is scale parameter, and *k* is shape parameter.

$$F(\mathbf{x}) = \frac{1}{a} \exp\left[ -\left( \mathbf{1} + k\left(\frac{\mathbf{x} - \boldsymbol{\mu}}{a}\right) \right)^{-\frac{\mathbf{1}}{k}} \right] \left( \mathbf{1} + k\left(\frac{\mathbf{x} - \boldsymbol{\mu}}{a}\right) \right)^{-1 - \frac{\mathbf{1}}{k}} k \neq \mathbf{0} \tag{5}$$

Klomogorov-Smirnov test (KS-test) is used to compare the results of different distributions. The larger the *p* value calculated by the KS-test, the better the fitting result. The standard uses a confidence level of 0.05 (*p*=0.05). After obtaining the distribution parameters, the corresponding return period in different years is Eq. (6).

$$R = \beta + \frac{a}{k} \left( \mathbf{1} - \left( \mathbf{y}T \right)^{-k} \right) k \neq \mathbf{0} \tag{6}$$

In Eq. (6), *R* is the rainfall value corresponding to the return period; *y* is the number of rainfalls exceeding the threshold in one year, taking the multi-year average value; *T* is the year of the return period; *β* is the threshold; *α* is the scale parameter; and *k* is the shape parameter.

#### *2.3.4 Cross wavelet analysis*

The cross wavelet analysis proposed by Hudgins et al. is an effective tool family for studying the correlation of time series [37]. Cross wavelet transform (CWT) combines wavelet transform and cross spectrum analysis, and can display two sets of timecorrelated time-domain sequences. For two time series X and Y, their cross wavelet transform can be defined as Eq. (7).

$$\mathcal{W}\_{n}^{\text{xy}}(\mathfrak{s}) = \mathcal{W}\_{n}^{\text{x}}(\mathfrak{s})\mathcal{W}\_{n}^{\text{y}\*}\left(\mathfrak{s}\right) \tag{7}$$

In Eq. (7), *W y\* n(s)* is the complex conjugate of *W y n(s)*, and *s* is the time lag. The cross wavelet power spectrum is defined as |*W xy n(s)*|, and its value indicates the degree of correlation between the two time series.

For two time series *X* and *Y*, the expected spectra are *Px k* and *Py k*, then the cross wavelet power spectrum distribution is expressed as Eq. (8).

$$D\left\{\frac{\mathcal{W}\_n^{\mathbf{x}}(s)\mathcal{W}\_n^{\mathbf{y},\*}(s)}{\sigma\_{\mathbf{x}}\sigma\_{\mathbf{y}}}\right\} = \frac{z\_v(p)}{v}\sqrt{P\_k^{\mathbf{x}}P\_k^{\mathbf{y}}}\tag{8}$$

In Eq. (8), *σ<sup>x</sup>* and *σ<sup>y</sup>* are the standard deviations of the time series *X* and *Y,* respectively. *zv (p)* is the confidence level related to the probability *p*, and *v* is the degree of freedom. The calculation program of cross wavelet analysis can be found in the study of Torrence and Compo [38], and the code can be downloaded from http:// noc.ac.uk/using-science/crosswavelet-wavelet-coherence.
