**4. A case study: using statistical models to evaluate the importance of local factors on glacier changes in Eastern Tian Shan**

In this case study, we aim to investigate the relationship between the local factors and glacier changes since the LIA in the eastern Tian Shan. The study area (42°40′N–44°00′N, 85°40′E–94°50′E) is entirely bounded within Xinjiang, China, and contains several mountain ranges trending west-east (**Figure 1**). Glaciers exist beyond 3000 m a.s.l., where monthly mean temperatures are −16 to −13°C in winter and 3–5°C in summer [87]. We assumed that at the regional scale, the group of glaciers was impacted by climate change at a similar magnitude since the LIA, and the varied behaviour of any individual glacier reflects the influence from non-climatic factors, including the local topography and glacial geometry.

#### **4.1. Data**

over the past 150 years, after the end of the LIA, and many small glaciers have disappeared [32]. Glaciers of different sizes are proven to have strong correlations with the rate of glacier changes, and small glaciers are undergoing stronger retreat compared to large-sized glaciers, e.g. [77–80]. The presence of many small glaciers today are formed due to the rapid disintegration of medium-sized glaciers and are likely to disappear if the warming trend of climate continues [80]. Topographic conditions also matter to the sensitivity and the response time of glaciers to an instantaneous change in climate. Large glaciers that have multiple tributaries and low slope gradient take a longer time to adjust their extent, whereas smaller glaciers respond faster with a higher changing rate; thus, small glaciers are more sensitive to climate change. The influence of the aspect factor can be interpreted in terms of solar radiation receipt, as well as the wind effects to areas with moderate relief [77]. According to the analysis using large data sets from the World Glacier Inventory, Evans [81] measured slope aspect of a total of 66,084 glaciers in 51 regions over the world and found a broadly consistent poleward aspect

A limited number of studies have been conducted in the Tian Shan to examine the relationship between local factors and glacier changes. In northern Kyrgyz Tian Shan, Bolch [78] compared glacier retreats from 1955 to 1999 among several valleys. He mentioned that the heterogeneity of glacier changes is dependent on the size, as well as the climatic regime divided among northern slopes and southern slopes. Our previous study [82] in the central Chinese Tian Shan was the first to examine the importance of topographic factors to the spatial variability of changes in glacier area and equilibrium line altitude (ELA) since the LIA. The results showed that glacier size and mean elevation range are the two key factors and could explain up to 64% of the relative area change, but the ELA change cannot be well explained by the regression of the local factors [82]. Debris cover is another factor that can have either positive or negative impact on surface ablation because it modifies ice melt rates and the heat conduction in the ablation zone [83]. The thickness of the supraglacial debris was considered to be related to the ablation rate [84]. However, how debris cover influences glaciological processes is quite complex and varies case by case [85, 86]. Most glaciers in the central Tian Shan are the debris-

covered type, and glaciers in the eastern Tian Shan are mostly clean-ice type.

the role of local factors played in resulting non-uniform glacier responses.

**local factors on glacier changes in Eastern Tian Shan**

A glacier advances or retreats as a result of both local conditions, i.e. topography and its geometry and climate conditions. Understanding these internal and external factors on glacier changes is helpful to discuss the sensitivity of glacier response and is important to interpret paleoclimatic conditions and future glacier evolution. With nearly uniform changes in climate at local scales, more research needs to be focusing on comprehensively investigating

**4. A case study: using statistical models to evaluate the importance of** 

In this case study, we aim to investigate the relationship between the local factors and glacier changes since the LIA in the eastern Tian Shan. The study area (42°40′N–44°00′N, 85°40′E–94°50′E) is entirely bounded within Xinjiang, China, and contains several mountain

in middle and high latitudes.

46 Glacier Evolution in a Changing World

The glacier change is defined as the ratio of the areal difference between LIA and the modern glaciers to the area of LIA extents. The extent of modern glaciers was obtained from the Second Glacier Inventory of China [33]. The vector shape file of targeted glaciers in the study area contains attributes of glacier area, elevations and many others following the GLIMS standards. The LIA extent of glaciers is based on the manual delineation of the fresh moraines showing similar features with the sites already dated as LIA moraines in the Tian Shan (see Section 2.3 above). Further details about the delineation can be found in Ref. [88]. In total, 640 LIA glacial extents with corresponding 865 modern glaciers are included in the study. Three sub-regions were defined based on their geographical ranges: the Boro-Eren (BE) range, the Bogeda (BG) range and the Karlik (KL) range. The relative area change for each individual glacier is the dependent variable and seven local factors (glacier area, median elevation, slope, aspect, solar radiation, longitude and latitude) are taken into account as independent variables (**Table 1**). The 1-arc second (~30 m) Shuttle Radar Topography Mission (SRTM) DEM was used to calculate topographic factors that are derived based on elevation. The summary statistics of both the dependent and independent variables used in our model are shown in **Table 2**.


**Table 1.** A list of factors considered in the analysis (modified based on **Table 1** in [89]).


**Table 2.** Descriptive summary statistics of variables.

#### **4.2. Methods**

To minimize the possible bias due to the non-Gaussian distribution of the data, a logarithmic transformation was applied to glacier area. To incorporate circular data of aspect into the calculation, the first harmonic of Fourier transformation was used to convert the aspect degrees to the cosine and sine components. Such a transformation on aspect data was suggested in the past statistical analysis of glaciers: the sine term represents the difference along a westeast direction, whereas the cosine term measures north-south differences [81]. The sub-region codes, 'BE', 'BG' and 'KL', were included as a dummy variable in the statistical regression.

Pearson's pairwise correlation was used to examine if each of the independent variables is significantly positively or negatively correlated with the relative glacier area change and with the others. Considering the multicollinearity between some of the independent variables as well as the possible non-linear relationship between our independent variable and the dependent variable, we used the random forest regression model to identify the factors that have the greatest impact on glacier change. Random forest is similar to regression trees in that a hierarchical classification is used to recursively split data into smaller subsets based on dependent variables. The original data set will be partitioned based on the relationships between the dependent and independent variables, forming different splitting conditions (nodes), and the process of partitioning will not stop until a perfect tree is created, or until pre-defined conditions are met for terminating the expansion of the tree. The random forest does not make an assumption regarding the distribution of the data, and as the bootstrapping procedures are used, the predictability of the random forest model is improved meanwhile the possibility of over-fitting is minimized [90]. The random forest model used the difference in mean square error (MSE) for a test sample with and without a certain variable randomly permuted to determine the importance of each variable. The difference is averaged among all the trees generated and is considered a measure of how the predictive ability is reduced when a certain variable is excluded from the model. Traditional theories concerning how the aforementioned local factors interact with glacier change are often based on scientific abduction, with limited field/lab experiment supporting them, as these types of experiments are labour-intensive, expensive, and time-consuming. We used the randomForest package [90] along with other packages in R Language (http://www.r-project.org) to do all statistical analyses in this study.

#### **4.3. Results and discussion**

**4.2. Methods**

**Table 2.** Descriptive summary statistics of variables.

48 Glacier Evolution in a Changing World

To minimize the possible bias due to the non-Gaussian distribution of the data, a logarithmic transformation was applied to glacier area. To incorporate circular data of aspect into the calculation, the first harmonic of Fourier transformation was used to convert the aspect degrees to the cosine and sine components. Such a transformation on aspect data was suggested in the past statistical analysis of glaciers: the sine term represents the difference along a westeast direction, whereas the cosine term measures north-south differences [81]. The sub-region codes, 'BE', 'BG' and 'KL', were included as a dummy variable in the statistical regression.

**Variable N Mean St. Dev. Min. Max.** x 640 88.035 2.512 85.781 94.643 y 640 43.373 0.261 42.975 43.875 solar 640 1254.391 174.661 792.918 1909.040 elev 640 3825.692 172.419 3369 4382 gslope 640 26.562 4.643 13.939 43.036 area 640 1.237 1.602 0.086 17.438 darea 640 49.803 18.823 3.288 95.607 cosa 640 0.629 0.468 −1.000 1.000 sina 640 0.016 0.621 −1.000 1.000

Pearson's pairwise correlation was used to examine if each of the independent variables is significantly positively or negatively correlated with the relative glacier area change and with the others. Considering the multicollinearity between some of the independent variables as well as the possible non-linear relationship between our independent variable and the dependent variable, we used the random forest regression model to identify the factors that have the greatest impact on glacier change. Random forest is similar to regression trees in that a hierarchical classification is used to recursively split data into smaller subsets based on dependent variables. The original data set will be partitioned based on the relationships between the dependent and independent variables, forming different splitting conditions (nodes), and the process of partitioning will not stop until a perfect tree is created, or until pre-defined conditions are met for terminating the expansion of the tree. The random forest does not make an assumption regarding the distribution of the data, and as the bootstrapping procedures are used, the predictability of the random forest model is improved meanwhile the possibility of over-fitting is minimized [90]. The random forest model used the difference in mean square error (MSE) for a test sample with and without a certain variable randomly permuted to determine the importance of each variable. The difference is averaged among all the trees generated and is considered a measure of how the predictive ability is reduced when a certain variable is excluded from the model. Traditional theories concerning how the aforementioned local factors interact with glacier change are often based on scientific abduction, with limited Among nine variables (**Table 2**), the glacier area, the longitude ('x'), the latitude ('y'), the sine aspect ('sina') and cosine aspect ('cosa') are not normally distributed (**Figure 3**). The longitude variable reflects the distribution of glaciers in the study area, and three clusters can be identified. The cosine component of glacier aspect shows a highly skewed distribution, as a lot of glaciers are north-facing in our study area (cosine ~ 1), whereas the sine component shows a rather uniform distribution, suggesting that glaciers in our data set do not show any clustering pattern concerning east-west facings.

Pearson's pairwise correlation shows that the glacier change is highly correlated with glacier area, elevation, longitude, solar radiation, slope and aspect at 0.01 significance level (**Table 3**). For example, the correlation coefficient between the cosine aspect and the solar radiation is as low as −0.75 (*p* < 0.0001), indicating a high collinearity between the variables. For traditional regression models, the highly correlated variables might result in a potential redundancy using the two variables. The receipt of solar radiation already reflects the aspect factor of a glacier: in the northern hemisphere, a north-facing glacier, which has a higher value of cosine aspect, receives less solar energy compared to a south-facing glacier. If both solar radiation and aspect factors are taken into account, the collinearity between the variables might result in biased estimation of the coefficients in traditional regression models.

**Figure 3.** Histograms of the dependent variable ('darea') and independent variables.


**Table 3.** Pearson's correlation matrix.

The random forest regression model calculates an R<sup>2</sup> of 0.541, which suggests that in our study area, 54.1% of the variance in glacier changes can be explained by these local factors. We used 500 trees, the default value for the random forest regression, and the model stabilized when the number of trees exceeds 200 (**Figure 4**). We used a linear regression between the predicted glacier changes calculated from the random forest model and the observed glacier changes to assess the model's predicative ability, and the linear regression had an R<sup>2</sup> of 0.54 (*p* < 0.0001; **Figure 4**). There was no apparent sign of the model over- or under-predicting the glacier change in our study area, and the standard error for our model stayed stationary for different observations.

The relative importance of each independent variable in determining the glacier change is shown in **Figure 4**. The importance of each independent variable in our model is determined by testing how the accuracy of the model will be affected if any individual variable is randomly permuted. The increase in mean squared error (MSE) is calculated for any individual variable when the variable is randomly permuted in the model. The importance of each variable in each tree is calculated, averaged and then normalized using the standard error. Our model suggests that the most important variables affecting glacier changes are elevation and area, which increase the MSE by 60.5 and 52.4%, respectively. Slope, latitude, longitude and region show moderate importance—increasing the MSE by 24.4, 22.6, 22.4 and 19.4%, respectively. Solar radiation is the least important variable controlling glacier change—only increasing the MSE by 19.4%. Elevation and glacier size are two most important factors to control glacier area changes in the random forest model. This result is in good agreement with our previous results built on other statistical models, i.e. multivariate stepwise regression [82] and partial least squares regression [89]. It is commonly recognized that glaciers situated at higher elevations are exposed to colder temperature, thus they tend to recede less; small-sized glaciers tend to be more sensitive to the similar level of climate change compared to larger ones, especially as they expose more proportion of its area under the ELA. Our model ranked the slope as the third most important factor controlling glacier change, followed by latitude, longitude, regional variation, and the solar radiation.

**Figure 4.** (a) Change of error with increased trees, (b) predicted glacier change and (c) variable importance plot in the random forest model.

The random forest model also produces the representation of the hierarchical classifications which result in high or low glacier changes in our study area. The final regression tree created from the random forest predictions is shown in **Figure 5**. Based on the result, the indication of the mechanism driving glacier changes in our study area can be inferred from the split conditions at each node. We only used split conditions that are significant at the 0.05 level in this figure, since the tree can be morphologically complex when every single node is extended. Also, as the number of nodes becomes larger, our capability to interpret each node becomes very limited. **Figure 5** illustrates the split conditions (nodes) in the random forest model. For example, the first split (node 1) occurs at the elevation of 3795 m; for glaciers with elevation either lower or higher than 3795 m, the next split (node 2 and 17) is based on the size of the glacier.

The random forest regression model calculates an R<sup>2</sup>

longitude, regional variation, and the solar radiation.

different observations.

area −0.45\*\*\*

elev −0.53\*\*\* 0.20\*\*\*

50 Glacier Evolution in a Changing World

\*\*\**p* < 0.0001. \*\**p* < 0.01. \**p* < 0.05.

**Table 3.** Pearson's correlation matrix.

x −0.19\*\*\* 0.15\*\*\* 0.22\*\*\*

y −0.03 0.06 −0.19\*\*\* −0.13\*\*

solar −0.45\*\*\* 0.23\*\*\* 0.71\*\*\* 0.29\*\*\* 0.01

gslope 0.38\*\*\* −0.36\*\*\* −0.05 −0.38\*\*\* −0.04 −0.51\*\*\*

cosa 0.16\*\*\* −0.05 −0.60\*\*\* −0.04 −0.05 −0.75\*\*\* 0.00

sina 0.11\*\* −0.04 −0.22\*\*\* 0.01 0.01 −0.10\* −0.16\*\*\* 0.14\*\*\*

study area, 54.1% of the variance in glacier changes can be explained by these local factors. We used 500 trees, the default value for the random forest regression, and the model stabilized when the number of trees exceeds 200 (**Figure 4**). We used a linear regression between the predicted glacier changes calculated from the random forest model and the observed glacier

**darea area elev x y solar gslope cosa**

(*p* < 0.0001; **Figure 4**). There was no apparent sign of the model over- or under-predicting the glacier change in our study area, and the standard error for our model stayed stationary for

The relative importance of each independent variable in determining the glacier change is shown in **Figure 4**. The importance of each independent variable in our model is determined by testing how the accuracy of the model will be affected if any individual variable is randomly permuted. The increase in mean squared error (MSE) is calculated for any individual variable when the variable is randomly permuted in the model. The importance of each variable in each tree is calculated, averaged and then normalized using the standard error. Our model suggests that the most important variables affecting glacier changes are elevation and area, which increase the MSE by 60.5 and 52.4%, respectively. Slope, latitude, longitude and region show moderate importance—increasing the MSE by 24.4, 22.6, 22.4 and 19.4%, respectively. Solar radiation is the least important variable controlling glacier change—only increasing the MSE by 19.4%. Elevation and glacier size are two most important factors to control glacier area changes in the random forest model. This result is in good agreement with our previous results built on other statistical models, i.e. multivariate stepwise regression [82] and partial least squares regression [89]. It is commonly recognized that glaciers situated at higher elevations are exposed to colder temperature, thus they tend to recede less; small-sized glaciers tend to be more sensitive to the similar level of climate change compared to larger ones, especially as they expose more proportion of its area under the ELA. Our model ranked the slope as the third most important factor controlling glacier change, followed by latitude,

changes to assess the model's predicative ability, and the linear regression had an R<sup>2</sup>

of 0.541, which suggests that in our

of 0.54

Low values of glacier retreat occur when the elevation is greater than 3795 m and the area of the glacier is larger than 4.039 km<sup>2</sup> (node 35, glacier change = 20.284%). Larger glaciers that are situated at high altitudes tend to be less sensitive to the impact of climate change, as the majority of the mass of such glaciers are experiencing accumulation, and it is likely that only limited area of such glaciers is experiencing more ablation than accumulation. For glaciers situated at an elevation higher than 3795 m with an area between 2.356 km<sup>2</sup> and 4.039 km<sup>2</sup> , the glacier change is also relatively small (28.926%). For glaciers situated higher than 3795 m but with a smaller area (≤2.356 km<sup>2</sup> ), a smaller slope (≤24.962°) helps limit glacier retreat. However, if the elevation of the glacier is between 3795 m and 3971 m, the latitude of glacier can make a difference in glacier area variation (41.604% for the south of 43.7° and 29.22% for the north of 43.7°.)

Extreme high values of glacier retreat are observed in node 4 (79.112% retreat). This node represents glaciers in the KL region with an elevation lower than 3795 m and smaller than 0.884 km<sup>2</sup> . It should also be noted that even for glaciers located higher than 3795 m, high glacier retreat (67.816%, node 27) is observed. This node represents the glaciers in KL region with an area smaller than 2.356 km<sup>2</sup> , slope greater than 24.96° and elevation lower than 3884 m. The random forest model is also able to pick up at three split conditions (node 3, 13, 26) where the branches with KL region all show greater glacier changes than the other group, although the glaciers in this region show an overall smaller retreat.

**Figure 5.** The representative tree of the random forest model. All split conditions are significant at 0.05 level. Grey boxes represent terminal nodes, and the second value is the predicted glacier change at the node.
