#### **2.1 Descriptive statistical analysis**

Descriptive statistical analysis summarizes data meaningfully without drawing conclusions beyond the data themselves. Measures of central tendency and measures of spread are widely used for this purpose. The former include the mean, median and mode, while the latter include the variance and standard deviation; skewness and kurtosis additionally describe the shape of the distribution. These statistics can be used to describe climate and crop yield data as a preliminary step alongside data quality control.

Data quality control involves approaches used to detect defects and inconsistencies in data sets. Various methods are used to assess the quality of climate data, including linear regression approaches. In one such method, the single mass curve technique, cumulative values of a climate variable are plotted against a linear time scale. The closer the resulting curve is to a straight line, the better the quality of the data. This method is also called the data homogeneity test. Data that fail this test, or that have more than 10 percent missing values, are judged to be of poor quality and unfit for inferential statistics.
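The single mass curve check described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own procedure: the rainfall values are hypothetical, missing values are marked with `None`, and using R² of a straight-line fit as the measure of linearity is an assumption, since the text does not fix a numerical criterion.

```python
def single_mass_curve(values):
    """Running (cumulative) total of a series; missing values (None) add nothing."""
    total = 0.0
    curve = []
    for v in values:
        if v is not None:
            total += v
        curve.append(total)
    return curve


def r_squared_vs_time(curve):
    """R^2 of a straight-line fit of the cumulative curve against the time index."""
    n = len(curve)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    c_mean = sum(curve) / n
    sxy = sum((ti - t_mean) * (ci - c_mean) for ti, ci in zip(t, curve))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    syy = sum((ci - c_mean) ** 2 for ci in curve)
    return (sxy * sxy) / (sxx * syy)


def missing_fraction(values):
    """Share of observations that are missing."""
    return sum(v is None for v in values) / len(values)


# Hypothetical annual rainfall totals (mm); None marks a missing year.
rainfall = [820.0, 790.0, None, 805.0, 840.0, 815.0,
            798.0, 830.0, 810.0, 825.0, 800.0, 818.0]

curve = single_mass_curve(rainfall)
linearity = r_squared_vs_time(curve)          # closer to 1 means closer to a straight line
usable = missing_fraction(rainfall) <= 0.10   # the 10 percent rule from the text
```

In practice the curve would also be inspected visually, since a single break in slope can be hidden by a high overall R².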

#### **2.2 Crop yield modeling and prediction**

Statistical models have been applied in predicting crop yield, and their ability to accurately predict yield responses to changes in mean temperature and precipitation has been assessed using process-based crop models. Prototype models include the Crop Environment Resource Synthesis (CERES) models, which can be applied to a crop to simulate the corresponding yield and to project future yield responses, with their usefulness being higher at broader spatial scales [16]. *Mechanistic* models are also used alongside statistical models to predict crop yields [17]. Crop Yield Simulation and Land Assessment Models (CYSLAM) are applied to model the interaction of environmental variables, physiological responses, inputs, yields and land management in mechanistic simulations of crop yield [17].

Yields constrained by radiation and temperature are initially estimated within 10-day periods (dekads), so that effective rainfall, evapotranspiration, percolation and soil moisture can be accounted for. This is followed by a simulation of the crop/soil water balance through the crop growth cycle, accounting for periods of moisture stress, and consequently by the estimation of crop yield [17]. The moisture-dependent yield is then adjusted for nutrient supply, toxicities and drainage conditions of the soil [17]. The modules for radiation- and temperature-limited yield, moisture-limited yield and nutrient-limited yield are, however, validated separately against historical crop data.

#### *2.2.1 Multilinear regression yield estimation*

The single mass curve technique is used for data quality control, with cumulative values of the data plotted against a temporal scale. The nature and variability of the climate elements are then characterized using the mean, skewness, standard deviation, Student's t-test and correlation analysis. Trend is assessed by dividing the data into two sets of equal length and testing the difference between the means of the two sets with the t-test [4].
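The split-sample trend test can be illustrated as follows. This is a sketch under stated assumptions: the temperature series is hypothetical, the pooled-variance (equal-variance) form of Student's t-test is assumed because the text does not say which form was used, and the statistic is compared with a tabulated critical value rather than converted to a p-value.

```python
from statistics import mean, variance


def split_half_t(series):
    """Student's t statistic for the difference in means of the two halves of a series.

    Returns (t, degrees_of_freedom). If the series has odd length,
    the middle value is left out so that both halves are equally long.
    """
    half = len(series) // 2
    if half < 2:
        raise ValueError("need at least four observations")
    first, second = series[:half], series[-half:]
    # With equal sample sizes the pooled variance is the average of the two sample variances.
    pooled_var = (variance(first) + variance(second)) / 2
    standard_error = (pooled_var * 2 / half) ** 0.5
    t = (mean(second) - mean(first)) / standard_error
    return t, 2 * half - 2


# Hypothetical mean annual temperatures (deg C) for ten consecutive years.
temps = [20.1, 20.3, 19.9, 20.2, 20.0, 20.8, 21.0, 20.9, 21.1, 20.7]
t_value, dof = split_half_t(temps)

# The tabulated two-sided 5% critical value of Student's t for 8 degrees of freedom is about 2.306.
trend_detected = abs(t_value) > 2.306
```

A statistic well beyond the critical value, as in this made-up series, indicates that the two halves have different means, which is read as evidence of a trend.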

#### *Agrometeorology*

The relationship between crop yield and variations in climatic elements is examined by correlation analysis, which measures the degree of relationship between at least two variables. Pearson's correlation coefficient (r) is used to determine the correlation between the climate elements and the crop yield according to expression (1):

$$r = \frac{\sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y})}{\left[\sum_{i=1}^{N} (x_i - \overline{x})^2 \sum_{i=1}^{N} (y_i - \overline{y})^2\right]^{1/2}} \tag{1}$$

where

$N$ => Total number of observations

$\overline{x}$ => Mean of the variable $x$

$\overline{y}$ => Mean of the variable $y$
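Expression (1) translates directly into code. The sketch below is illustrative only; the function name `pearson_r` and the rainfall and yield figures are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient, computed term by term as in expression (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sqrt(sum((xi - x_bar) ** 2 for xi in x)
                       * sum((yi - y_bar) ** 2 for yi in y))
    return numerator / denominator


# Hypothetical seasonal rainfall (mm) and crop yield (t/ha) for five seasons.
rainfall = [700, 800, 900, 1000, 1100]
crop_yield = [1.8, 2.1, 2.3, 2.8, 3.0]
r = pearson_r(rainfall, crop_yield)
```

The coefficient lies between -1 and 1; values near either extreme indicate a strong linear association between the climate element and yield.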

Hypothesis testing is employed to assess the statistical significance of the degree of association in (1). The null hypothesis is that the correlation is zero, and the alternative hypothesis is that the correlation is nonzero. If the null hypothesis is valid, the test variable (t) from Eq. (2) is a realization of a Student's t random variable with mean zero and n - 2 degrees of freedom, where n is the number of observations (N in Eq. (1)). P-values are computed: p < 0.05 leads to rejection of the null hypothesis, and vice versa. The Student's t-statistic is given by Eq. (2):

$$t = r\sqrt{\frac{n-2}{1-r^2}}\tag{2}$$
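A minimal sketch of the significance test in Eq. (2) follows. The p-value is obtained by numerically integrating the Student's t density with Simpson's rule, which is an implementation choice made here to stay within the Python standard library, not something prescribed by the text. The helper names and the example values of r and n are hypothetical.

```python
from math import gamma, pi, sqrt


def t_statistic(r, n):
    """Test statistic of Eq. (2) for a correlation r based on n observations."""
    return r * sqrt((n - 2) / (1 - r * r))


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|].

    `steps` must be even. math.gamma overflows for very large df,
    so this sketch suits records of up to a few hundred observations.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3            # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Hypothetical case: r = 0.62 from n = 30 seasons, so n - 2 = 28 degrees of freedom.
n = 30
t_value = t_statistic(0.62, n)
p_value = two_sided_p(t_value, n - 2)
reject_null = p_value < 0.05
```

Statistical packages such as R or SYSTAT report this p-value directly; the hand-rolled integration is only meant to make the calculation visible.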

*Prediction of Crop Yields under a Changing Climate DOI: http://dx.doi.org/10.5772/intechopen.94261*

Multiple linear regression analysis yields models that involve one dependent variable and more than one independent variable. The resulting analytical model is used to predict crop yield from climatic elements at various time lags.

This relationship is given by Eq. (3):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon \tag{3}$$

where the $\beta$s are coefficients, the $X_i$ are the predictors, $Y$ is the crop yield (predictand), $\beta_0$ is a constant and $\varepsilon$ is the error term.

Climate variables are the independent variables, while yield is the dependent variable. The data are imported into statistical analysis software (SYSTAT or R), where regression analysis is carried out to obtain the coefficients $\beta_0, \beta_1, \dots, \beta_k$. These values are fitted into a multiple regression model of the form (3).
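To make the fitting step concrete, the sketch below estimates the coefficients of Eq. (3) by ordinary least squares, solving the normal equations with Gaussian elimination. It is a standard-library illustration of what packages such as R or SYSTAT do internally, not the chapter's own code. The function names and the climate and yield values are hypothetical.

```python
def fit_mlr(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] for Y = b0 + b1*X1 + ... + bk*Xk."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(col + 1, p):
            factor = A[k][col] / A[col][col]
            for c in range(col, p + 1):
                A[k][c] -= factor * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - s) / A[i][i]
    return beta


def predict(beta, x_row):
    """Predicted yield for one set of predictor values."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], x_row))


# Hypothetical data: [seasonal rainfall (mm), mean temperature (deg C)] and yield (t/ha).
climate = [[820, 18.5], [760, 19.1], [905, 18.0],
           [690, 19.8], [850, 18.7], [780, 19.0]]
yields = [2.6, 2.3, 3.0, 1.9, 2.7, 2.4]

beta = fit_mlr(climate, yields)
forecast = predict(beta, [800, 18.8])
```

The `ValueError` raised for a singular system corresponds to perfect multicollinearity, in which case the coefficients cannot be estimated uniquely.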

The regression model for predicting crop yield is arrived at through a series of refinement steps. The initial step includes all climate variables specified in the data file. Climate variables with a p-value less than 0.05 are judged to be statistically significant in the model at the 95% confidence level. The second step is a repeat of the first one, excluding the non-significant variables from the data file. A model containing only the statistically significant climate variables is then specified and adopted.
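The screening between the first and second steps amounts to a simple filter on the coefficient p-values reported by the regression software. In the sketch below, the variable names and p-values are hypothetical stand-ins for that output.

```python
ALPHA = 0.05  # significance level corresponding to the 95% confidence level


def screen_predictors(p_values, alpha=ALPHA):
    """Split predictors into those kept (p < alpha) and those dropped."""
    kept = [name for name, p in p_values.items() if p < alpha]
    dropped = [name for name in p_values if name not in kept]
    return kept, dropped


# Hypothetical coefficient p-values from the first-step model.
first_step = {"rainfall": 0.003, "max_temp": 0.021, "min_temp": 0.340, "humidity": 0.048}
kept, dropped = screen_predictors(first_step)
# The second-step model is then refitted using only the variables in `kept`.
```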

The third step entails plotting the model residuals, with keen interest in the normal Q-Q plot, to detect outliers. The residuals are the error terms, that is, the differences between the observed and predicted values of the dependent variable. Where outliers are detected, the model has to be "re-built" without them. The final outlier-free model is specified under the following key assumptions: homoscedasticity of the residuals (equal variance); normality of the residuals; leverage, based on the distance of points from the center and on Cook's distance; positive variance; and non-perfect multicollinearity. Homoscedasticity is checked with a scatter plot and assumes that the residuals are evenly spread. Normality assumes that the residuals follow a normal distribution. Cook's distance gives an indication of influential data points that are worth checking for validity. Non-perfect multicollinearity occurs when one of the regressors is highly correlated with, but not equal to, a linear combination of the other regressors.
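The residual, leverage and Cook's distance calculations can be sketched for the simplest case of a single predictor. This is a deliberate simplification of the multi-predictor model in the text, and the data are hypothetical; with several predictors the leverages come from the diagonal of the hat matrix, which statistical packages report directly.

```python
def simple_fit_diagnostics(x, y):
    """Residuals, leverages and Cook's distances for a one-predictor least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2                                             # parameters: intercept and slope
    mse = sum(e * e for e in residuals) / (n - p)     # residual variance estimate
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [(e * e / (p * mse)) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return residuals, leverage, cooks


# Hypothetical data in which the last observation departs from the trend of the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 25.0]

residuals, leverage, cooks = simple_fit_diagnostics(x, y)
most_influential = max(range(len(cooks)), key=cooks.__getitem__)
```

A point with a much larger Cook's distance than the others is a candidate for validity checking before the model is rebuilt.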

A contingency table can be used to verify the model. The data are split into two sets, one of which is used to train the model in a statistical analysis tool (e.g. SYSTAT) while the other is used for verification. Model verification statistics, including percent correct (PC), post agreement (PA), false-alarm ratio (FAR), critical success index (CSI), probability of detection (POD), bias and Heidke's skill score (HSS), are then determined.

The methodology was applied by [4] in their assessment of crop yield over Nandi East Sub-County in Kenya.
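The verification statistics listed above can be computed from a contingency table as sketched below. The sketch assumes a 2×2 table (event forecast or not, event observed or not) and the usual textbook definitions of the scores; the source does not state the table dimensions or the exact formulas used in [4], and the counts are hypothetical.

```python
def verification_scores(hits, false_alarms, misses, correct_negatives):
    """Common forecast verification scores from a 2x2 contingency table."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    return {
        "PC": (a + d) / n,            # percent correct (as a fraction)
        "PA": a / (a + b),            # post agreement
        "FAR": b / (a + b),           # false-alarm ratio
        "CSI": a / (a + b + c),       # critical success index
        "POD": a / (a + c),           # probability of detection
        "bias": (a + b) / (a + c),    # forecast events relative to observed events
        "HSS": 2 * (a * d - b * c)
               / ((a + c) * (c + d) + (a + b) * (b + d)),  # Heidke skill score
    }


# Hypothetical verification counts, e.g. "above-average yield" forecast versus observed.
scores = verification_scores(hits=12, false_alarms=3, misses=4, correct_negatives=11)
```

For these counts PC is 23/30, POD is 0.75 and FAR is 0.2; an HSS above zero indicates skill beyond what chance agreement would give.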

#### **3. Conclusion and recommendation**

This book chapter described a basic regression approach applied to predicting crop yield in a changing climate. Introductory concepts of descriptive statistics, data quality control, correlation analysis and multilinear modeling were discussed, and a typical regression method for estimating yield was examined. The study concludes that crop yield prediction and estimation is marred by uncertainties of both natural and anthropogenic nature and requires continuous improvement, with more focus on externalities that affect crop yield. The study recommends hybrid models that are both statistical and mechanistic, integrated by neural network technology and based on multiple variables of climatic and crop-physiological importance.

### **Acknowledgements**

We acknowledge the contribution of the late Professor Joseph Mwalichi Ininda of the Department of Meteorology, University of Nairobi, for technical support.

#### **Thanks**

Many thanks to the Office of the Deputy Vice Chancellor, PPRI, Kibabii University, for much support.

