**4. Critical discussion of multivariate statistical methods**

Multivariate methods are subject to some statistical restrictions that cannot be resolved easily. The simplest situation is the general linear model, in which a response variable y depends on one or more predictor variables x1, x2, …, xk.

In the case of a cross-classified two-way analysis of variance (equal subclass numbers):

$$y_{ijk} = \mu + a_i + b_j + w_{ij} + e_{ijk}, \quad (i = 1, \dots, a;\ j = 1, \dots, b;\ k = 1, \dots, n) \tag{1}$$

Application of Multivariate Data Analyses in Waste Management 31


Here μ is the general mean, ai are the main effects of factor A, bj are the main effects of factor B, wij are the interactions between Ai and Bj, and eijk are the random error terms.
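The decomposition in model (1) can be illustrated with a minimal sketch for balanced data. The function name `two_way_effects` and the measurements are hypothetical and serve only to show how the general mean, main effects, interactions and error terms add up to each observation.

```python
from statistics import mean

def two_way_effects(y):
    """Estimate the terms of model (1) for balanced data.

    y[i][j] is the list of the n replicate observations for level i of
    factor A and level j of factor B (equal subclass numbers assumed).
    """
    na, nb = len(y), len(y[0])
    # Cell means for every combination of the two factors.
    cell = [[mean(y[i][j]) for j in range(nb)] for i in range(na)]
    mu = mean(cell[i][j] for i in range(na) for j in range(nb))
    # Main effects as deviations of the row and column means from mu.
    a = [mean(cell[i]) - mu for i in range(na)]
    b = [mean(cell[i][j] for i in range(na)) - mu for j in range(nb)]
    # Interactions: what the cell mean contains beyond mu + a_i + b_j.
    w = [[cell[i][j] - mu - a[i] - b[j] for j in range(nb)] for i in range(na)]
    # Error terms: deviation of each observation from its cell mean.
    e = [[[obs - cell[i][j] for obs in y[i][j]] for j in range(nb)]
         for i in range(na)]
    return mu, a, b, w, e

# Hypothetical measurements: 2 levels of A, 3 levels of B, n = 2 replicates.
data = [
    [[10.0, 12.0], [14.0, 16.0], [20.0, 22.0]],
    [[11.0, 13.0], [19.0, 21.0], [21.0, 23.0]],
]
mu, a, b, w, e = two_way_effects(data)
print("general mean:", round(mu, 3))
print("main effects A:", [round(v, 3) for v in a])
print("main effects B:", [round(v, 3) for v in b])
```

The sketch only estimates the terms; it does not perform the F-tests of an analysis of variance.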

In the case of multiple linear regression:

$$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \dots + \beta_k x_{kj} + e_j, \quad (j = 1, \dots, n) \tag{2}$$

yj is the j-th value of y depending on the j-th values x1j, …, xkj;

ej are error terms with E(ej) = 0, var(ej) = σ² (for all j), and cov(ej', ej) = 0 for j' ≠ j.

The simple case assumes a linear dependency. The model coefficients can be estimated, and y can then be predicted for given values x1, …, xk. Assuming that the ej are normally distributed, confidence intervals can be calculated for each model coefficient, and hypotheses about the model coefficients can be tested. In this way each variable can be tested for whether its influence on y differs significantly from 0. The type I and type II error probabilities can be stated. Furthermore, optimal designs for experiments and surveys can be calculated [89]. Several assumptions are typically made regarding the distribution of the populations and regarding homoscedasticity. Extreme values and outliers are a critical problem, especially in environmental measurements. Increasing the number of regressors or factors also increases the error terms.
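As a minimal sketch of the estimation step for model (2), the following code fits the coefficients by ordinary least squares and derives a standard error and t statistic for each of them. The helper names `invert` and `ols` and the data are hypothetical; the normal equations are solved with a small hand-written Gauss-Jordan routine so that only the standard library is needed.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (regressors are collinear)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(x_rows, y):
    """Fit model (2); return coefficients, standard errors and t statistics."""
    X = [[1.0] + list(r) for r in x_rows]          # leading 1 for the intercept
    n, p = len(X), len(X[0])
    xtx = [[sum(X[j][a] * X[j][b] for j in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[j][a] * y[j] for j in range(n)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [y[j] - sum(beta[a] * X[j][a] for a in range(p)) for j in range(n)]
    s2 = sum(r * r for r in resid) / (n - p)       # estimate of sigma^2
    se = [math.sqrt(s2 * inv[a][a]) for a in range(p)]
    t_stats = [beta[a] / se[a] for a in range(p)]
    return beta, se, t_stats

# Hypothetical data: y depends on x1, while x2 has no real influence.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [3.1, 4.8, 7.05, 9.1, 10.9, 13.05]
beta, se, t_stats = ols(x_rows, y)
for name, b, s, t in zip(["b0", "b1", "b2"], beta, se, t_stats):
    print(f"{name}: estimate={b:.3f}  se={s:.3f}  t={t:.2f}")
```

Under the normality assumption each t statistic would be compared with a quantile of the t distribution with n - k - 1 degrees of freedom. That lookup is left out here; a statistics library (for example `scipy.stats.t.ppf`) could provide it.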

For some univariate models, robust and powerful alternatives that relax the distribution assumptions and the homoscedasticity assumption already exist [90-92]. For cross classification there is still no satisfactory, powerful alternative. Many methods with multiple regressors (multiple regression models, logistic regression models, discriminant analysis, cross classification models) require the regressor variables to be independent of one another.

In chemometrics some of these problems are highly relevant. Usually the number of regressor variables exceeds the number of samples, which rules out most of the common oligovariate models. Many of the regressor variables are also highly collinear. For these reasons dimension reduction methods such as correspondence analysis or factor analysis are used. The new factors obtained by factor analysis are strictly independent of one another and can therefore be used in conventional models. There are several ways to extract these factors, such as Principal Components or Maximum Likelihood. Discrete variables can be modelled by classification by means of cluster analysis; the resulting clusters can later be tested using contingency tables.

Both steps (factor analysis and cluster analysis) lead to descriptive variables of the data set. Like all descriptive methods in statistics, they do not serve as tests against the hypothesis of pure chance, and they provide no risk assessment of the results. Testing the new descriptive variables requires an understanding of these variables. Inspecting the loadings of the original variables on the new variables sometimes makes the interpretation easy. Models with these variables can then be established (PCR or PLS-R) and characterised by several quality parameters (e.g. correlation coefficient).

A test of significance for the cross-validated r² was presented by Wakeling and Morris [93], who tabulated critical values of r² occurring by chance alone for one- to three-dimensional models at a significance level of 5 %, based on Monte Carlo simulations. A comparable method was used by Stahle and Wold [94] to develop a polynomial approximation of the test statistic for the two-class problem as a function of the number of objects, the number of variables, the percentage of variance explained by the first component in X and the percentage of missing values.

$$\text{cvd}/\text{sd} = \sqrt{\text{PRESS}/\text{RSS}} \tag{3}$$

Here cvd denotes the cross-validated deviances, sd the standard deviation, PRESS the prediction error sum of squares and RSS the residual sum of squares.
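The quantities in Eq. (3) can be illustrated with a minimal sketch. It uses a one-predictor least-squares line as a stand-in for a one-component model and computes PRESS by leave-one-out cross-validation. The text does not state which model the RSS refers to; the sketch assumes the model with one component fewer, which here is the mean-only model. All function names and data are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def press_loo(x, y):
    """Prediction error sum of squares from leave-one-out cross-validation."""
    total = 0.0
    for i in range(len(x)):
        # Refit without sample i, then predict the left-out sample.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

def rss_mean_only(y):
    """Residual sum of squares of the reference model (assumed: mean only)."""
    my = sum(y) / len(y)
    return sum((yi - my) ** 2 for yi in y)

def cvd_over_sd(x, y):
    """Square root of PRESS / RSS as in Eq. (3), under the assumptions above."""
    return math.sqrt(press_loo(x, y) / rss_mean_only(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_related = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]   # clear linear trend
y_unrelated = [5.0, 3.0, 6.0, 4.0, 5.0, 3.0, 6.0, 4.0]     # no trend in x

ratio_related = cvd_over_sd(x, y_related)
ratio_unrelated = cvd_over_sd(x, y_unrelated)
print("related:  ", round(ratio_related, 3))
print("unrelated:", round(ratio_unrelated, 3))
```

Under this convention a ratio well below 1 indicates that the added predictor improves prediction of left-out samples, while a ratio above 1 indicates that it does not. Whether a given value is significant still requires critical values such as those discussed above.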

Unfortunately, hypotheses about the regression coefficients still refer to the new components and provide no results regarding the original variables. There is no statistical means of proving whether the extraction method is optimal. Other methods of dimension reduction are already in use (e.g. Boosting, Random forest). Robust alternatives for PLS-R are also available [95].
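The point about component-based coefficients can be made concrete with a minimal principal component regression sketch. It extracts the first principal component by power iteration (chosen here only to avoid external libraries), regresses y on the component scores and then back-calculates the implied coefficients of the original variables. The data, the helper names and the restriction to one component are assumptions made for illustration.

```python
import math

def center_columns(rows):
    """Subtract the column mean from every column of a row-major matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[r[k] - means[k] for k in range(p)] for r in rows]

def first_component(X, n_iter=200):
    """Leading principal component of centered X by power iteration on X'X."""
    n, p = len(X), len(X[0])
    load = [1.0 / math.sqrt(p)] * p                     # starting loading vector
    for _ in range(n_iter):
        t = [sum(X[j][k] * load[k] for k in range(p)) for j in range(n)]
        w = [sum(X[j][k] * t[j] for j in range(n)) for k in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        load = [c / norm for c in w]
    scores = [sum(X[j][k] * load[k] for k in range(p)) for j in range(n)]
    return load, scores

# Hypothetical data: 4 samples, 6 highly collinear variables (more
# variables than samples), all driven by one latent gradient.
latent = [-1.5, -0.5, 0.5, 1.5]
scales = [1.0, 0.8, 1.2, 0.9, 1.1, 0.7]
raw = [[c * tj + 0.01 * ((j + k) % 3 - 1) for k, c in enumerate(scales)]
       for j, tj in enumerate(latent)]
y = [-4.4, -1.6, 1.4, 4.6]

X = center_columns(raw)
y_mean = sum(y) / len(y)
yc = [v - y_mean for v in y]

loadings, scores = first_component(X)
# Regression of y on the component: a single coefficient for the new variable.
gamma = sum(s * v for s, v in zip(scores, yc)) / sum(s * s for s in scores)
fitted = [gamma * s for s in scores]
# Implied coefficients of the original variables (arithmetic only).
implied = [gamma * l for l in loadings]

rss = sum((v - f) ** 2 for v, f in zip(yc, fitted))
tss = sum(v * v for v in yc)
r2 = 1.0 - rss / tss
print("coefficient on component:", round(gamma, 3))
print("r2:", round(r2, 4))
```

A significance test in this setting would apply to `gamma`, the coefficient of the component. The values in `implied` follow arithmetically from the loadings, and the sketch supplies no test for them individually.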

As long as there are no satisfactory testing routines, the results of the presented multivariate methods have to be interpreted very carefully. There is an inherent risk of overinterpretation, especially when using descriptive methods such as PCA or cluster analysis. The error probability of the results is not defined. This means that whatever interpretation of the resulting plot is made, it could be pure coincidence, and there is no information about that risk. The only way to overcome these problems is to analyse a large number of samples and, in the case of regression models, to validate the models.
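One generic way to obtain a rough chance reference for a validated regression model is a permutation check: the response values are shuffled many times, and the cross-validated r² of the real data is compared with the values obtained for the shuffled data. The sketch below is not the tabulation procedure of the cited papers; the one-predictor model, the function names, the number of permutations and the data are assumptions made for illustration.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def q2_loo(x, y):
    """Leave-one-out cross-validated r2: 1 - PRESS / total sum of squares."""
    my = sum(y) / len(y)
    tss = sum((yi - my) ** 2 for yi in y)
    press = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    return 1.0 - press / tss

def permutation_check(x, y, n_perm=500, alpha=0.05, seed=0):
    """Compare the observed q2 with q2 values of randomly permuted responses."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    observed = q2_loo(x, y)
    null = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)                   # destroys any x-y relationship
        null.append(q2_loo(x, y_perm))
    null.sort()
    # Empirical (1 - alpha) quantile of the chance distribution.
    critical = null[min(n_perm - 1, int((1 - alpha) * n_perm))]
    # Share of permutations at least as good as the real data (with +1 correction).
    p_value = (1 + sum(q >= observed for q in null)) / (n_perm + 1)
    return observed, critical, p_value

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 17.9, 20.2]
observed, critical, p_value = permutation_check(x, y)
print("observed q2:", round(observed, 3))
print("chance threshold:", round(critical, 3))
print("p-value:", round(p_value, 4))
```

Such a check covers a single fitted regression model only. It does not provide an error probability for the interpretation of descriptive results such as PCA score plots or cluster dendrograms.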

### **5. Summary**

30 Multivariate Analysis in Management, Engineering and the Sciences


In waste management research and practice, huge data sets are often required for statistical evaluation in order to verify findings. This requirement concerns both the natural-scientific and the logistic fields of waste management. Huge data sets can be generated on the one hand by large numbers of investigated parameters and samples and on the other hand by modern analytical methods such as spectroscopic and chromatographic methods or thermal analysis.

Multivariate data analysis can help to explore data structures of the investigated samples. Another advantage is that the results can be displayed graphically. Furthermore, validated models can serve as adequate evaluation tools for practical application. Various software packages are available for developing such evaluation tools.

In this study the most important multivariate data analysis methods applied in waste management were described in detail and documented by a literature review. It could be demonstrated that Principal Component Analysis (PCA) and Partial Least Squares Regression (PLS-R) are the most frequently applied methods in waste management. PCA was used to find hidden data structures, groupings and interrelationships of data. In most cases PLS-R was applied to predict parameters using new analytical instruments that allow faster and cheaper analyses.


In general it can be stated that multivariate data analysis was successfully applied in all experiments. Several authors compared different multivariate methods to determine which one provided the best results. Depending on the data set and the question to be answered the appropriate method must be identified.
