84 Soil Fertility

and is thus redundant. Therefore, a ratio approach conveys *D*-1 degrees of freedom or linearly independent balances for a *D*-part composition [24].

In contrast, there are *D*×(*D*-1)/2 dual ratios such as the K/Mg ratio and *D*×(*D*-1)²/2 two-component amalgamated ratios such as the K/(Ca+Mg) ratio that can be derived from a *D*-part composition. Most information on dual and two-component amalgamated ratios is thus redundant and the dataset is artificially inflated. In Figure 1, the number of (a) dual and (b) two-component amalgamated ratios is plotted against the number of components. With 10 components, one may compute up to 45 dual and 405 two-component amalgamated ratios, hence generating a "redundancy bubble" that inflates rapidly above *D*. [25] elaborated the Diagnosis and Recommendation Integrated System (DRIS) to synthesize the *D*×(*D*-1)/2 dual ratios into *D* nutrient indices adding up to zero; therefore, there is still one redundant index closing the system to zero and computable from the other indices. Applying Ockham's razor (the law of parsimony) to compositional data, nine degrees of freedom suffice to fully describe a 10-part composition without bias [24].

**Figure 1.** Number of (a) dual and (b) two-component amalgamated ratios illustrating the redundancy bubble.

**2. Theory of CoDa**

To solve problems related to nutrient diagnosis in soil and plant sciences, one must first recognize that soil and plant analytical data are most often compositional, i.e. strictly positive data (concentrations, proportions) related to each other and bounded to some whole [26]. Compositional data have special numerical properties that may lead to wrong inferences if not transformed properly. Log-ratio transformations have been developed to avoid numerical biases [26, 29, 30, 31]. The balance concept presented in this chapter is based on log ratios or contrasts. Balances are computed rather simply from compositions using the isometric log ratio (*ilr*) transformation.
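The redundancy counts quoted above for a *D*-part composition can be checked with a few lines of code. The sketch below simply evaluates the two formulas given in the text, *D*×(*D*-1)/2 and *D*×(*D*-1)²/2, next to the *D*-1 degrees of freedom; the function names are hypothetical and chosen for illustration only.

```python
def dual_ratio_count(d: int) -> int:
    """Number of dual ratios, D(D-1)/2, as stated in the text."""
    return d * (d - 1) // 2


def amalgamated_ratio_count(d: int) -> int:
    """Number of two-component amalgamated ratios, D(D-1)^2/2, as stated in the text."""
    return d * (d - 1) ** 2 // 2


def degrees_of_freedom(d: int) -> int:
    """Number of linearly independent balances in a D-part composition."""
    return d - 1


# Compare the number of candidate ratios with the D-1 balances that suffice.
for d in (5, 10, 15):
    print(d, dual_ratio_count(d), amalgamated_ratio_count(d), degrees_of_freedom(d))
```

For *D* = 10 this reproduces the 45 dual ratios and 405 two-component amalgamated ratios cited in the text, against only nine independent balances.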

Because a change in any proportion of a whole reverberates on at least one other proportion, proportions of components of a closed sum (100%) are interdependent. Therefore, a compositional vector is intrinsically multivariate: its components cannot be analyzed and interpreted without relating them to each other [32,33]. Compositional data (CoDa) induce numerical biases, such as self-redundancy (one component is computable by difference between the constrained sum of the whole and the sum of other components), non-normal distribution (the Gaussian curve may range below 0 or beyond 100%, which is conceptually meaningless) and scale dependency (correlations depend on measurement scale). Redundancy can be controlled by carefully removing the extra degree of freedom in the *D*-part composition. Scale dependency is controlled by ratioing components after setting the same scale (e.g. fresh mass, dry mass or organic mass basis) or unit of measurement (e.g. mg kg<sup>-1</sup>, g dm<sup>-3</sup>, cmol<sub>c</sub> kg<sup>-1</sup>, etc.) across components. Compositional datasets constrained to a closed space between 0 and 100% are amenable to normality tests after projecting them into a real space using log ratio transformations.

One of the log ratio transformations is the centered log ratio (*clr*) developed by [26]. The *clr* is a log ratio contrast between the concentration of any nutrient and the geometric mean across the compositional vector. [34] used the *clr* to convert DRIS into Compositional Nutrient Diagnosis (CND-*clr*), hence correcting inherent biases generated by DRIS. [35] and [36] modeled the time change of ion activities in soils and nutrient solutions using *clr*. However, because *clr* generates a singular matrix (the *clr* variates sum up to 0), one *clr* value should be removed (e.g. that of the filling value) in multivariate analysis. In addition, outliers may considerably affect log ratios [32]. The diagnostic power of CND-*clr* is decreased by large variations in nutrient levels (e.g. Cu, Zn, Mn contamination by fungicides) that affect the geometric means across concentrations. Nevertheless, the *clr* transformation is useful to conduct exploratory analyses on compositional data [37].
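As a minimal sketch of the *clr* described above, the following code takes the log of each component relative to the geometric mean of the composition. The tissue values for K, Ca and Mg are hypothetical and serve only as an illustration; N and P reuse values quoted elsewhere in the text.

```python
import math


def clr(composition):
    """Centered log ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(c) for c in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical tissue composition in % dry mass (N, P, K, Ca, Mg).
tissue = [2.50, 0.15, 1.80, 1.20, 0.30]
clr_values = clr(tissue)

# The same tissue expressed in g/kg (ten times larger numbers).
clr_rescaled = clr([10.0 * c for c in tissue])

print(clr_values)
print(sum(clr_values))  # zero up to rounding error
```

The *clr* variates sum to zero, which is the source of the singularity mentioned above, and they do not change when all components are expressed in another unit.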

The additive log ratio or *alr* [26], computed as ln(*x*/*x<sub>D</sub>*), is the log ratio between any component *x* and a reference component *x<sub>D</sub>*. [17] used nitrogen as reference component (N=100%) to produce a stoichiometric N:P:K:Ca:Mg rule for adjusting nutrient needs of tree seedlings. If a tissue contains 2.50% N and 0.15% P, the Redfield N/P ratio [38] is 16.7 and the corresponding *alr* [P/N] value is ln(0.15/2.50) = -2.81. Other stoichiometric rules have been proposed, such as the C:N:P:S rule for humus formation [4]. There are *D*-1 *alr* variables in a *D*-part composition because one component is sacrificed as denominator. The *alrs* are oblique to each other and are thus difficult to rectify and interpret [24]. Orthogonal balances are log ratio contrasts between geometric means of two groups of components that are multiplied by orthogonal coefficients to gain orthogonality [27]. Orthonormal balances are called 'isometric log ratio' coordinates or *ilr* [27] and are illustrated by a mobile and its fulcrums (CoDa dendrogram) [37]. Balances are encoded in a device called sequential binary partition that allocates components in an orderly way to the numerator and denominator of a balance, i.e. the +/- sides of a contrast. The *ilr* of groups of components is thus a rectified ratio between their geometric means. Balances avoid matrix singularity and redundancy: there are *D*-1 independent balances in a *D*-part composition. The orthonormal balance concept was found to be the most appropriate technique in the multivariate [29] and multiple regression [39] analyses in geochemistry [40], plant nutrition [34, 35, 36, 41, 42], the P cycle [43], and soil quality [44, 45].
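The worked N and P example above can be reproduced as follows. This is only a sketch: the helper name `alr` and the dictionary layout are assumptions made for illustration, and the reference component is passed by name.

```python
import math


def alr(composition, reference):
    """Additive log ratio of every part against one reference part."""
    ref_value = composition[reference]
    return {
        name: math.log(value / ref_value)
        for name, value in composition.items()
        if name != reference  # the reference is sacrificed as denominator
    }


tissue = {"N": 2.50, "P": 0.15}  # % dry mass, values quoted in the text
n_to_p = tissue["N"] / tissue["P"]
alr_values = alr(tissue, reference="N")

print(round(n_to_p, 1))           # N/P ratio
print(round(alr_values["P"], 2))  # alr [P/N]
```

A *D*-part composition yields *D*-1 *alr* values because the reference component has no *alr* of its own.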

#### **2.1. From CoDa to sound balances**

The sample space of a compositional vector, defined by *S<sup>D</sup>*, is a strictly positive vector of *D* nutrients adding up to some constant *κ*. The closure operation, 𝒞, computes the constant sum assignment as follows:

$$S^{D} = \mathcal{C}(c\_1, c\_2, \dots, c\_D) = \left[ \frac{c\_1 \kappa}{\sum\_{i=1}^{D} c\_i}, \frac{c\_2 \kappa}{\sum\_{i=1}^{D} c\_i}, \dots, \frac{c\_D \kappa}{\sum\_{i=1}^{D} c\_i} \right] \tag{1}$$
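As a rough numerical illustration of the closure operation in Eq. 1, the sketch below rescales a vector of strictly positive components so that it sums to *κ*. The concentrations are hypothetical, and the filling value computed at the end is one common way of completing a composition to its measurement unit, shown here as an assumption rather than as the chapter's procedure.

```python
def closure(components, kappa=100.0):
    """Rescale strictly positive components so that they add up to kappa (Eq. 1)."""
    if any(c <= 0 for c in components):
        raise ValueError("compositional parts must be strictly positive")
    total = sum(components)
    return [c * kappa / total for c in components]


# Hypothetical nutrient concentrations in g/kg dry mass (N, P, K, Ca, Mg).
raw = [25.0, 1.5, 18.0, 12.0, 3.0]

closed = closure(raw, kappa=100.0)  # subcomposition closed to 100%
filling_value = 1000.0 - sum(raw)   # remainder needed to reach 1000 g/kg
full = raw + [filling_value]        # composition completed to the whole

print(closed)
print(filling_value, sum(full))
```

Closure changes the scale but not the ratios between components, which is why log ratios computed before and after closure are identical.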

In Eq. 1, the sum ∑ *c<sub>i</sub>* (*i* = 1, …, *D*) closes the components to some whole such as 1, 100%, or 1000 g kg<sup>-1</sup>, which allows computing a filling value to the unit of measurement. In other cases where the data do not add up to the measurement unit, such as mg dm<sup>-3</sup> or mg L<sup>-1</sup>, the measurement unit just cancels out when components are ratioed.

In general, raw or log-transformed concentration data are analyzed statistically without any *a priori* arrangement of the data. The analyst not only processes such data through a numerically biased procedure, but also relies on a cognitively unstructured path that returns unstructured results that are barely interpretable (Figure 2).

Fortunately, recent progress in compositional data analysis provides means to elaborate structured pathways and interpret results coherently [27]. Indeed, the *ilr* technique transforms a *D*-part composition into *D*-1 pre-defined orthogonal balances of parts projected into a real Euclidean space [24]. Orthogonality is a special case of linear independence where vectors fall perfectly at right angles to each other [46]. The balances can thus be analyzed as additive (undistorted) variables in the Euclidean space, hence without bias. The log ratio of X/Y is also called a log contrast between X and Y because log(X/Y) = log(X) – log(Y). A log ratio can scan the real space (±∞) because ratios may range from large numbers (positive log values) to small fractions (negative log values).

Nutrient Balance as Paradigm of Soil and Plant Chemometrics

http://dx.doi.org/10.5772/53343

**Figure 2.** Unstructured or knowledge-based (structured) pathways of compositional data analysis lead to numerically biased and unbiased interpretations, respectively.

Balances can be illustrated by a CoDa dendrogram [37] where components or groups of components are balanced by analogy to a mobile and its fulcrums (Figure 3). Each part has its own weight, and the balances between parts or groups of parts are the fulcrums (boxplots) equilibrating the system and computed as *ilr*. It can be shown that a relative increase in Ca concentration will change the [Ca | Mg] and [N,P,K | Ca, Mg] balances without affecting the [N,P | K] and [N | P] balances. Transforming compositions to functional balances not only creates orthogonal real variables amenable to linear statistics; it also creates new variables whose interpretation is of interest in itself. Thus the interpretation of relationships between nutrients depends on how balances are conceived using the best science and management options. For example, another balance setup could be defined as [N,P | K, Ca, Mg], [N | P], [K | Ca, Mg] and [Ca | Mg].

A CoDa dendrogram (e.g. Figure 3) is interpreted as follows:

**•** Each fulcrum represents a balance. There are 4 balances for 5 components in Figure 3.

**•** If the fulcrum lies in the center of the horizontal bar, the balance is null. If it lies on the left side of the center, the mean balance is negative and left-side components occupy a larger proportion in the simplex. A fulcrum on the right side indicates a positive balance.

**•** Rectangles located on fulcrums are boxplots.

**•** The length of vertical bars represents the proportion of total variance

Nested balances are encoded in an *ad hoc* sequential binary partition (SBP) that nurtures the ties between groups of components. An SBP is a (*D*-1)×*D* matrix, where parts labelled "+1" (group numerator) are balanced with parts labelled "-1" (group denominator) in each ordered row. A part labelled "0" is excluded. The composition is partitioned sequentially at every ordered row into 2 contrasts until (+1) and (-1) subcompositions each contain a single part. The analyst can use exploratory analysis [37] or refer to current theory and expert knowledge to design the balance scheme. The CoDa dendrogram in Figure 3 is formalized by the SBP in Table 1.
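The SBP described above can be written down directly as a matrix of +1, -1 and 0 labels. The sketch below encodes the partition of Figure 3 and Table 1, converts each row into an orthonormal log contrast, and computes the balances of one composition. It assumes the orthonormal coefficient √(*rs*/(*r*+*s*)) used by the chapter for *ilr* coordinates; the K, Ca and Mg concentrations and all function names are hypothetical.

```python
import math

PARTS = ["N", "P", "K", "Ca", "Mg"]

# Sequential binary partition of Figure 3: one row per balance.
SBP = [
    [+1, +1, +1, -1, -1],  # [N,P,K | Ca,Mg]
    [+1, +1, -1, 0, 0],    # [N,P | K]
    [+1, -1, 0, 0, 0],     # [N | P]
    [0, 0, 0, +1, -1],     # [Ca | Mg]
]


def contrast_matrix(sbp):
    """Turn each SBP row into the weights of an orthonormal log contrast."""
    rows = []
    for row in sbp:
        r, s = row.count(+1), row.count(-1)
        plus = math.sqrt(s / (r * (r + s)))    # weight of each "+1" part
        minus = -math.sqrt(r / (s * (r + s)))  # weight of each "-1" part
        rows.append([plus if v == +1 else minus if v == -1 else 0.0 for v in row])
    return rows


def ilr(composition, sbp):
    """Balances: weighted sums of log concentrations, one per SBP row."""
    logs = [math.log(c) for c in composition]
    return [sum(w * l for w, l in zip(row, logs)) for row in contrast_matrix(sbp)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


tissue = [2.50, 0.15, 1.80, 1.20, 0.30]  # % dry mass; K, Ca, Mg are hypothetical
balances = ilr(tissue, SBP)
contrasts = contrast_matrix(SBP)

for row, value in zip(SBP, balances):
    print(row, round(value, 3))
```

A 5-part composition yields four balances, and the contrast rows are mutually orthogonal with unit length, which is what makes the balances free of redundancy.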


**Figure 3.** Balances between N, P, K, Ca, and Mg (five weight variables) are illustrated by a mobile and its fulcrums (four balance variables) where N, P, and K are contrasted with Ca and Mg, N and P with K, N with P, and Ca with Mg.

| N | P | K | Ca | Mg | Balance between groups of components | *r* | *s* | *ilr* computation |
|---|---|---|---|---|---|---|---|---|
| +1 | +1 | +1 | -1 | -1 | [N,P,K \| Ca,Mg] | 3 | 2 | √(3×2/(3+2)) ln[(N×P×K)<sup>1/3</sup> / (Ca×Mg)<sup>1/2</sup>] |
| +1 | +1 | -1 | 0 | 0 | [N,P \| K] | 2 | 1 | √(2×1/(2+1)) ln[(N×P)<sup>1/2</sup> / K] |
| +1 | -1 | 0 | 0 | 0 | [N \| P] | 1 | 1 | √(1×1/(1+1)) ln(N / P) |
| 0 | 0 | 0 | +1 | -1 | [Ca \| Mg] | 1 | 1 | √(1×1/(1+1)) ln(Ca / Mg) |

**Table 1.** Sequential binary partition defining macronutrient balances.

In Table 1, the sequential binary partition of nutrients encodes the balances between two geometric means across the + components at numerator and the – components at denominator. The orthogonal coefficient of a log contrast is computed from the number of + and – components in each binary partition. The balances between two subcompositions are orthogonal log ratio contrasts between geometric means of the "+1" and "-1" groups. The *j*th *ilr* coordinate is computed as follows [24]:

$$ilr\_j = \sqrt{\frac{rs}{r+s}} \ln \frac{g(c\_{+})}{g(c\_{-})}, \quad \text{with } j = 1, 2, \dots, D-1 \tag{2}$$

Where *ilr<sub>j</sub>* is the *j*th isometric log ratio; *g*(*c*<sub>+</sub>) is the geometric mean of components in group "+1", *c*<sub>+</sub>; *g*(*c*<sub>-</sub>) is the geometric mean of components in group "-1", *c*<sub>-</sub>; and *r* and *s* are the numbers of components in groups "+1" and "-1", respectively. Because dual ratios are nested into *g*(*c*<sub>+</sub>) and *g*(*c*<sub>-</sub>), the balances avoid generating redundant ratios. The orthogonal coefficient is computed as √(*rs* / (*r* + *s*)) [27]. For example, a Redfield N/P ratio of 16.7 is converted to *ilr* as √(1×1 / (1 + 1)) ln(16.7) = 1.99. The *ilr* technique is thus not only mathematically elegant, but is also conceptually meaningful.

#### **2.2. Dissimilarity between compositions**

As a result of orthogonality, the Aitchison distance (𝒜) between any two compositions is computed as a Euclidean distance across the selected *ilr* coordinates as follows [47]:

$$\mathcal{A} = \sqrt{\sum\_{j=1}^{D-1} (ilr\_{\ j} - ilr\_{\ j}^\*)^2} \tag{3}$$

Where *ilr<sub>j</sub>* is the *j*th *ilr* of a given composition and *ilr<sub>j</sub>*<sup>\*</sup> is the corresponding *ilr* for the reference composition. Selecting alternative SBPs to test and interpret other balances in the system under study just rotates the orthogonal axes of the *ilr* coordinates without affecting 𝒜. The Aitchison distances computed across *ilr* or *clr* values are identical [24]. [34] rectified DRIS to fit into *clr*. When computed from dual ratios and nutrient indices [13], using the same reference population as for the Aitchison distance, the DRIS nutrient imbalance index appeared to be slightly distorted and noisy (Figure 4). Tissue analyses in Figure 4 were obtained from a survey across guava (*Psidium guajava*) orchards in the state of São Paulo, Brazil. The noise and distortion observed in Figure 4 are attributable to numerical biases in DRIS results.
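The invariance properties stated above can be checked numerically. The sketch below computes the Aitchison distance between a diagnosed and a reference composition in three ways: across *clr* values, across the balances of Figure 3, and across the alternative balance setup [N,P | K,Ca,Mg], [N | P], [K | Ca,Mg], [Ca | Mg]. Both compositions are hypothetical, and the helper functions are illustrative assumptions rather than the chapter's own code.

```python
import math


def clr(x):
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]


def ilr(x, sbp):
    """Balances from an SBP: sqrt(rs/(r+s)) * ln(g(c+)/g(c-)) for each row."""
    logs = [math.log(v) for v in x]
    out = []
    for row in sbp:
        plus = [l for l, v in zip(logs, row) if v == +1]
        minus = [l for l, v in zip(logs, row) if v == -1]
        r, s = len(plus), len(minus)
        out.append(math.sqrt(r * s / (r + s)) * (sum(plus) / r - sum(minus) / s))
    return out


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


SBP_FIGURE_3 = [[1, 1, 1, -1, -1], [1, 1, -1, 0, 0], [1, -1, 0, 0, 0], [0, 0, 0, 1, -1]]
SBP_ALTERNATIVE = [[1, 1, -1, -1, -1], [1, -1, 0, 0, 0], [0, 0, 1, -1, -1], [0, 0, 0, 1, -1]]

diagnosed = [2.10, 0.12, 1.90, 1.50, 0.25]  # hypothetical N, P, K, Ca, Mg (%)
reference = [2.50, 0.15, 1.80, 1.20, 0.30]  # hypothetical reference composition

d_clr = euclidean(clr(diagnosed), clr(reference))
d_ilr = euclidean(ilr(diagnosed, SBP_FIGURE_3), ilr(reference, SBP_FIGURE_3))
d_alt = euclidean(ilr(diagnosed, SBP_ALTERNATIVE), ilr(reference, SBP_ALTERNATIVE))

print(d_clr, d_ilr, d_alt)
```

The three values agree up to rounding error, consistent with the statement that changing the SBP only rotates the *ilr* axes.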

On the other hand, the Euclidean distance (ℰ) based on log transformations is biased by the difference between the geometric means times the number of parts as follows [48]:

$$\mathcal{E}^2(\ln(x), \ln(y)) = \mathcal{A}^2 + D \left( \ln \frac{g(x)}{g(y)} \right)^2 \ge \mathcal{A}^2 \tag{4}$$
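The decomposition of the squared Euclidean distance between log-transformed compositions into the squared Aitchison distance plus *D* times the squared log ratio of geometric means can be verified numerically. The sketch below does so for two hypothetical compositions; the Aitchison distance is computed across *clr* values, which the text states is equivalent to computing it across *ilr* coordinates.

```python
import math


def mean_log(x):
    """Log of the geometric mean of a composition."""
    return sum(math.log(v) for v in x) / len(x)


def aitchison_squared(x, y):
    gx, gy = mean_log(x), mean_log(y)
    return sum(((math.log(a) - gx) - (math.log(b) - gy)) ** 2 for a, b in zip(x, y))


def log_euclidean_squared(x, y):
    return sum((math.log(a) - math.log(b)) ** 2 for a, b in zip(x, y))


x = [2.10, 0.12, 1.90, 1.50, 0.25]  # hypothetical composition
y = [2.50, 0.15, 1.80, 1.20, 0.30]  # hypothetical reference

e2 = log_euclidean_squared(x, y)
a2 = aitchison_squared(x, y)
bias = len(x) * (mean_log(x) - mean_log(y)) ** 2  # D * (ln g(x)/g(y))^2

print(e2, a2 + bias)
```

Because the bias term is a square, the distance across log-transformed data can never be smaller than the Aitchison distance.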

In plant nutrition studies [49], the Mahalanobis distance (ℳ) may be preferred to the Euclidean distance because the former takes into account the covariance structure of the data [29] (as illustrated in Figure 5) and has a *χ*<sup>2</sup> distribution [50,51]. The ℳ<sup>2</sup> is computed as follows:

$$\mathcal{M}^2 = (x - \bar{x})^T \, COV^{-1} (x - \bar{x}) \tag{5}$$

Where *x̄* is the mean and *COV* is the covariance matrix. Both 𝒜 and ℳ computed across log-transformed data are higher than their counterparts computed across balances, indicating systematic upper bias using natural log compared to *ilr* transformations (Figure 6). Tissue analyses in Figure 6 were obtained from the same guava orchard survey as above.
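To make Eq. 5 concrete, the following sketch computes the squared Mahalanobis distance for observations described by two balances. It is restricted to two coordinates so that the covariance matrix can be inverted by hand; the balance values are hypothetical, and the sample covariance (divisor *n*-1) is an assumption made for this illustration.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def covariance_2x2(rows):
    """Sample covariance matrix (divisor n-1) of two coordinates."""
    n = len(rows)
    mx, my = mean_vector(rows)
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_squared(point, mean, cov):
    """(x - mean)^T COV^-1 (x - mean) with an explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)


# Hypothetical pairs of balances for five reference specimens.
specimens = [(0.5, 1.9), (0.7, 2.1), (0.6, 1.8), (0.9, 2.3), (0.4, 2.0)]
center = mean_vector(specimens)
cov = covariance_2x2(specimens)
m2_values = [mahalanobis_squared(p, center, cov) for p in specimens]

print(m2_values)
```

Unlike the Euclidean distance, this measure rescales each direction by its variance and accounts for the correlation between balances, as illustrated in Figure 5.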

**Figure 4.** Distance from a reference composition computed using DRIS versus the Aitchison distance.

**Figure 5.** The Euclidean distance is circular while the Mahalanobis distance (*M*) is elliptical. The blue ellipse represents a line of equidistant points in terms of *M* that scales data to the variance in each direction. The *M* distances between the green points and the center are equal. However, the Euclidean distance between each green point and the center is different, as shown by the Euclidean equidistance pink circles.

#### **2.3. Cate-Nelson analysis**

The Cate-Nelson procedure was developed as a graphical technique to partition percentage yield (yield in control divided by maximum yield with added nutrient) versus soil test [52]. The scatter diagram is subdivided into four quadrants to determine a critical test level and a critical percentage yield by maximizing the number of points in the + quadrants. This technique is analogous to binary classification tests widely used in medical sciences [53], where data in each quadrant are interpreted as true positive (correctly diagnosed as sick), false positive (incorrectly diagnosed as sick), true negative (correctly diagnosed as healthy) and false negative (incorrectly diagnosed as healthy). Applied to soil fertility studies, we can define four classes as follows:

**•** True positive (TP: nutrient imbalance): imbalanced crop (low yield) correctly diagnosed as imbalanced (above critical index).

**•** False positive (FP: type I error): balanced crop (high yield) incorrectly identified as imbalanced (above critical index). FP points indicate luxury consumption of nutrients.

**•** True negative (TN: nutrient balance): balanced crop (high yield) correctly diagnosed as balanced (below critical index).

**•** False negative (FN: type II error): imbalanced crop (low yield) incorrectly identified as balanced (below critical index). FN points show impacts of other limiting factors.

The performance of the test is measured by four indices:

**•** Sensitivity: probability that a low yield is imbalanced as TP/(TP+FN)

**•** Specificity: probability that a high yield is balanced as TN/(TN+FP)

**•** Positive predictive value (PPV): probability that an imbalance diagnosis returns low yield as TP/(TP+FP)

**•** Negative predictive value (NPV): probability that a balance diagnosis returns high yield as TN/(TN+FN)

The performance of the binary classification test is higher when the four indices get closer to unity. However, the maximization of the four indices may not be the most appropriate procedure. Indeed, agronomists are more interested in high PPV than in high specificity.

Using the Cate-Nelson graphical procedure, the TN specimens are selected as reference population after removing outliers. If the number of points is too large, yields are arranged in an ascending order and a two-group partition is computed. The sums of squares between two consecutive groups of observations are iterated as follows:

$$\text{Class sum of squares} = \frac{\left(\sum\_{j=1}^{n\_1} Y\_{1j}\right)^2}{n\_1} + \frac{\left(\sum\_{j=n\_1+1}^{n} Y\_{2j}\right)^2}{n\_2} - \frac{\left(\sum\_{j=1}^{n} Y\_{j}\right)^2}{n} \tag{6}$$

Where *Y*<sub>1*j*</sub> are the class 1 yields, starting with the two lowest soil indices; the remaining yields are in class 2 or *Y*<sub>2*j*</sub>; and *n*<sub>1</sub>, *n*<sub>2</sub> and *n* are the numbers of observations in class 1, in class 2 and across classes, respectively. The last member of the equation is the correction factor. The starting values for maximization of the sums of squares across 𝒜 or ℳ could be the *ilr* means of the top 20 specimens [54]. Due to yield variations between production years, the upper quartile of yields standardized by year of production is an additional option. Because the iterative procedure is very sensitive to extreme values, an *a posteriori* visual adjustment may be necessary to maximize the number of points in opposite quadrants.
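The partitioning and classification steps of this section can be sketched as follows. The code iterates the class sum of squares of Eq. 6 over all two-group splits of yields ordered by imbalance index, then counts the four quadrants and derives the four performance indices. Several choices are assumptions made for this illustration and are not stated in the chapter: the data are hypothetical, observations are ordered by index, the critical index is taken midway between the two observations that straddle the best split, and the critical yield is supplied by the analyst.

```python
def class_sum_of_squares(yields, n1):
    """Eq. 6 for a split after the first n1 yields (ordered by index)."""
    n = len(yields)
    n2 = n - n1
    return (sum(yields[:n1]) ** 2 / n1
            + sum(yields[n1:]) ** 2 / n2
            - sum(yields) ** 2 / n)


def critical_index(index, yields):
    """Split maximizing the class sum of squares, keeping 2+ points per class."""
    pairs = sorted(zip(index, yields))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    best_n1 = max(range(2, len(ys) - 1), key=lambda n1: class_sum_of_squares(ys, n1))
    return (xs[best_n1 - 1] + xs[best_n1]) / 2, best_n1


def quadrants(index, yields, crit_index, crit_yield):
    """Count TP, FP, TN, FN with 'positive' meaning diagnosed as imbalanced."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for x, y in zip(index, yields):
        imbalanced = x > crit_index
        low_yield = y < crit_yield
        if imbalanced:
            counts["TP" if low_yield else "FP"] += 1
        else:
            counts["FN" if low_yield else "TN"] += 1
    return counts


def performance(c):
    return {
        "sensitivity": c["TP"] / (c["TP"] + c["FN"]),
        "specificity": c["TN"] / (c["TN"] + c["FP"]),
        "PPV": c["TP"] / (c["TP"] + c["FP"]),
        "NPV": c["TN"] / (c["TN"] + c["FN"]),
    }


# Hypothetical imbalance indices and yields (t/ha).
index = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 1.2, 1.4, 1.5, 1.6, 1.8, 2.0]
yields = [10, 9, 10, 5, 9, 10, 4, 5, 4, 9, 5, 4]

crit, best_n1 = critical_index(index, yields)
counts = quadrants(index, yields, crit, crit_yield=7.0)
scores = performance(counts)
print(crit, counts, scores)
```

With these data the split falls after the sixth observation; one low-yielding specimen below the critical index is counted as FN and one high-yielding specimen above it as FP.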

**Figure 6.** Relationships between the Euclidean, Aitchison and Mahalanobis distances.

#### **2.4. Statistics**

In this chapter, statistics computed across compositional data were performed in the R statistical environment [55]. Compositional data analysis was conducted using the R "compositions" package [56]. Data distribution was tested using the Anderson-Darling normality test [57] in the "nortest" package [58]. Multivariate outliers were removed using ℳ computed in the R "mvoutlier" package [59]. Linear discriminant analysis (LDA) was used as a statistical ordination technique that allows computing linear combinations of variables that best discriminate groups. Multiple regression analysis was conducted using *ilr* [39] and compared to raw data. After completing the statistical analysis, the balances could be back-transformed to the familiar concentration units using the *D*-1 *ilr* values and the sum constraint.

**3. Cationic balances in tropical soils**

#### **3.1. Sequential binary partition**

The percentage base saturation is the proportion of soil cation exchange capacity (CEC) occupied by a given cation. The soil compositional vector is defined as follows [12]:

$$S^{4} = (K, Ca, Mg, H + Al) \tag{7}$$

As illustrated in Figure 7, the first contrast, [K | Ca, Mg, H+Al], balances the K against divalent cations and acidity to enable adjusting the K fertilization to soil basic acid-base conditions as modified by liming.

The second contrast [Ca, Mg | H+Al] is the acid-base contrast for determining lime requirements while the [Ca | Mg] balance reflects the Ca:Mg ratio in soils adjustable by the liming materials. Alternative SBPs could also be elaborated such as [K, Ca, Mg | (H+Al)], [K | Ca, Mg] and [Ca | Mg] balances that reflect the BCSR model of [12]. The selected sequential binary partition for cationic balances is presented in Table 2.

For example, if a soil contains 2.9 mmol<sub>c</sub> K dm<sup>-3</sup>, 20 mmol<sub>c</sub> Ca dm<sup>-3</sup>, 5 mmol<sub>c</sub> Mg dm<sup>-3</sup>, and 23 mmol<sub>c</sub> H+Al dm<sup>-3</sup>, cationic balances are computed as follows:

$$(1)\ [K \mid Ca, Mg, H + Al] = \sqrt{\frac{1 \times 3}{1 + 3}} \ln \left( \frac{2.9}{\sqrt[3]{20 \times 5 \times 23}} \right) = -1.312;$$

$$(2)\ [K \mid Ca, Mg] = \sqrt{\frac{1 \times 2}{1 + 2}} \ln \left( \frac{2.9}{\sqrt{20 \times 5}} \right) = -1.011; \text{ and}$$

$$(3)\ [Ca \mid Mg] = \sqrt{\frac{1 \times 1}{1 + 1}} \ln \left( \frac{20}{5} \right) = 0.980.$$

Note that the K fertilization would depend on soil acidity as well as levels of exchangeable Ca and Mg in the soil. We thus expect the K index and the K balance to be similarly related to fruit yield if the *ceteris paribus* assumption applies to exchangeable Ca, Mg, and acidity in this soil-plant system.
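As a rough check of the cationic balance example (a soil with 2.9, 20, 5 and 23 mmol<sub>c</sub> dm<sup>-3</sup> of K, Ca, Mg and H+Al) and of the back-transformation mentioned in the Statistics section, the sketch below recomputes the three balances and then recovers the original concentrations from *D*-1 balances and the sum constraint. The chapter performed its analyses in R; this Python version, its function names, and the use of the partition [K | Ca,Mg,H+Al], [Ca,Mg | H+Al], [Ca | Mg] for the round trip are assumptions made for illustration.

```python
import math


def balance(plus, minus):
    """sqrt(rs/(r+s)) * ln(g(plus)/g(minus)) for two groups of concentrations."""
    r, s = len(plus), len(minus)
    mean_log = lambda parts: sum(math.log(p) for p in parts) / len(parts)
    return math.sqrt(r * s / (r + s)) * (mean_log(plus) - mean_log(minus))


def ilr(x, sbp):
    return [balance([v for v, k in zip(x, row) if k == +1],
                    [v for v, k in zip(x, row) if k == -1]) for row in sbp]


def ilr_inverse(balances, sbp, total):
    """Back-transform D-1 balances to concentrations adding up to `total`."""
    d = len(sbp[0])
    clr = [0.0] * d
    for b, row in zip(balances, sbp):
        r, s = row.count(+1), row.count(-1)
        plus = math.sqrt(s / (r * (r + s)))
        minus = -math.sqrt(r / (s * (r + s)))
        for i, k in enumerate(row):
            clr[i] += b * (plus if k == +1 else minus if k == -1 else 0.0)
    expo = [math.exp(c) for c in clr]
    return [total * e / sum(expo) for e in expo]


K, Ca, Mg, HAl = 2.9, 20.0, 5.0, 23.0  # mmolc/dm3, values quoted in the text

b1 = balance([K], [Ca, Mg, HAl])  # [K | Ca, Mg, H+Al]
b2 = balance([K], [Ca, Mg])       # [K | Ca, Mg]
b3 = balance([Ca], [Mg])          # [Ca | Mg]
print(round(b1, 3), round(b2, 3), round(b3, 3))

# Round trip with a complete SBP over (K, Ca, Mg, H+Al).
SBP = [[+1, -1, -1, -1], [0, +1, +1, -1], [0, +1, -1, 0]]
soil = [K, Ca, Mg, HAl]
recovered = ilr_inverse(ilr(soil, SBP), SBP, total=sum(soil))
print(recovered)
```

The round trip works only with a complete SBP of *D*-1 rows; balance (2) in the example, [K | Ca, Mg], belongs to the alternative partition and is therefore computed separately.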
