**3.3 Tissue nutrient diagnosis**

Early workers proposed to classify the results of tissue tests, which are continuous variables, using concentration ranges and critical values such as poverty adjustment (deficiency), critical percentage, nutrient sufficiency, and luxury consumption or excess (including antagonism and toxicity) [48–50]. The critical percentage was the tipping point on the response curve, located at 90–95% of maximum yield. Nutrients were diagnosed separately rather than as unique combinations of interacting nutrients. Although the reject/accept dichotomania led to considerable uncertainty in interpretation [17], the one-nutrient-at-a-time approach is still commonly used today. Holland [51] suggested using methods of multivariate analysis to handle tissue compositions as a whole rather than as separate components, but this ignored the numerical pathologies of using inherently interrelated raw concentration values.

Dual ratios were thought to account for nutrient interactions [52]. The Diagnosis and Recommendation Integrated System (DRIS) was developed to handle nutrient ratios [53, 54]. The DRIS required computing the mean and variance of dual ratios but did not fit into any method of multivariate analysis. Much earlier, [14] had already developed a concept of optimum combinations of interactive nutrients within a ternary diagram (**Figure 3**). Because plants show various degrees of plasticity in response to growing conditions [55–57], they can adjust nutrient acquisition to nutrient stress [58–61]. This fits well into the realm of Compositional Data Analysis.

Because compositional vectors convey relative information, one should first 'think ratios' but, realizing that quotients are more difficult to handle than sums or differences, 'think logratios' [62]. Log ratios are log contrasts between components at numerator and denominator, respectively. While compositional data are constrained to the compositional space (e.g., 100%), log ratios can scan the real space, which makes it possible to conduct statistical analyses and return confidence intervals without constraints. It was not until [12] developed the theory of Compositional Data Analysis (CoDa) that the ternary diagram could be expanded to more than three nutrients.

The Compositional Nutrient Diagnosis (CND) avoided several computational pathologies in DRIS, such as using different measurement units for macro- and micronutrients, pairwise rather than multivariate ratios, non-normal distribution, use of a dry matter basis as a separating component, assumed additivity of nutrient functions, non-symmetrical functions between dual ratios and their inverse, and non-symmetrical nutrient ratio and product functions. The CoDa also allowed diagnosing multinutrient ratios in the Euclidean space [16] and conducting multivariate analyses in plant ionomics [58].

#### **Figure 3.**

*Area of optimum balances between N, P and K in plant tissues, uncentered (left) or centered (right) within a ternary diagram using the CoDaPack 2.01 freeware (ellipses with p = 0.10, 0.05, and 0.01, respectively).*

*Machine Learning, Compositional and Fractal Models to Diagnose Soil Quality and Plant… DOI: http://dx.doi.org/10.5772/intechopen.98896*

In CoDa, the simplex is closed to the measurement unit using a filling value computed as follows:

$$F_v = 1000 - \sum_{i=1}^{D} c_i \tag{8}$$

Where *F<sub>v</sub>* is the filling value for the unit g kg<sup>−1</sup>, *D* is the number of quantified components in the *D*-part composition, and *c<sub>i</sub>* is the concentration of each quantified part. The filling value is required to back-transform log ratio means into original concentration values. The centered log ratio [*clr* = ln(*x<sub>i</sub>*/*G*)] integrates all pairwise ratios into a single multinutrient expression, as follows for N:

$$\text{clr}_{\text{N}} = \ln\left(\frac{\text{N}}{G}\right) = \ln\left[\left(\frac{\text{N}}{\text{N}} \times \frac{\text{N}}{\text{P}} \times \dots \times \frac{\text{N}}{F_v}\right)^{1/D}\right] \tag{9}$$

Where *clr* is the centered log ratio, *x<sub>i</sub>* is a component of the compositional simplex, and *G* is the geometric mean across components including the filling value, all expressed in exactly the same measurement unit. For a plant tissue analysis showing 4% N, 0.25% P and 5% K, the filling value is 100% − (4% + 0.25% + 5%) = 90.75%. The *clr* value for N in that 4-part composition is computed as follows:

$$\text{clr}_{\text{N}} = \ln \left( \frac{4}{\left( 4 \times 0.25 \times 5 \times 90.75 \right)^{0.25}} \right) = -0.143 \tag{10}$$
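The arithmetic of Eqs. (8) to (10) can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The function name `clr` and the dictionary layout are illustrative choices, not part of any published CND tool.

```python
import math

def clr(composition):
    """Centered log ratios of a closed composition given as {name: value}."""
    # Log of the geometric mean G across all parts, filling value included.
    log_g = sum(math.log(v) for v in composition.values()) / len(composition)
    return {name: math.log(v) - log_g for name, v in composition.items()}

# Tissue analysis in %, as in the worked example of Eq. (10).
nutrients = {"N": 4.0, "P": 0.25, "K": 5.0}
# The filling value closes the composition to the measurement unit (100 %).
nutrients_closed = dict(nutrients, Fv=100.0 - sum(nutrients.values()))

clr_values = clr(nutrients_closed)
print(round(clr_values["N"], 3))  # -0.143
```

By construction the *clr* values of a composition sum to zero, which is one quick way to check an implementation.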

Euclidean distance *ε* can be computed between two tissue states, one being diagnosed and another being used as benchmark composition, using *clr* or *ilr* as follows:

$$\varepsilon = \sqrt{\sum_{k=1}^{D} \left( \text{clr}_k - \text{clr}_k^{*} \right)^{2}} = \sqrt{\sum_{k=1}^{D-1} \left( \text{ilr}_k - \text{ilr}_k^{*} \right)^{2}} \tag{11}$$

The *ilr* has the advantage over *clr* that Euclidean distances can be computed across the selected Euclidean dimensions (**Figure 4**). Micronutrients can be balanced separately to avoid large variations due to tissue contamination. Moreover, macronutrients with concentrations moving in the same direction with time (N, P, K vs. Ca, Mg) [63, 64] can be set apart to address timelessness (**Figure 5**).
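The equality of the *clr* and *ilr* distances in Eq. (11) can be verified numerically. The sketch below assumes one particular orthonormal basis (each balance contrasts the first *k* parts with part *k* + 1), which is not the partition shown in Figure 4. The two compositions are hypothetical values chosen for illustration.

```python
import math

def clr(parts):
    """Centered log ratios of a list of strictly positive parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ilr(parts):
    """D-1 orthonormal balances: parts 1..k contrasted with part k+1."""
    logs = [math.log(p) for p in parts]
    coords = []
    for k in range(1, len(parts)):
        mean_first_k = sum(logs[:k]) / k
        coords.append(math.sqrt(k / (k + 1)) * (mean_first_k - logs[k]))
    return coords

def euclidean(u, v):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical compositions in % (N, P, K, Fv); each sums to 100.
diagnosed = [3.0, 0.20, 4.0, 92.80]
benchmark = [4.0, 0.25, 5.0, 90.75]

eps_clr = euclidean(clr(diagnosed), clr(benchmark))
eps_ilr = euclidean(ilr(diagnosed), ilr(benchmark))
print(eps_clr, eps_ilr)  # the two distances agree up to rounding error
```

Any other orthonormal balance design should return the same distance. What changes is how the distance is split among interpretable dimensions.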

The CND based on *clr* aimed initially to replace DRIS for regional diagnosis [16, 42, 65–80]. Thereafter, a website service was made available to Brazilian growers (https://www.registro.unesp.br/#!/sites/cnd/). The difference between the *clr* value of the diagnosed specimen (*clr<sub>j</sub>*) and the mean *clr* value of the reference subpopulation (*clr*<sup>∗</sup><sub>*j*</sub>) of true negative (high-yielding and nutritionally balanced) specimens, standardized by the standard deviation of that subpopulation (*SD*<sup>∗</sup><sub>*j*</sub>), ranked nutrients in the order of their limitation to yield, as follows [80]:

$$\text{Index}_{\text{clr}_j} = \frac{\text{clr}_j - \overline{\text{clr}_j^{*}}}{\text{SD}_j^{*}} \tag{12}$$
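As an illustration of Eq. (12), the sketch below computes standardized *clr* indices against a small reference subpopulation and sorts the parts from the most negative index upward. All compositions are hypothetical. A real reference subpopulation would hold many more true negative specimens.

```python
import math
from statistics import mean, stdev

def clr(composition):
    """Centered log ratios of a closed composition given as {name: value}."""
    log_g = sum(math.log(v) for v in composition.values()) / len(composition)
    return {k: math.log(v) - log_g for k, v in composition.items()}

# Hypothetical reference subpopulation of true negative specimens (%).
reference = [
    {"N": 4.0, "P": 0.25, "K": 5.0, "Fv": 90.75},
    {"N": 3.8, "P": 0.27, "K": 4.6, "Fv": 91.33},
    {"N": 4.2, "P": 0.24, "K": 5.3, "Fv": 90.26},
    {"N": 4.1, "P": 0.26, "K": 4.9, "Fv": 90.74},
]
# Hypothetical diagnosed specimen with a low P concentration.
diagnosed = {"N": 4.0, "P": 0.15, "K": 5.2, "Fv": 90.65}

ref_clr = [clr(specimen) for specimen in reference]
diag_clr = clr(diagnosed)

indices = {}
for part in diagnosed:
    ref_values = [c[part] for c in ref_clr]
    # Eq. (12): difference from the reference mean, in SD units.
    indices[part] = (diag_clr[part] - mean(ref_values)) / stdev(ref_values)

# Most negative index first, i.e. the largest relative shortage.
ranking = sorted(indices, key=indices.get)
print(ranking[0], round(indices[ranking[0]], 1))
```

Here P comes out first because its *clr* value lies far below the reference mean, while the other parts are pushed slightly above theirs.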

At that time, the reference subpopulation was selected at regional scale using the Cate-Nelson partitioning procedure by iterating the Mahalanobis distance *M* to maximize classification accuracy. *M* was computed as follows:

$$\mathcal{M}_{ilr} = \sqrt{\sum_{j=1}^{D-1} \left( ilr_j - ilr_j^{*} \right) \text{COV}^{-1} \left( ilr_j - ilr_j^{*} \right)} \text{ or} \tag{13}$$

$$\mathcal{M}_{\text{clr}} \approx \sqrt{\sum_{j=1}^{D-1} \left( \text{clr}_j - \text{clr}_j^{*} \right) \text{VAR}^{-1} \left( \text{clr}_j - \text{clr}_j^{*} \right)} \tag{14}$$

#### **Figure 4.**

*Balance dendrogram of tissue nutrient compositions of peach trees in southern Brazil, addressing micro- and macronutrients, then macronutrients moving in different directions with time.*

*M*<sup>2</sup> is distributed like a *χ*<sup>2</sup> variable. The variance matrix is used where *clr* values are relatively independent of each other [80]. The use of *D* *clr* variables leads to singularity of the covariance matrix, which requires removing one *clr* value, generally that of the filling value. Filzmoser et al. [81] recommended using the *ilr* transformation rather than *clr* or the ordinary log transformation to conduct multivariate analysis due to the advantageous orthonormal basis of *ilr* variables.
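A small numerical sketch may help to see how the Mahalanobis distance of Eq. (13) relates to the *χ*<sup>2</sup> distribution. It assumes a 3-part composition, hence two *ilr* coordinates, so that the 2 × 2 covariance matrix can be inverted by hand. The *ilr* differences and the covariance matrix are hypothetical. With two degrees of freedom, the *χ*<sup>2</sup> survival function reduces to exp(−*x*/2).

```python
import math

def mahalanobis_2d(delta, cov):
    """Mahalanobis distance for two ilr differences and a 2 x 2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # A zero or negative determinant signals a singular or invalid matrix.
    if det <= 0:
        raise ValueError("covariance matrix cannot be inverted safely")
    # Explicit inverse of a 2 x 2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    d1, d2 = delta
    m_squared = (d1 * (inv[0][0] * d1 + inv[0][1] * d2)
                 + d2 * (inv[1][0] * d1 + inv[1][1] * d2))
    return math.sqrt(m_squared)

# Hypothetical ilr differences (diagnosed minus reference mean) and
# hypothetical covariance matrix of the reference subpopulation.
delta_ilr = [0.30, -0.10]
cov_ref = [[0.04, 0.01], [0.01, 0.02]]

m = mahalanobis_2d(delta_ilr, cov_ref)
# Chi-square survival function for 2 degrees of freedom: exp(-x / 2).
p_value = math.exp(-m ** 2 / 2)
print(round(m ** 2, 3), round(p_value, 3))  # 4.0 0.135
```

For more than two balances a linear algebra library would normally handle the matrix inversion and the *χ*<sup>2</sup> quantiles.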

The Cate-Nelson procedure returned four quadrants by point counting and thus allowed setting apart the subpopulation of true negative specimens. This avoided including false positive specimens (high-yielding but nutritionally imbalanced) in the reference subpopulation, as was the case for DRIS and other nutrient diagnostic approaches. Quadrants are interpreted as follows:



#### **Figure 5.**

*Time change in N, P, and K concentrations in the leaf tissues of peach trees (data from [64]). Balances between nutrient concentrations moving in the same direction with time are stationary (upper figure). As expected, the balance between [N, P, K] and [Ca, Mg] changes with time (lower figure).*


Model accuracy is determined as follows:

$$\text{Accuracy (\%)} = 100 \times \frac{\text{TN} + \text{TP}}{\text{TN} + \text{TP} + \text{FN} + \text{FP}} \tag{15}$$
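Point counting into four quadrants and the accuracy of Eq. (15) can be sketched as follows. The text defines true negatives as high-yielding and balanced, and false positives as high-yielding but imbalanced. The sketch assumes, by symmetry, that true positives are low-yielding and imbalanced and that false negatives are low-yielding and balanced. The specimens, the cutoffs and the function names are hypothetical.

```python
def quadrant_counts(specimens, yield_cutoff, index_cutoff):
    """Count specimens per quadrant from (yield, imbalance index) pairs."""
    counts = {"TN": 0, "FN": 0, "TP": 0, "FP": 0}
    for crop_yield, imbalance in specimens:
        high_yield = crop_yield >= yield_cutoff
        balanced = imbalance <= index_cutoff
        if high_yield and balanced:
            counts["TN"] += 1   # high-yielding and balanced
        elif high_yield:
            counts["FP"] += 1   # high-yielding but imbalanced
        elif balanced:
            counts["FN"] += 1   # assumed: low-yielding although balanced
        else:
            counts["TP"] += 1   # assumed: low-yielding and imbalanced
    return counts

def accuracy_percent(counts):
    """Eq. (15): share of correctly classified specimens, in percent."""
    return 100.0 * (counts["TN"] + counts["TP"]) / sum(counts.values())

# Hypothetical (yield, imbalance index) pairs.
specimens = [(18, 0.5), (17, 0.8), (20, 1.9), (12, 2.4),
             (10, 3.0), (11, 0.7), (9, 2.2), (19, 0.4)]

counts = quadrant_counts(specimens, yield_cutoff=16, index_cutoff=1.5)
print(counts, accuracy_percent(counts))
```

The imbalance index could be, for example, the Mahalanobis distance from the reference subpopulation. Iterating over the two cutoffs to maximize accuracy is then a matter of repeating the count.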

#### **4. Machine learning methods to process large datasets**

An introduction to machine learning methods is provided in [82]. "When dealing with complexity, mechanistic models become less obvious. System thinking, implying stocks and flows, becomes difficult to tune where species interact through varying functions over space and time … most ecological patterns are nonlinear … Another approach could rely purely on phenomenology with machine learning. Using this approach, we identify key features to predict outcomes using pattern detection".

Machine learning is a family of methods of artificial intelligence that includes object similarity algorithms (k-nearest neighbors), decision trees (e.g., Random Forest), boosted decision trees (e.g., Gradient Boosting), multiple regression, Gaussian methods, neural networks and several others, often tunable with hyperparameters. Machine learning methods can integrate numerous growth-impacting factors, including soil quality indicators such as those documented by technologies of precision agriculture or supported by classical state- or industry-based agronomic models. Documenting as many growth-limiting factors as possible can decrease the number of assumptions required to diagnose nutrient problems at local scale, facilitating side-by-side comparisons. The confusion matrix generated by a machine learning (ML) model in classification mode classifies specimens into four quadrants by point counting, and thus allows setting apart true negative specimens.

Compositional Data Analysis can be combined with machine learning methods to customize plant nutrient requirements for application at local scale where factor interactions shape fertilization decisions [17, 46, 83–86]. After running ML methods, it was suggested to use the *ilr* transformation to compute the Euclidean distance between the diagnosed (*X*) and successful (*x*) compositions, then compute the corresponding perturbation vector to rank nutrients in the order of their limitations to yield [44]. The perturbation vector is computed as follows [87]:

$$p = X \ominus x = \left[ \frac{X_1}{x_1}, \dots, \frac{X_D}{x_D} \right], \text{ hence: } p = \left[ \frac{N}{N^{*}}, \frac{P}{P^{*}}, \dots, \frac{F_v}{F_v^{*}} \right] \text{ or} \tag{16}$$

$$p = \left[\frac{N}{N^{*}} - 1, \frac{P}{P^{*}} - 1, \dots, \frac{F_v}{F_v^{*}} - 1\right]. \tag{17}$$

The perturbation vector resembles the Deviation from Optimum Percentage [88]. Several log ratio transformation techniques other than *clr* and *ilr* are available but have not been tested yet [89].
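The perturbation vector of Eqs. (16) and (17) reduces to part-wise ratios, as the short sketch below illustrates. It follows the equations as written, without re-closing the resulting vector. The two compositions are hypothetical.

```python
def perturbation(diagnosed, reference):
    """Eq. (16): part-wise ratios X_i / x_i between two compositions."""
    return {part: diagnosed[part] / reference[part] for part in diagnosed}

# Hypothetical compositions in %, each closed to 100 with a filling value.
diagnosed = {"N": 3.0, "P": 0.20, "K": 5.5, "Fv": 91.30}
reference = {"N": 4.0, "P": 0.25, "K": 5.0, "Fv": 90.75}

p = perturbation(diagnosed, reference)
# Eq. (17): relative deviation from the reference; zero means no change.
deviation = {part: ratio - 1.0 for part, ratio in p.items()}
# Nutrients sorted from the largest relative shortage to the largest excess.
ranking = sorted((k for k in deviation if k != "Fv"), key=deviation.get)
print(ranking)  # ['N', 'P', 'K']
```

In this example N is 25% and P 20% below the reference, whereas K is 10% above it.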

#### **4.1 Information flow**

A flow of information from data acquisition to dataset organization and fertilizer recommendations at subfield level was described for lowbush blueberry (*Vaccinium angustifolium*) in Quebec [46], cranberry (*Vaccinium macrocarpon*) in Quebec and Wisconsin [85], and several crops in Brazil [17, 83, 84]. Nutrient diagnosis at local scale requires a well-documented dataset, an accurate machine learning model, a reliable model prediction algorithm, and a large set of ecologically diversified true negative specimens (**Figure 6**).

The bottleneck of machine learning models is knowledge gain on the learning curve. As anticipated 200 years ago by Alexander von Humboldt [3], a comprehensive understanding of living systems requires collecting facts and local knowledge in a trustworthy manner. Data can be observational, as provided by growers, or experimental, as retrieved from the published and the gray literature. Data sharing among stakeholders does not suffice to run machine learning. Data must be collected in a uniform way and cleaned of errors. Missing data can be imputed carefully or documented from other databases such as meteorological databases. Thereafter, data distributions must be checked to detect outliers.

#### **Figure 6.**

*Flowchart of nutrient diagnosis in agroecosystems from data collection to fertilizer recommendations.*

A minimum dataset of meaningful features can be selected by adding or removing features (Occam's razor) without losing model accuracy during the model training process. Minimum datasets make sense to stakeholders and facilitate data acquisition at minimum cost and effort. The best-performing machine learning model is then selected. In general, the classification mode (yield class about the yield cutoff) is more accurate than the regression mode. The classification mode returns the probability of exceeding the yield cutoff targeted by the grower.

#### **4.2 Local diagnosis**

Features such as cultivar, rootstock, soil type or climatic conditions have been averaged to generate regional standards as "Frankenstein-built constructs" that may lead to inaccurate diagnosis at local scale where factors interact [17]. The local diagnosis often differs from the regional diagnosis because the heroic assumption that "all controllable and uncontrollable factors but the ones being addressed are at equal or optimum levels" may fail at local scale. Indeed, the regional diagnosis runs counter to growers' heuristics, which compare normal to abnormal situations under similar conditions in their neighborhood [86]. Fertilizer recommendations can be customized using the fertilization regime of the closest compositional neighbors as reference, by modifying regional recommendations, from response curves, or using an optimization algorithm (**Figure 7**).

At local scale, the closest compositional neighbors are the true negative specimens showing similar growing conditions and the smallest compositional Euclidean distance from the diagnosed specimen. The nearest neighbors were said to be located in "Humboldtian loci" or "enchanting islands" ("Ilhas Encantadas" in Portuguese) for a given set of uncontrollable factors. The grower has been pictured by [43] as a compositional parachutist manipulating nutrients as paracords to land on the closest "enchanting island". There, the resources to tackle controllable factors can be used parsimoniously and efficiently to reach trustworthy yield targets. Because the number of successful factor combinations is limited by the size and diversity of datasets, close collaboration is required between stakeholders to collect facts and document local knowledge in a trustworthy manner [6, 7, 90–94].
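The search for the closest compositional neighbors can be sketched as a filter on growing conditions followed by a ranking on the *clr*-based Euclidean distance. The record layout, the single `soil` condition and all values below are hypothetical. A real dataset would match on several uncontrollable factors.

```python
import math

def clr(parts):
    """Centered log ratios of a list of strictly positive parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def compositional_distance(a, b):
    """Euclidean distance between the clr vectors of two compositions."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(clr(a), clr(b))))

def closest_neighbors(composition, soil, true_negatives, k=2):
    """Return the k true negative specimens nearest to the diagnosed one."""
    # Keep only specimens grown under the same (hypothetical) condition.
    comparable = [s for s in true_negatives if s["soil"] == soil]
    comparable.sort(
        key=lambda s: compositional_distance(composition, s["composition"]))
    return comparable[:k]

# Hypothetical true negative specimens, compositions in % (N, P, K, Fv).
true_negatives = [
    {"id": "A", "soil": "sandy", "composition": [4.0, 0.25, 5.0, 90.75]},
    {"id": "B", "soil": "sandy", "composition": [3.2, 0.22, 4.1, 92.48]},
    {"id": "C", "soil": "clay", "composition": [3.1, 0.20, 4.0, 92.70]},
    {"id": "D", "soil": "sandy", "composition": [4.5, 0.30, 5.5, 89.70]},
]

neighbors = closest_neighbors([3.0, 0.20, 4.0, 92.80], "sandy", true_negatives)
print([s["id"] for s in neighbors])  # ['B', 'A']
```

Specimen C is compositionally the nearest of all, but it is excluded because it grew on a different soil. This is the point of restricting the comparison to otherwise similar conditions.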

#### **Figure 7.**

*Fertilization recommendation using a Markov chain random walk algorithm to optimally combine N, P and K dosage to increase yield from 2300 to 5900 kg berry ha<sup>−1</sup> for lowbush blueberry considering a set of corrected site-specific controllable factors (reproduced from [46]).*

#### **Figure 8.**

*Dependence on yield cutoff of the number of true negative (high-yielding and nutritionally balanced) specimens and classification accuracy.*

The decision to fix a yield target in classification mode depends not only on the grower's yield objective, but also on model precision and the number of true negative specimens available as close neighbors. The number of true negative specimens must be high because they provide benchmark compositions and trustworthy yield targets under otherwise comparable growing conditions. As shown in **Figure 8** for the Brazilian peach tree dataset [83], classification accuracy increased slightly while the number of true negative specimens decreased exponentially as the yield target increased. A smaller number of true negative specimens as benchmark compositions limits the model's capacity to select local conditions close to those of the diagnosed specimen. In this case, the decision was to select 16 ton ha<sup>−1</sup> as cutoff yield, a reasonable yield objective.

