**2.1 Geochemical methods**

To analyze geochemical properties and reaction mechanisms, mass balance calculations, Piper diagrams, and Gibbs diagrams [28] are usually applied. The Piper diagram shows the major-element composition of water and categorizes waters into different types. Several software packages, such as PHREEQC, MINTEQ, and The Geochemist's Workbench, can be used to calculate mass balances and saturation indices, model reaction paths, draw Piper diagrams, etc. Through water-rock interaction, major elements and TEs may be released and migrate to other water bodies; therefore, major and trace elements with distinguishing features can be used for source apportionment [69, 70]. However, TEs undergo geochemical processes of adsorption, desorption, mineralization, and dissolution that change their concentrations in water, so univariate analysis alone is neither credible nor robust. Comparatively, isotopic analysis, both stable and radiogenic [30], may be used as a univariate method or combined with other indexes. The widely used isotopic tracers are δ18O and δD in water, 87Sr/86Sr, δ34S and δ18O in sulfate [71, 72], δ15N and δ18O in nitrate [29], etc. The isotopes δ18O and δD in water are used to identify the relations between precipitation and surface/ground water. The strontium isotope ratio of water is strictly controlled by water-rock interaction. For water from a single source, the 87Sr/86Sr ratio reflects the mineralogy of the rocks with which the water has been in contact and does not change along the flow path. Notably, differences in strontium isotope ratio and strontium concentration are caused by the mixing of waters of various origins with specific chemical characteristics and isotopic values. Therefore, the strontium isotope is an ideal tracer for element sources, groundwater movement, and water-rock interaction [70, 73–75].
The use of δ34S and δ18O in sulfate is increasing because sulfate has a wide range of stable isotope compositions and the δ34S value, although derived from multiple sources, remains very close to that of the precursor sulfide mineral. A common anthropogenic source of sulfate is coal and metal mining, where the ores are rich in pyrite and other sulfide minerals. In active and abandoned coal mines, the sulfide minerals may be oxidized and dissolved, releasing trace elements. Sulfate isotopes have demonstrated that groundwater can be contaminated through water-rock interaction in coal mines. Besides natural isotopes, tracers such as injected isotopes and stable organic compounds are introduced into groundwater to delineate flow pathways [71, 72, 76, 77].

To evaluate TE contamination and trace pollution sources in water and solid matrices, several indices are used. The enrichment factor (*EF*) expresses the enrichment level of a certain TE in the environment, as shown in Eq. (1):

$$EF = \left(c\_i / c\_{ref}\right)\_{sample} / \left(B\_i / B\_{ref}\right)\_{background} \tag{1}$$

where *ci* is the measured concentration of the TE in the sample, *cref* is the measured concentration of the reference element, and *Bi* and *Bref* are the background levels of the TE and of the reference element in the same region [41]. An *EF* value close to 1 suggests a weathering origin of the trace element, while a value higher than 1 indicates TE enrichment, probably caused by human activities. An *EF* value between 2 and 5 indicates moderate contamination, and a value higher than 5

**Table 1.** *A list of data mining methods on trace element migration in recent years. Columns: environmental media (soils, dusts, and PM 2.5/PM 10 particles), country/region (China, Spain, Pakistan, Greece, Italy, Iran, USA, India, Canada, Costa Rica, Nigeria, and Armenia), anthropogenic source TEs, data mining method (PCA, FA, CA, PMF, PCA/FA-MLR, PCA-MLR, semi-supervised ML, and regression/Monte Carlo), and references [48–68].*

*Trace Metals in the Environment - New Approaches and Recent Advances*

indicates heavy pollution by TEs [49]. The *EF* is frequently used in combination with data mining methods, or as a verification when tracing contamination sources.

The geo-accumulation index (*Igeo*) is defined using Eq. (2):

$$I\_{\rm geo} = \log\_2\left[c\_i/(1.5B\_i)\right] \tag{2}$$


where *ci* is the measured concentration of the TE, and *Bi* is the background concentration of that particular TE [41].

The Hakanson potential ecological risk index (*RI*) can be used to evaluate the potential ecological risk posed by TEs in water and solid matrices. This comprehensive method considers four factors: the concentration, the type of pollutant, the toxicity level, and the sensitivity of the water body to metal contamination [78, 79].
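As an illustration, Eqs. (1) and (2) can be computed with a few lines of Python; the Pb concentration and the Fe reference values below are hypothetical, chosen only to show how the contamination thresholds are read.

```python
import math

def enrichment_factor(c_i, c_ref, b_i, b_ref):
    """Eq. (1): EF = (c_i / c_ref)_sample / (B_i / B_ref)_background."""
    return (c_i / c_ref) / (b_i / b_ref)

def igeo(c_i, b_i):
    """Eq. (2): Igeo = log2(c_i / (1.5 * B_i))."""
    return math.log2(c_i / (1.5 * b_i))

# Hypothetical Pb measurement, with Fe as the reference element
ef = enrichment_factor(c_i=60.0, c_ref=30000.0, b_i=20.0, b_ref=40000.0)
ig = igeo(c_i=60.0, b_i=20.0)
print(round(ef, 2))  # 4.0 -> moderate contamination (EF between 2 and 5)
print(round(ig, 2))  # 1.0
```

An *EF* of 4 would thus be read as moderate contamination under the 2–5 criterion above.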

#### **2.2 Machine learning**

Studies of environmental processes exhibit spatial variation within data sets. The ability to derive predictions of risk from field data is a critical path forward in understanding the data and applying the information to land and resource management. Multivariate analysis and machine learning methods are precise and robust, and can look inside the phenomena to find mechanisms. Moreover, environmental data are usually organized as matrices; therefore, machine learning is an ideal tool for environmental and geochemical issues. However, the calculations behind machine learning and multivariate analysis are complex, which may hinder their implementation. Thanks to recent advances in predictive modeling, statistical software (R, Python, SPSS, SAS, Minitab, etc.), and computing, the power to do this is within grasp.

The basic principle of ML is to train models on the obtained data for a specific data frame and then apply the models to the target problems. ML methods can be divided into three types: supervised, unsupervised, and semi-supervised learning. When the training data are labeled, it is supervised ML, while unsupervised ML uses no labeled data; semi-supervised ML adds labels to the data during model training. Generally, supervised ML achieves higher precision and robustness than the others. However, geochemical and environmental data are usually unlabeled: when researchers try to identify the source of water, or of TEs in water and solid matrices, the true sources are usually not known in advance. Therefore, unsupervised ML has so far been more widely implemented than supervised and semi-supervised ML.

#### *2.2.1 Unsupervised ML*

Unsupervised ML is undoubtedly the most used in this area. Common unsupervised techniques include principal component analysis (PCA), factor analysis (FA), clustering analysis (CA), positive matrix factorization (PMF), etc.

Within machine learning, PCA is a tool to reduce a high-dimensional matrix to a lower-dimensional one, usually of two to five dimensions. Dimensional reduction is accomplished by transforming the data to a new set of variables (principal components), which are derived from linear combinations of the original variables and ordered in such a way that the first principal components account for most of the variation in all of the original variables [80].

A matrix M, with m observations (rows) and n variates (columns), is processed in the following five steps to form a new matrix with fewer variates.

Step 1: the raw data in the M is standardized;

*Data Mining for Source Apportionment of Trace Elements in Water and Solid Matrix DOI: http://dx.doi.org/10.5772/intechopen.88818*

Step 2: the covariance matrix of the standardized M is calculated;

Step 3: the eigenvalues and eigenvectors of the covariance matrix are calculated;

Step 4: the contribution ratio and cumulative contribution of the eigenvalues are calculated, and the principal components are determined according to mathematical and practical criteria;

Step 5: the loadings of every principal component and the scores of every observation are calculated.

Theoretically, the number of new variates is equal to the number of variates in the original matrix. However, the new variates explain different proportions of the total variance, and the principal components are selected based on this explained ratio. Different criteria are in use: some researchers retain components with eigenvalues larger than 1, while others use the cumulative contribution of the eigenvalues, say 80%.

Suppose M has 30 observations and 10 variates (*x1*, *x2*, …, *x10*), and three principal components (*y1*, *y2*, *y3*) are selected, with eigenvectors *A1*, *A2*, and *A3*, respectively. Each principal component is then a linear combination of all the original variates:

*y1 = A1 · x*, *y2 = A2 · x*, and *y3 = A3 · x*, where *x* = (*x1*, *x2*, …, *x10*) and · denotes the dot product.
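The five steps above can be sketched with NumPy; the random 30 × 10 matrix stands in for real geochemical data, and the 80% cumulative-contribution criterion is used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 10))           # 30 observations, 10 variates

# Step 1: standardize the raw data
Z = (M - M.mean(axis=0)) / M.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: contribution ratio and cumulative contribution of the eigenvalues
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)
n_pc = int(np.searchsorted(cumulative, 0.80) + 1)   # 80% criterion

# Step 5: loadings of each principal component and scores of each observation
loadings = eigvecs[:, :n_pc] * np.sqrt(eigvals[:n_pc])
scores = Z @ eigvecs[:, :n_pc]
print(n_pc, scores.shape)
```

The eigenvalue-larger-than-1 criterion mentioned above would simply replace the `searchsorted` line with a count of eigenvalues exceeding 1.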


After the PCA calculation, some variates have higher loadings on specific principal components, while other variates load on other PCs. It is then inferred that variates with similar patterns in the matrix may behave similarly in the real world, i.e., share a source, migration behavior, or reaction pathway. Theoretically, PCA is similar to clustering analysis, but PCA is not constrained to two dimensions, which allows researchers to mine the inner relationships in the matrix and understand the real world more precisely.

Factor analysis (FA) is based on PCA, follows a similar principle, and aims to obtain similar results, but its applications are fewer than those of PCA.

The data on every variate should be normally distributed. The Kaiser-Meyer-Olkin (KMO) test and Bartlett's sphericity test are usually used to check the suitability of the data for PCA/FA.

Because of these advantages in environmental and geochemical data mining, PCA and FA are widely implemented [1–4, 11, 12, 17, 19, 23, 28–30, 34, 41, 49, 64, 78, 81–83].

For source apportionment of particulate matter and the trace elements within suspended particles, positive matrix factorization (PMF) is usually used. Given a matrix *M* with *f* observations and *n* variates, *M* can be factorized as in Eq. (3):

$$M\_{f \ast n} = W\_{f \ast k} \ast H\_{k \ast n} \tag{3}$$

In a source apportionment problem, *W* stands for the source contributions, *H* stands for the source profiles, and *k* is the number of possible sources. Minimizing the loss function determines the proper *k* value and the matrices *W* and *H*; the number of sources and their contribution ratios can then be inferred. The columns of *W* need to be normalized by their average value across all samples, as shown in Eq. (4):

$$
\overline{w}\_{fk} = \sum\_{f=1}^{n} \frac{w\_{fk}}{n} \tag{4}
$$

where *wfk* are the elements of the matrix *W*. PMF is usually used for source apportionment of particles, such as PM 10 and PM 2.5 [61, 66–68], but is seldom used in other environmental media.
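A minimal sketch of the factorization in Eqs. (3) and (4), using plain non-negative matrix factorization with Lee-Seung multiplicative updates; full PMF additionally weights the residuals by measurement uncertainties, which is omitted here. The synthetic matrix and the choice k = 3 are assumptions for illustration.

```python
import numpy as np

def nmf(M, k, n_iter=2000, seed=0):
    """Non-negative factorization M ~ W @ H via multiplicative updates
    (a simplified stand-in for PMF, without uncertainty weighting)."""
    rng = np.random.default_rng(seed)
    f, n = M.shape
    W = rng.uniform(0.1, 1.0, (f, k))   # source contributions (f x k)
    H = rng.uniform(0.1, 1.0, (k, n))   # source profiles (k x n)
    eps = 1e-12
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic data: f = 50 samples of n = 8 elements mixed from k = 3 sources
rng = np.random.default_rng(1)
M = rng.uniform(0, 1, (50, 3)) @ rng.uniform(0, 1, (3, 8))

W, H = nmf(M, k=3)
W_norm = W / W.mean(axis=0)            # Eq. (4): scale each source column
print(np.abs(M - W @ H).max())         # small reconstruction error
```

In practice, dedicated PMF software additionally explores several *k* values and keeps the one with the least loss, as described above.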

#### *2.2.2 Bayesian network*

A Bayesian network model has been implemented to estimate TE source contributions and evaluate contamination levels [26, 29, 84–86]. The R package SIAR (Stable Isotope Analysis in R) can be run to calculate an isotope mixing model based on the Bayesian network. The mixing model can be written as the equation set in Eq. (5):

$$\begin{aligned} X\_{ij} &= \sum\_{k=1}^{K} p\_k \left( S\_{jk} + c\_{jk} \right) + \varepsilon\_{ij} \\ S\_{jk} &\sim N\left( \mu\_{jk}, \omega\_{jk}^2 \right) \\ c\_{jk} &\sim N\left( \lambda\_{jk}, \tau\_{jk}^2 \right) \\ \varepsilon\_{ij} &\sim N\left( 0, \sigma\_j^2 \right) \end{aligned} \tag{5}$$


where *Xij* is the value of isotope *j* in mixture *i* (*i* = 1, 2, 3, …, *N* and *j* = 1, 2, 3, …, *J*); *Sjk* is the value of source *k* on isotope *j* (*k* = 1, 2, 3, …, *K*); *cjk* is the isotope fractionation factor for isotope *j* on source *k*; and *pk* is the proportion of source *k*, which is estimated by the SIAR model. *Sjk* and *cjk* are normally distributed with mean *μjk* and standard deviation *ωjk*, and mean *λjk* and standard deviation *τjk*, respectively. *εij* is the residual error representing the additional unquantified variation between individual mixtures and is normally distributed with mean 0 and standard deviation *σj*. Monte Carlo algorithms are usually used to solve the Bayesian network equations.
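A crude Monte Carlo sketch of the mixing model in Eq. (5), assuming two isotopes, three sources with hypothetical mean signatures, and a single residual standard deviation; SIAR itself runs a full MCMC sampler rather than this simple importance-weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: J = 2 isotopes, K = 3 sources (rows = source means S_jk)
mu = np.array([[4.0, -2.0], [10.0, 3.0], [1.0, 8.0]])
sigma = 0.5                                        # residual sd
x_obs = 0.5 * mu[0] + 0.3 * mu[1] + 0.2 * mu[2]    # observed mixture X_ij

# Sample proportions p_k from a Dirichlet prior (they sum to 1) and weight
# each draw by the Gaussian likelihood of the observed mixture (Eq. 5)
p_samples = rng.dirichlet(np.ones(3), size=200_000)
pred = p_samples @ mu
log_lik = -0.5 * (((x_obs - pred) / sigma) ** 2).sum(axis=1)
w = np.exp(log_lik - log_lik.max())
p_post = (p_samples * w[:, None]).sum(axis=0) / w.sum()
print(np.round(p_post, 2))   # posterior mean proportions near [0.5, 0.3, 0.2]
```

The posterior mean recovers the mixing proportions used to build the synthetic mixture, which is the quantity *pk* that SIAR reports.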

#### *2.2.3 Decision tree*

The decision tree is a kind of supervised ML comprising a series of techniques to divide samples into different categories, such as the algorithms ID3, C4.5, C5.0, CART, etc. Different algorithms follow the same principle: the observations are split at breakpoints on a variate. The selection of variates and breakpoints provides the basis of each decision, and together the decisions form a tree that serves the user as a decision system. Take the algorithm CART for example: the sample space is split using variate breakpoints, and different split strategies yield different decision efficiency. The Gini coefficient is used as an index: the algorithm calculates the Gini index for every candidate split and keeps the split with the lower Gini coefficient. The advantages of the decision tree are that it is easy to implement and easy to explain. Some researchers have introduced the method to trace sources of nitrogen and TE contamination [38].

Decision tree-derived methods, including random forest and boosting, are widely used in data mining [59]. However, their applications in source apportionment are rare; the obstacle that prohibits the implementation of decision tree methods may be the acquisition of labeled data.
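The CART split selection described above can be sketched as a search for the breakpoint with the lowest weighted Gini impurity; the Pb concentrations and source labels are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(x, y):
    """CART-style search: try every breakpoint on one variate and keep
    the split with the lowest weighted Gini impurity."""
    best = (None, np.inf)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical labeled data: Pb concentration (mg/kg) vs. source class
pb = np.array([5.0, 8.0, 12.0, 40.0, 55.0, 60.0])
src = np.array(["geogenic"] * 3 + ["mining"] * 3)
t, s = best_split(pb, src)
print(t, s)   # best breakpoint 12.0, weighted impurity 0.0
```

A full tree repeats this search recursively on each resulting subset and over all variates, which is what ID3/C4.5/CART automate.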

#### *2.2.4 Artificial neural network*

The artificial neural network (ANN) has been recognized as a powerful supervised ML method and is applied across a wide scope of engineering and research. ANN research is one of the most active areas of ML, with many branches, and it is the basis of deep learning, which is used for image and voice recognition.

In environmental science and geochemistry, the implementation of ANN is not as popular as unsupervised ML. The most important reason is that it is a supervised data mining method: in the data preparation step, the observations need to be labeled, while environmental data usually cannot be labeled or are difficult to label. Nevertheless, some studies have used ANN to predict contamination potential; Mclean et al. reviewed the application of ANN to ambient air pollution [87]. The second obstacle to its implementation is that ANN usually needs a larger amount of data than PCA, decision trees, and some other methods.

A basic ANN model has an input layer, one or several hidden layers, and an output layer. The variates of an observation enter through the input layer, the output layer corresponds to the labeled data, and the hidden layers compute the mapping from input to output. The relationships among the layers are trained by feeding in the variate data and the labels, and the trained model is then used to predict from new input data. Whenever labeled data can be obtained, ANN models are useful for predicting, discriminating, and dividing samples in geochemical and environmental engineering and research.
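A minimal one-hidden-layer network trained by gradient descent illustrates the input/hidden/output structure described above; the two-variate data and binary label are synthetic stand-ins for labeled environmental observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: two variates, binary label (e.g., contaminated or not)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer (input -> hidden -> output), trained by gradient descent
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    H = np.tanh(X @ W1 + b1)           # hidden layer
    out = sigmoid(H @ W2 + b2)         # output layer
    d_out = out - y                    # cross-entropy gradient at the output
    d_H = (d_out @ W2.T) * (1 - H ** 2)
    W2 -= 0.1 * H.T @ d_out / len(X); b2 -= 0.1 * d_out.mean(axis=0)
    W1 -= 0.1 * X.T @ d_H / len(X);   b1 -= 0.1 * d_H.mean(axis=0)

acc = ((out > 0.5) == y).mean()
print(acc)    # training accuracy close to 1.0
```

Real applications would of course use a held-out test set and an established library rather than this hand-rolled loop.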

#### *2.2.5 Discrimination analysis*


Discrimination analysis (DA) is a kind of supervised ML because the data set is labeled in the model training step. DA is conceptually similar to principal component analysis; however, while PCA seeks principal components (new axes) that represent most of the variance, DA seeks axes that best separate the groups, so that the different variates and observations can be divided. Prior to the application of DA, all variables are standardized to eliminate scale differences between them. The absolute discriminant weights then rank the variables by their discriminating power, i.e., the variables with large weights are those that contribute most to differentiating the groups.

In the model training step, if the origin of the samples can be identified, DA can be used to identify sample sources. For example, water intrusion in coal mines may come from different groundwater aquifers, and the different aquifers have specific geochemical characteristics and hazard levels. As part of water hazard control, discrimination models can be set up by training on labeled data, i.e., water samples collected from the different aquifers. Once water intrusion happens, its characteristics are compared against the model to identify the source of the water [27, 33].

The criteria applied in discrimination analysis are mainly distance-based or Bayes-rule-based. When the distance rule is used, the Manhattan distance from a sample to each group is calculated, and the sample is assigned to the group with the smallest distance. The distances have to be calculated pairwise, which constrains the efficiency of model training and implementation. Another popular method is Bayes discrimination: a reasonable way to assign a characteristic sample to a group is to compare the conditional probabilities of the sample falling into the different categories, and the class with the highest conditional probability is the final category. Theoretically, Bayes DA achieves higher accuracy than distance-based DA.
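The Bayes discrimination rule can be sketched as follows, assuming independent Gaussian variates within each group; the two aquifer signatures (hypothetical Na+ and SO42− concentrations in mg/L) stand in for real training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hydrochemistry of two aquifers (Na+ and SO4 2- in mg/L)
aquifer_a = rng.normal([20.0, 50.0], 5.0, size=(100, 2))
aquifer_b = rng.normal([60.0, 10.0], 5.0, size=(100, 2))

def bayes_discriminate(sample, groups, priors):
    """Assign the sample to the class with the highest posterior,
    assuming independent Gaussian variates within each group."""
    scores = []
    for g, prior in zip(groups, priors):
        mu, sd = g.mean(axis=0), g.std(axis=0)
        log_lik = -0.5 * (((sample - mu) / sd) ** 2
                          + np.log(2 * np.pi * sd ** 2)).sum()
        scores.append(np.log(prior) + log_lik)
    return int(np.argmax(scores))

# An intruding water sample chemically close to aquifer B
unknown = np.array([55.0, 12.0])
label = bayes_discriminate(unknown, [aquifer_a, aquifer_b], [0.5, 0.5])
print(["aquifer A", "aquifer B"][label])   # aquifer B
```

A distance-based variant would simply replace the posterior score with the (negative) Manhattan distance to each group mean.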

#### *2.2.6 Semi-supervised machine learning*

Unsupervised ML is easy to carry out but low in accuracy, robustness, reliability, and reproducibility, while the reverse holds for supervised ML. As an improved strategy, semi-supervised ML is an option. When labeled data are not easy to acquire, unsupervised ML can be applied first, and a semi-supervised algorithm then adds labels to the data while the model is being trained. Vesselinov et al. used the non-negative matrix factorization method for blind source separation in a first step; a semi-supervised clustering algorithm was then used to predict the sources of contaminants [37]. Fatehi and Asadi used a hybrid method combining hierarchical clustering and fuzzy c-means clustering to classify soil types [58]. At present, application of this approach to this topic is rare.
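A minimal sketch of the semi-supervised idea, using scikit-learn's self-training wrapper on synthetic data (this illustrates label propagation in general, not the specific NMF workflow of ref. [37]):

```python
# Sketch of semi-supervised learning: a small labeled subset plus many
# unlabeled samples (marked with label -1), with labels propagated to the
# unlabeled data during training. Synthetic, purely illustrative data.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(4.0, 0.5, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)

# Pretend only 5 samples per class are labeled; the rest are marked -1.
y = np.full(200, -1)
y[:5] = 0
y[100:105] = 1

# The wrapper iteratively labels confident samples and retrains the base model.
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
accuracy = (model.predict(X) == y_true).mean()
print(round(accuracy, 2))  # well-separated clusters, so accuracy is near 1.0
```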

*Data Mining for Source Apportionment of Trace Elements in Water and Solid Matrix*
*DOI: http://dx.doi.org/10.5772/intechopen.88818*

## **2.3 Regression**

Regression is easy to use, explain, and understand, and it is a big toolbox in the machine learning workshop. The most popular method is multivariate linear regression; logistic regression, lasso regression, ridge regression, and elastic net regression can also be used. Regression can be combined with other machine learning techniques, such as decision trees [56] and support vector machines. However, regression has distinct shortcomings. First, the data are fitted to a linear or curvilinear function, which may not accord with the real situation. Second, regression is prone to overfitting: although the trained model performs well, disastrous results may be obtained when it is applied in the real environment. Lasso, ridge, and elastic net regression are applied to mitigate this problem. Besides these two issues, another problem may hamper the application of regression: the data are often difficult or impossible to label. In that situation, unsupervised techniques should be used first; once labeled data are acquired, regression methods can be applied [31, 56, 59].

### **2.4 Artificial tracers**

In order to establish ground-surface and ground-ground water relationships, artificial tracers are also used. The chemical tracers sodium chloride, eosine, uranine, and pyranine have been used to analyze spring-groundwater relationships. A conductivity meter and a thermometer were also installed for electrical conductivity (EC) monitoring, and a field fluorimeter was used for tracer detection [31].

#### **2.5 Other methods**

In a study from Alaska, USA, six models were set up to predict soil contamination: random forests, generalized boosted regression, elastic net regression, multivariate adaptive regression splines, a generalized linear model with stepwise selection using Akaike's information criterion, and partial least squares regression. Although the models had similar overall explanatory power, the machine learning models performed much better than the linear models in predictive accuracy and were better able to identify variables of interest and describe non-linear relationships. The machine learning techniques are therefore preferable to the linear models for understanding the mechanisms behind trace element pollutant fate and transport, and they were less vulnerable to errors of omission [59].

## **3. Implementation of the data mining of TE source apportionment**

The environmental media that may be contaminated by trace elements are grouped into four types in this chapter: water, sediment, soil, and particles. In every case, the probable sources of trace elements are listed in order of importance. The main methods used are listed in **Table 1**.

**3.1 TE apportionment in water**

TEs migrate from rock and coal to water through water-rock interaction. The surface water and ground water may then be contaminated.

*3.1.1 Contamination sources of surface water*

In Turkey, the TE sources in a large reservoir were identified. The PCA showed that PC1, PC2, and PC3 include Co/Cr/Fe, Cu/Pb/Zn, and As/Cd, respectively. Combined with correlation analysis, the three PCs were attributed to natural sources, bedrock weathering, and bedrock weathering, respectively [25]. Another study revealed by PCA that mineral pollution, nutrient pollution, and organic pollution are the major latent factors influencing the water quality of the Asi River [88].

Because of the vandalization of pipelines, soil and water may be contaminated. The PCA results showed that the first source, composed of Cd, Cr, Pb, and Mn, was associated with anthropogenic sources such as vehicular emission. The second source, including Cu and Zn, was related to natural geological origin, and the Ni and V were released from natural sources combined with the petroleum contamination [12].

In Ethiopia, water samples were divided into four categories by clustering analysis: a natural cluster, a mixed cluster, an agriculture cluster, and an urban cluster. In the agriculture cluster, VF1 had strong loadings on TN, NO3−, salinity, Fe, NH3, hardness, and Mn, which originate from cultivation; VF2 was associated with turbidity, Chl-α, and Cu, which may come from farming and from excavation sites of quarrying activities. Mg and K loaded mainly on VF3 and VF4, respectively; K is mainly spread when potash fertilizer is used [24].

A supervised ML technique, discriminant analysis, was applied together with clustering analysis to classify samples and reveal the spatiotemporal distribution of trace elements in surface water in the USA. Sources of salt ions (magnesium, chloride, and sodium) vary from natural sources (oceans, atmospheric deposition, weathering of common rocks, minerals and soils, and salt deposits and brines) to anthropogenic sources (landfills, wastewater and water treatment, agriculture, and application of deicing salts) [27].

A Bayesian isotope mixing model was used to estimate the proportional contributions of multiple nitrate sources in surface water in Belgium. The results showed that "manure and sewage" contributed the most, "soil N" and "NO3− fertilizer and rain" contributed a middle share, and "NO3− fertilizer" and "NH4+ in precipitation" contributed the least [26].

*3.1.2 Contamination sources of ground water*

In southern India, the potential TE sources of ground water were analyzed, and it was concluded that Fe and Mn were of natural origin, while Cr, Cu, Pb, and Ni may come from mixed sources, i.e., natural flow contaminated with fertilizers and pesticides. In another study, in northern India, the sources in ground water were identified as anthropogenic via agrochemical and industrial wastes (As, Cd, Co, Pb, and V), parent material from an adjacent area (U and Sr), lithogenic origin (Fe, Mn, Zn), and background-level elements (Mo and Se), respectively [36].

In Greece, Matiatos et al. [28–30] investigated surface water and ground water by combining geochemical, isotopic, and multivariate statistical methods, such as PCA and a Bayesian isotope mixing model. By the PCA analytical result, EC,
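A minimal sketch of the PCA step that underlies several of the apportionment studies discussed in this chapter, assuming synthetic concentration data for five elements (the element groupings here are illustrative, not taken from the cited works):

```python
# Sketch of PCA-based source apportionment: standardize the element
# concentrations, extract principal components, and read the loadings to see
# which elements group together. Elements with large loadings on the same PC
# are attributed to a common source. Synthetic, purely illustrative data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 200
geogenic = rng.normal(size=n)       # latent natural factor
anthropogenic = rng.normal(size=n)  # latent pollution factor
X = np.column_stack([
    anthropogenic + 0.1 * rng.normal(size=n),  # "As"
    anthropogenic + 0.1 * rng.normal(size=n),  # "Cd"
    anthropogenic + 0.1 * rng.normal(size=n),  # "Pb"
    geogenic + 0.1 * rng.normal(size=n),       # "Fe"
    geogenic + 0.1 * rng.normal(size=n),       # "Mn"
])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
loadings = pca.components_  # rows: PCs; columns: As, Cd, Pb, Fe, Mn
print(np.round(loadings, 2))
```

On this synthetic data, PC1 loads strongly on the three correlated "anthropogenic" elements and PC2 on the two "geogenic" ones; in a real study, the interpretation of each component would rest on the known geochemistry of the study area.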
