**2. Methods for data mining**

To investigate the concentration of trace elements in time series and in spatial distribution, together with their migration sources and reaction pathways, data mining techniques are used. In a narrow sense, data mining refers to the use of multivariate analysis and machine learning methods to find distribution or change patterns in large data sets. In a broader sense, it may also include other techniques, such as geochemical, isotopic, and univariate analysis. This chapter emphasizes multivariate analysis and machine learning because of their increasing application and their effectiveness in source apportionment and reaction path analysis.
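As a minimal sketch of how multivariate analysis can point to a shared source, the following Python example applies principal component analysis (PCA) to a small table of concentrations. The concentration values, the choice of elements, and all function names are invented for illustration and are not taken from any study cited in this chapter. Only the first component is extracted, by power iteration on the correlation matrix, so that the example needs nothing beyond the standard library.

```python
import math

# Hypothetical concentrations, one row per sample (values are invented).
ELEMENTS = ["Cd", "Pb", "Zn", "As"]
SAMPLES = [
    [0.12, 2.3, 5.5, 2.0],
    [0.19, 3.8, 9.6, 5.0],
    [0.33, 6.1, 15.2, 3.0],
    [0.38, 8.4, 20.9, 6.0],
    [0.52, 9.7, 24.3, 6.0],
    [0.61, 12.2, 30.8, 3.0],
    [0.68, 14.1, 34.6, 5.0],
    [0.83, 15.8, 40.4, 2.0],
]


def standardize(rows):
    """Z-score every column so that elements with large units do not dominate."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def correlation_matrix(z_rows):
    """Correlation matrix of already standardized data."""
    n, p = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in prod))
        vec = [x / norm for x in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vec, prod))
    return vec, eigenvalue


corr = correlation_matrix(standardize(SAMPLES))
weights, eigenvalue = leading_component(corr)
# The trace of a correlation matrix equals the number of variables.
explained = eigenvalue / len(ELEMENTS)

for name, weight in zip(ELEMENTS, weights):
    print(f"{name}: coefficient on PC1 = {weight:+.2f}")
print(f"PC1 explains {explained:.0%} of the total variance")
```

Elements with large coefficients on the same component co-vary across samples, which is commonly read as a hint of a common source. In this synthetic table Cd, Pb, and Zn group together while As does not. With real data, a library routine would normally replace the hand-written power iteration, and more than one component would be inspected.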

*Data Mining for Source Apportionment of Trace Elements in Water and Solid Matrix*
*DOI: http://dx.doi.org/10.5772/intechopen.88818*

Table 1 has 24 rows and five columns. The extracted layout preserves the entries of each column but not their row alignment, so the entries are listed column by column in extraction order.

**Environmental media:** Water (surface) ×4; Water (surface, ground) ×2; Water (ground) ×8; Water/soil; Sediment (river) ×3; Sediment (lake) ×2; Soil ×3; Soil (peat)

**Country/region:** Ethiopia; Turkey; Belgium; USA; Greece; Spain; China; China; China; India; India; Greece; USA; China; Nigeria; China ×8; Spain

**Anthropogenic source TEs:** N; Cr, Br, Cl, N, S; -; Cd, Cr, Pb, Ni, V; Ni, Hg, Cr, Cu, Cd, Pb, Zn, As; Cd, Zn; Cr, Cd, Pb, Hg; As, Cd, Hg, Pb, Zn; Cr, Pb, Zn, Cu, Co; Hg, Cr, Ni, Ba; Cu, Zn, Cd, Hg; Not found; Cd, Pb, P, Zn; -; -; -; Se, As, Hg, Cr, Pb; Pb, Cu, Cr; As, Cd, Co, Pb, V; N; Mg, Cl, Na; EC, Cl, SO4, Mg, Ca, Na, K, As, Fe, B, Br, Sr, V (sea water intrusion); Cu, K; Not found

**Data mining method:** PCA/FA/CA/DA; PCA/HCA; Bayesian network; CA/DA; PCA/DA; PCA/regression; PCA/DA; DA; PCA; PCA/CA; PCA/CA; Bayesian network; Semi-supervised ML; Decision tree/CA; PCA/CA; PCA/DA/Monte Carlo; PCA; PCA, EF; PCA/EF; PCA; PCA/CA; PCA; PCA/CA; PCA/CA

**References:** [39]; [40]; [41]; [42]; [43]; [44]; [45]; [46]; [47]; [37]; [38]; [12]; [24]; [25]; [26]; [27]; [28–30]; [31]; [32]; [33]; [34]; [35]; [36]; [29]

**Table 1** lists applications of data mining methods to trace element migration. In the table, PCA stands for principal component analysis.


*Trace Metals in the Environment - New Approaches and Recent Advances*

address issues relating to their behavior and mechanisms among environmental bodies.

TEs are widely studied in the areas of water, rock, and coal geochemistry, leaching and mobility potential, bioaccumulation and human health risk, survey technologies, and other related topics [1, 2]. The harm of TEs to human health depends on the amount. Some TEs are essential to humans within a certain concentration range but become toxic as the concentration rises. Some toxic TEs may cause acute and chronic effects even at very low content. In light of their toxicity, the trace elements lead (Pb), zinc (Zn), copper (Cu), nickel (Ni), chromium (Cr), cadmium (Cd), arsenic (As), selenium (Se), and mercury (Hg) are the most investigated and regulated. For example, small amounts of lead in the body can make it difficult for children to learn, pay attention, and succeed in school. Lead accounts for most cases of pediatric heavy metal poisoning. Arsenic is the most common cause of acute heavy metal poisoning in adults and does not leave the body once it enters. Mercury exposure puts newborns at risk of neurological deficits and increases cardiovascular risk in adults.

TEs may be released from lithogenic or anthropogenic sources [3, 4]. With industrialization and urbanization, releases of TEs from anthropogenic sources are increasing, including discharge of industrial and municipal wastes, storms, run-off, dry deposition, mine discharge, waste incineration, application of pesticides and fertilizers, sewage irrigation and transportation, and other diffuse sources [1, 5–11]. Environmental media, including water [10, 12–14], sediment, soil [3, 4, 15–23], and air particles [15], can be contaminated.

To understand and control trace element pollution, source identification and quantification of TEs in water, sediment, soil, and particles are of great importance. The traditional techniques are mostly based on geochemical methods. Statistical methods based on univariate analysis are also used. However, univariate analysis is cumbersome, and its results are sometimes hard to explain. Multivariate analysis provides a new set of techniques for TE source apportionment. Multivariate analysis and the related methods of machine learning and data mining have proved successful in many aspects of daily life and industrial production. In geochemistry and environmental engineering, these methods are also increasingly used.

In this chapter, two related topics are reviewed and discussed: first, advances in multivariate analysis for source apportionment, in particular several kinds of multivariate analytical methods; second, the understanding of the contaminating origin of TEs in important environmental media, namely ground and surface water, river and lake sediment, soil, precipitated dust, and suspended particulate matter (PM 2.5 and PM 10).


**Table 1.** *A list of data mining methods applied to trace element migration in recent years.* CA stands for cluster analysis, ANN for artificial neural network, FA for factor analysis, MLR for multiple linear regression, DA for discriminant analysis, EF for enrichment factor, and PMF for positive matrix factorization.

**2.1 Geochemical methods**

To analyze geochemical properties and reaction mechanisms, mass balance, Piper diagrams, and Gibbs diagrams [28] are usually applied. The Piper diagram shows the major element composition of water and categorizes water into different types. Several software packages, such as PHREEQC, MINTEQ, and The Geochemist's Workbench, can be used to calculate mass balance and saturation index, model the reaction path, draw Piper diagrams, etc. Through water-rock interaction, major elements and TEs may be released and migrate to other water bodies; therefore, major and trace elements with distinguishing features can be used for source apportionment [69, 70]. However, TEs undergo the geochemical processes of adsorption, desorption, mineralization, and dissolution, which change their concentration in water. Therefore, univariate analysis is not credible or robust. By comparison, isotopic analysis, both stable and radiogenic [30], may be used as a univariate method or in combination with other indexes. The widely used isotopic methods are δ18O and δD in water, 87Sr/86Sr, δ34S and δ18O in sulfate [71, 72], δ15N and δ18O in nitrate [29], etc. The δ18O and δD of water are used to identify the relations between precipitation and surface/ground water. The strontium isotope ratio of water is strictly controlled by water-rock interaction. For water from a single source, the 87Sr/86Sr ratio reflects the mineralogy of the rocks with which the water has been in contact and does not change along the flow path. Differences in the strontium isotope ratio and strontium concentration are caused by the mixing of waters of various origins with specific chemical characteristics and isotopic values. Therefore, the strontium isotope is an ideal tracer for element sources, groundwater movement, and water-rock interaction [70, 73–75].

The use of δ34S and δ18O in sulfate is increasing because they have a wide range of stable isotope compositions, and the δ34S value is derived from multiple sources and is very close to that of the precursor sulfide mineral. A common anthropogenic source of sulfate is coal and metal mining, where the rocks are rich in pyrite and other sulfide minerals. In active and abandoned coal mines, sulfide minerals may be oxidized and dissolved, releasing trace elements. Sulfate isotopes have shown that groundwater can be contaminated through water-rock interaction in coal mines. Besides natural isotopes, tracers such as isotopes and stable organic compounds are injected into groundwater to find the flow pathway [71, 72, 76, 77].

To evaluate TE contamination and find the source of pollution in water and solid matrices, some calculations are used. The enrichment factor (*EF*) expresses the enrichment level of a given TE in the environment, as shown in Eq. (1):

$$EF = \frac{\left(c_i / c_{ref}\right)_{samples}}{\left(B_i / B_{ref}\right)_{baseline}} \tag{1}$$

where $c_i$ is the measured concentration of the TE in the samples, $c_{ref}$ is the measured concentration of the reference element, and $B_i$ and $B_{ref}$ are the background levels of the TE and of the reference element in the same local region [41]. An *EF* value close to 1 suggests a weathering origin of the trace element, while a value higher than 1 means TE enrichment in soil, which is probably caused by human activities. An *EF* value between 2 and 5 indicates moderate contamination, and a value higher than 5
