**2. Applications of multivariate analysis in geochemistry**

Geochemical problems typically involve a sample-parameter matrix in which the parameters and samples show patterns of co-occurrence. Identifying these patterns with traditional geochemical techniques is cumbersome and difficult. With the development of artificial intelligence and machine learning, such multivariate problems can be solved, or the data mined, to discover knowledge or criteria. Multivariate analysis methods are therefore well suited to problems in geochemistry. These methods can be classified as supervised, unsupervised, or semi-supervised, depending on whether the target parameters are labeled. Unsupervised algorithms include principal component analysis (PCA), factor analysis (FA), clustering analysis (CA), and positive matrix factorization (PMF), while supervised algorithms include linear regression, logistic regression, support vector machine (SVM), decision tree (DT), random forest (RF), artificial neural network (ANN), and discriminant analysis (DA).

When the target parameter can be labeled, a supervised machine learning algorithm should be preferred, because it is expected to give more accurate and stable models. In the USA, one study sought to identify the sources of salt ions (Mg, Cl, and Na). Because the samples were collected from known sites or environments (oceans, atmospheric deposition, weathering of common rocks, minerals and soils, salt deposits and brines, landfills, wastewater and water treatment, and agriculture), the samples could be labeled. Therefore, discriminant analysis and clustering analysis were applied [35]. In Belgium, a Bayesian isotope mixing model was used to estimate the proportional contributions of multiple nitrate sources in surface water [36]. In coal mines, water inrush constantly threatens production and human health and causes financial losses. Source apportionment is used in coal mines to determine the source of water inrush [37]. The water inrushes could be categorized into four sources: the Quaternary sand-gravel pore aquifer, the Dyas sandstone aquifer, the Ordovician and Carboniferous limestone aquifer, and abandoned coal mine districts. Different sources show different features and require suitable treatment strategies. To set up the discriminant model, a geochemical and data mining analytical protocol should be established. Because the samples were collected from identified aquifers, a supervised machine learning method could be used. Huang et al. [37] proposed the Piper-PCA-Bayes-LOOCV discrimination model to determine water inrush types in coal mines. The Piper diagram is a geochemical technique for displaying water characteristics, and it was used in this research to screen out abnormal samples/points. PCA was used to reduce the dimension of the sample matrix, so that fewer variables represent all of the original variables. Then the supervised ML model, Bayes DA, was trained and applied to discriminate the water source. LOOCV stands for leave-one-out cross-validation, which was used to validate and improve the quality of the model. Wang et al. used discriminant analysis to determine water bursting sources in coal mines [38].
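
The PCA, Bayes DA, and LOOCV steps of such a workflow can be illustrated with a minimal sketch. It is not the implementation of Huang et al. [37]: the Piper-diagram screening step is omitted, the Bayes discriminant is assumed to be Gaussian with a pooled covariance matrix, and the ion concentrations, the four source labels, and all function names are hypothetical, synthetic stand-ins.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return mean, scale, and leading principal axes of X (rows = samples)."""
    mean = X.mean(axis=0)
    scale = X.std(axis=0, ddof=1)
    Z = (X - mean) / scale
    # Rows of Vt are the principal axes of the standardized data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, scale, Vt[:n_components]

def pca_transform(X, mean, scale, axes):
    """Project samples onto the retained principal axes."""
    return ((X - mean) / scale) @ axes.T

def bayes_fit(S, y):
    """Gaussian Bayes discriminant with a pooled covariance (an assumption)."""
    classes = np.unique(y)
    means = np.array([S[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    centered = S - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(S) - len(classes))
    return classes, means, priors, np.linalg.inv(cov)

def bayes_predict(S, model):
    """Assign each sample to the class with the highest posterior score."""
    classes, means, priors, cov_inv = model
    diff = S[:, None, :] - means[None, :, :]
    maha = np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)
    scores = np.log(priors)[None, :] - 0.5 * maha
    return classes[np.argmax(scores, axis=1)]

def loocv_accuracy(X, y, n_components=3):
    """Leave one sample out, refit PCA and the classifier, predict it."""
    hits = 0
    for i in range(len(X)):
        train = np.arange(len(X)) != i
        mean, scale, axes = pca_fit(X[train], n_components)
        scores = pca_transform(X[train], mean, scale, axes)
        model = bayes_fit(scores, y[train])
        held_out = pca_transform(X[i:i + 1], mean, scale, axes)
        hits += int(bayes_predict(held_out, model)[0] == y[i])
    return hits / len(X)

def make_synthetic_samples(seed=0, n_per_source=15):
    """Hypothetical ion concentrations (mg/L) for four labeled water sources."""
    rng = np.random.default_rng(seed)
    # Columns (illustrative only): Na+K, Ca, Mg, Cl, SO4, HCO3
    centers = np.array([
        [50.0, 20.0, 10.0, 30.0, 40.0, 200.0],
        [300.0, 30.0, 15.0, 150.0, 400.0, 250.0],
        [80.0, 120.0, 40.0, 40.0, 100.0, 450.0],
        [200.0, 200.0, 80.0, 100.0, 900.0, 150.0],
    ])
    y = np.repeat(np.arange(len(centers)), n_per_source)
    noise = 1.0 + 0.08 * rng.standard_normal((len(y), centers.shape[1]))
    return centers[y] * noise, y

if __name__ == "__main__":
    X, y = make_synthetic_samples()
    print("LOOCV accuracy:", loocv_accuracy(X, y, n_components=3))
```

In this sketch the PCA is refitted inside every cross-validation fold, so the held-out sample never influences the projection used to classify it. Whether the original study did the same is not stated in the text above.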

Compared with supervised ML methods, unsupervised ML algorithms are used more frequently, because samples are not always labeled. Pumure et al. [39] investigated the occurrence of selenium and arsenic in coal using two-step PCA, finding that ultrasound leachable selenium concentrations were associated with 14 Å d-spacing phyllosilicate clays (chlorite, montmorillonite, and vermiculite, all 2:1 layered clays), while ultrasound leachable arsenic concentrations were closely related to the concentration of illite, another 2:1 phyllosilicate clay. The PCA and PMF methods are often used to identify the sources of trace elements. For example, lake sediment in southwest China was analyzed [40] using PCA, which showed that Cd, Hg, Pb, Zn, and As were mainly from nonpoint anthropogenic sources, especially atmospheric emissions from nonferrous metal smelting and coal consumption [41]. In Costa Rica, PMF was used to identify eight important sources of PM2.5 and PM10. Vehicle exhaust, residual oil combustion, and fresh sea salt were the three leading sources. Crustal or dust-derived aerosols, organic carbon and sulfate, secondary sulfate, secondary nitrate, and heavy fuels were the other potential sources [42]. In Pakistan, factor analysis was used to identify sources of surface soil contamination. It was found that Ni, Cr, Zn, and Cu originated from industrial activities, while vehicular emissions and other anthropogenic activities such as automobile use contributed Pb, Cd, and Co; other important contaminants, including Fe and Mn, were of natural origin [43]. In Turkey, PCA was used to find the latent factors that influence water quality; mineral pollution, nutrient pollution, and organic pollution were identified as the major factors.
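
The idea behind PMF-style source apportionment can be sketched as a non-negative factorization of the sample-by-species matrix X into source contributions G and source profiles F, so that X ≈ G F. The sketch below is a simplification and not the PMF algorithm used in the cited studies: PMF weights each residual by its measurement uncertainty, whereas this example uses plain unweighted multiplicative updates. The data are synthetic and every name (`nmf`, `make_synthetic_mixture`, the number of sources) is an assumption for illustration.

```python
import numpy as np

def nmf(X, n_sources, n_iter=2000, seed=0, eps=1e-12):
    """Factor non-negative X (samples x species) as G @ F.

    Uses multiplicative updates for the unweighted squared error,
    which keep G and F non-negative at every iteration.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_species = X.shape
    G = rng.random((n_samples, n_sources)) + 0.1  # source contributions
    F = rng.random((n_sources, n_species)) + 0.1  # source profiles
    for _ in range(n_iter):
        F *= (G.T @ X) / (G.T @ G @ F + eps)
        G *= (X @ F.T) / (G @ F @ F.T + eps)
    return G, F

def relative_error(X, G, F):
    """Frobenius norm of the residual relative to the norm of X."""
    return np.linalg.norm(X - G @ F) / np.linalg.norm(X)

def make_synthetic_mixture(seed=1, n_samples=40, n_species=8, n_sources=3):
    """Hypothetical data: each sample is a non-negative mix of three profiles."""
    rng = np.random.default_rng(seed)
    profiles = rng.random((n_sources, n_species))
    contributions = rng.random((n_samples, n_sources)) * 10.0
    return contributions @ profiles

if __name__ == "__main__":
    X = make_synthetic_mixture()
    G, F = nmf(X, n_sources=3)
    print("relative reconstruction error:", relative_error(X, G, F))
    # Normalize each profile so the species fractions of a source sum to 1.
    print(np.round(F / F.sum(axis=1, keepdims=True), 3))
```

In practice, the rows of F would be compared with known emission profiles to name each factor, for example vehicle exhaust or sea salt. The number of sources is a modeling choice, and the factorization is generally not unique.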
