**5. Statistical methods for honey classification**

point out that δ<sup>13</sup>C values vary with time, location, pollen content, but there is a levelling effect characteristic to the system itself. Honey is collected from more than one colony, over a period of several weeks. As the season starts, honeybees are fed with syrups, so there is high chance that the honey produced reflects the syrup isotopic ratio. Since hive population is renewed every 3–4 weeks, newer generations feed on the previously collected honey, so the adulterat-

The stable isotopic ratio methods for adulteration with C4 sugars is expensive in terms of time, consumables, personnel, and equipment, so the efforts of Puscas et al. [55] in developing a simple and reproducible high-performance thin-layer chromatographic method are welcome. It has been tested on some Romanian honey samples, being based on the F/G ratio and sucrose content evaluation. Using a suitable composition of ethyl acetate : pyridine : water : acetic acid, 6:3:1:0.5 volume ratios, high-performance thin-layer chromatographic aluminium silica gel sheets, a chromatographic twin through chamber, a dipping acetone solution of diphenylamine and aniline hydrochloride, and a visible light TLC visualization device, the authors have managed to validate the proposed procedure for the determination of the glucose, fructose, and sucrose levels. The newly validated method has given trustworthy results during the analysis of 15 Romanian acacia, linden, and polyfloral honey samples harvested by five individual producers. Almost half of the investigated samples have been declared adulterated with fructose from other sources than the natural ones. As F/G is 0.88, a polyfloral sample is declared adulterated with industrial glucose. When determined sucrose levels run above the admitted limit, there is an indication of adulteration by honeybees feeding with sucrose syrup. The acacia honey samples present a higher fructose/glucose ratio than the admitted value, effect of some producers' initiative to improve sensory properties by fructose addition (acacia honey being not too sweet). EC regulation 470/2009 [21] states that honey should be free from antibiotics residues, serious health hazard agents. Antibiotics are generally used for the treatment of bacterial brood diseases produced by *Paenibacillus larvae*, known as American foulbrood (AFB). Even if they are effective only against the hives infestation with AFB, many beekeepers, the Romanians included, practice preventive antibiotics usage. Streptomycin, often used in veterinary medicine, opens up the human organism to deafness and kidney failure at higher concentrations, causing allergies, destroying intestinal flora, and inducing resistance of certain microorganisms at lower concentrations. So there is a multitude of antibiotics screening tests and confirmatory methods. High-performance liquid chromatography with post-column derivatization and fluorescence detection (HPLC-FD) is one of the most versatile and reliable methods in antibiotics residues analysis. Equally effective are the immunochemical assay kits based on antigen-antibody interactions to detect a large variety of antibiotics. The lower rate of false-negative samples, short analysis time, simple operating procedures, good selectivity, low costs are counterbalanced by the possibility to identify and quantify a single target analyte. Cara et al. [56] have used an enzyme-linked immunosorbent assay (ELISA) test kit for streptomycin to determine the antibiotic loadings in acacia, linden, and polyfloral honey samples collected from the Romanian market and get more information on the kinetic law governing the contaminant degradation on storage in the dark and different temperatures. The method has been validated (in terms of repeatability, recovery, precision, specificity, and variation coefficient), and cross-validated by high-performance liquid chromatography with post-column derivatization and fluorescence

ing effect of the syrup on the protein δ<sup>13</sup>C value will quickly decrease.

44 Honey Analysis

As mentioned before, Romania is one of the most important honey suppliers for the national and the European honey market. The quality regulation imposed for foodstuff, honey included, often requires highly specializes investigation techniques. As beekeepers are generally spread all over the country, the botanic origin is initially recorded according to the beekeepers' declaration. Therefore, it is of great interest to find an affordable method for honey classification, based on currently measured physico-chemical properties, to confirm the declared botanic source. In this attempt, a thorough statistical study of honey properties variability is necessary. The European Union issued regulations concerning the general and specific characteristics important in assessing authenticity: moisture, sugar content (fructose, glucose, and sucrose), free acidity, diastase activity, and HMF content. These parameters are relatively simple to measure and provide a good information value.

Chemometric methods (also known as multivariate statistical technique) allow identification of the natural clustering pattern and group variables based on similarities between samples. Their application aid in reducing the complexity of large data sets, and offer better interpretation and understanding of the data sets. In the last years, several chemometric techniques, such as principal component analysis and linear discriminant analysis were used for classification of various foodstuffs [57–60]. Principal component analysis is a multivariate technique, usually at the introductory level, permitting to reduce the dimensionality of multivariate data and to provide a preview of the data structure. It belongs to the group of so-called unsupervised pattern recognition techniques, where no assumption upon possible data clustering is considered. Linear discriminant analysis falls into the group of supervised pattern recognition techniques, and classes are assumed from the beginning. Discrimination relies on finding new co-ordinates where the original data can be projected in such a way to maximize the between-group variance with respect to within-group variance. Linear discriminant analysis results may be further used at building a classification model that could later predict the class of unknowns. Artificial neural networks, designed and trained for pattern recognition, are also used to create a tool that may be used for the identification of a given unknown honey type. The efficiency of the employed statistical tools was defined in terms of their capability to classify a large set of honey samples according to their botanic origin.

#### **5.1. Case study: experimental data**

A significant data sample of four honey types (acacia, polyfloral, linden, and colza) was collected between 2014 and 2016 and the main physico-chemical characteristics were measured: HMF, acidity, diastase index, water content, inverted sugar, and sucrose. For each honey type, 90 samples (30 samples/year) were considered in the analysis, in total 360 data sets. The unifloral and polyfloral samples were delivered, received, and transferred to the laboratory in their original packages and kept at 20°C before analysis. Information on the botanical origin of the samples was provided by the beekeepers and later validated by pollen spectrum. Aliquots were homogenized by mixing with a glass rod, filtered through cheesecloth, and left to stand until complete clarification, in order to eliminate the incorporated air, as recommended in SR 784-3:2009 [13]. Physico-chemical parameters were analysed according to the national standard [13], as presented in the literature [60]. **Table 1** presents the means and ranges for all measured characteristics.

According to data recorded in **Table 1**, some general features can be underlined in accordance with general European Union regulations issued on the specific honey characteristics important in assessing authenticity and quality. Moisture is considered one of the basic parameters in evaluating the honey quality. According to Council Directive 2001/110/EC and Revised Codex Standard for Honey, water content may not be greater than 20%. As seen in **Table 1**, all honey types in the data set fulfil the quality requirements. The HMF content is indicative of honey freshness and/



**Table 1.** Ranges of experimental values for honey physico-chemical characteristics.

or overheating. The HMF content should not exceed 4 mg/100 g honey, but in some countries, as Germany or Romania, the maximum admitted value is lower, 1.5 mg HMF/100 g being the limit for unifloral honey samples. There are only about 5–8% individual samples in each honey type characterized by HMF values higher than 1.5 mg/100 g, thus raising possible freshness questions. The diastase activity is also indicative of freshness and is above 17 in all honey samples. Both HMF and diastase activity values determined are typical for unprocessed honey. The free acidity also varied among the four honey types investigated, but in all samples the acidity is below 4 mL NaOH solution, which is the upper limit admitted. Sugars practically consist of inverted sugar and sucrose. SR EN 784/2:2009 [12] regulates the minimum allowed inverted sugar to 70% in the flower honey. As for sucrose, the standard sets the limits to maximum 5%. All samples involved in the present study fulfil the inverted sugar and sucrose requirements (**Table 1**).

#### **5.2. Case study: statistical analyses**

HMF, acidity, diastase index, water content, inverted sugar, and sucrose. For each honey type, 90 samples (30 samples/year) were considered in the analysis, in total 360 data sets. The unifloral and polyfloral samples were delivered, received, and transferred to the laboratory in their original packages and kept at 20°C before analysis. Information on the botanical origin of the samples was provided by the beekeepers and later validated by pollen spectrum. Aliquots were homogenized by mixing with a glass rod, filtered through cheesecloth, and left to stand until complete clarification, in order to eliminate the incorporated air, as recommended in SR 784-3:2009 [13]. Physico-chemical parameters were analysed according to the national standard [13], as presented in the literature [60]. **Table 1** presents the means and

According to data recorded in **Table 1**, some general features can be underlined in accordance with general European Union regulations issued on the specific honey characteristics important in assessing authenticity and quality. Moisture is considered one of the basic parameters in evaluating the honey quality. According to Council Directive 2001/110/EC and Revised Codex Standard for Honey, water content may not be greater than 20%. As seen in **Table 1**, all honey types in the data set fulfil the quality requirements. The HMF content is indicative of honey freshness and/

**g honey**

2015 Max 19.2 1.86 38.5 80.27 2.88 2.3

2016 Max 19.6 2.37 38.5 79.2 2.68 2.4

2015 Max 18.7 2.53 23.8 74.73 4.96 1.7

2016 Max 20 3.11 23.8 75.73 4.95 1.9

Colza 2014 Max 19.8 1.76 38.5 80 3.1 2.2

Acacia 2014 Max 18.6 4.4 23.8 75 4.75 1.9

**Diastatic index**

Min 17 0.11 17.9 75.5 1.15 1.2 Average 18.05 0.61 25.51 77.68 2.13 1.75

Min 17 0.19 17.9 76 1.17 1.3 average 17.96 0.76 27.05 78.12 2.00 1.75

Min 17.2 0.05 17.9 75.73 1.42 1.2 Average 18.16 0.89 27.91 77.38 1.98 1.79

Min 15.3 0.19 13.8 70 2.17 0.8 Average 16.83 0.79 18.67 72.89 3.20 1.16

Min 14.6 0.01 10.9 70.29 2.05 1 Average 16.27 0.62 17.31 73.08 3.68 1.27

Min 14 0.09 10.9 70.55 1.67 0.9 Average 16.77 0.65 17.22 73.50 3.76 1.23

**Inverted sugar, %**

**Sucrose,% Acidity mL** 

**1N NaOH/100 g honey**

ranges for all measured characteristics.

46 Honey Analysis

**Honey type Year Range Water,% HMF mg/100** 

In the first stage of statistical analysis, the measured data were investigated using descriptive statistic tools and one-way analysis of variance (ANOVA) factor analysis. A first attempt was to investigate whether the year of collection can be considered a factor that influences the honey physico-chemical properties or not. A one-way ANOVA test was performed for each honey type, results being summarized in **Table 2**.

As data in **Table 2** show, the honey characteristic properties are not influenced by the year of collection. An exception is the influence upon the inverted sugar content in colza, linden, and polyfloral honey, and upon the HMF in the linden honey. As the time period Investigated was rather short, and climatic condition were similar, the ANOVA results obtained, considering the collection year a possible influencing factor, are not unexpected.

For further statistical analysis, the data collected for each honey type in the 3 years mentioned were lumped together. Descriptive statistics tools were further used for univariate distribution analysis of each honey group. The mean, variance, skewness, and kurtosis were calculated from the data samples to evaluate the lack of symmetry and the flatness in the experimental data sets (**Table 3**).

As it can be noticed, the univariate distributions for all six characteristics can be considered normal for all honey types as, according to a rule of thumb generally accepted, the skewness and kurtosis are mainly in the −1 to +1 range, with few values outside this range, but still between −2 and 2 [61]. Only the HMF distribution for acacia and polyfloral honey is an exception to this


**Table 2.** One-way ANOVA results considering as factor the honey collection year.


**Table 3.** Descriptive statistic estimations for the honey types investigated.

to investigate whether the year of collection can be considered a factor that influences the honey physico-chemical properties or not. A one-way ANOVA test was performed for each

As data in **Table 2** show, the honey characteristic properties are not influenced by the year of collection. An exception is the influence upon the inverted sugar content in colza, linden, and polyfloral honey, and upon the HMF in the linden honey. As the time period Investigated was rather short, and climatic condition were similar, the ANOVA results obtained, considering

For further statistical analysis, the data collected for each honey type in the 3 years mentioned were lumped together. Descriptive statistics tools were further used for univariate distribution analysis of each honey group. The mean, variance, skewness, and kurtosis were calculated from the data samples to evaluate the lack of symmetry and the flatness in the

As it can be noticed, the univariate distributions for all six characteristics can be considered normal for all honey types as, according to a rule of thumb generally accepted, the skewness and kurtosis are mainly in the −1 to +1 range, with few values outside this range, but still between −2 and 2 [61]. Only the HMF distribution for acacia and polyfloral honey is an exception to this

> **Diastatic index**

*F*crit 3.10 3.10 3.10 3.10 3.10 3.10 *p* Value 0.35 0.023 0.28 0.078 0.90 0.56 Relevance No Yes No No No No

*F*crit 3.10 3.10 3.10 3.10 3.10 3.10 *p* Value 0.055 0.22 0.25 0.69 0.11 0.09 Relevance No No No No No No

*F*crit 3.10 3.10 3.10 3.10 3.10 3.10 *p* Value 0.055 0.007 0.67 0.005 0.90 0.28 Relevance No Yes No Yes No No

*F*crit 3.10 3.10 3.10 3.10 3.10 3.10 *p* Value 0.41 0.0008 0.81 0.56 0.42 0.45 Relevance No Yes No No No No

**HMF Acidity Water**

honey type, results being summarized in **Table 2**.

experimental data sets (**Table 3**).

48 Honey Analysis

**Honey type Sucrose Inverted** 

the collection year a possible influencing factor, are not unexpected.

**sugars**

Colza *F*test 1.05 3.91 1.23 2.61 0.90 0.57

Acacia *F*test 2.98 1.52 1.39 0.36 2.19 2.43

Linden *F*test 2.98 5.25 0.38 5.56 0.2 1.18

Polyfloral *F*test 0.87 7.51 0.21 0.57 0.86 0.79

**Table 2.** One-way ANOVA results considering as factor the honey collection year.

pattern. The higher positive skewness of the HMF distribution is caused by some honey samples (approximately 10 out of 90 samples) with higher content (between 2 and 4.9 mg/100 g honey).

To estimate the botanical origin influence upon the main measured characteristics, the oneway ANOVA was performed in the frame of EXCEL software. The factor considered in the analysis was the honey type. The tests were carried at a significance level of 0.05. The results are presented in **Table 4**. Results show that honey type is a factor with statistic significance in the variation of honey physico-chemical properties. Starting from this consideration, multivariate statistical analysis is expected to give more insight concerning the possibility of honey type classification using a complex mathematical treatment of all measured variables.


**Table 4.** One-way ANOVA considering as factor the honey type.

Principal component analysis, as unsupervised method, is generally first performed as it can lead to a data reduction and highlight the measured characteristics most responsible for data variability. As the original variables have different units, the dimensionless standardized data matrix was used in principal component analysis. All computing tasks were implemented in Matlab® [62]. Principal component analysis practically defines an orthogonal linear transformation of the original data set into a new set of coordinates, named principal components. The first PC encompasses the largest data variability, the second PC the second largest variance, and so on. According to principal component analysis, the first eigenvectors of the covariance matrix correspond to the 'directions' of highest variability in the data set. The first three eigenvalues are larger than 1 for the data investigated, meaning that the first three PCs explain more variability in the data set than the variables themselves. The first three principal components considered explain almost 70% of the variability (PC1 reflects 32.1%, PC2 20.7%, and PC3 15.8%) as represented by the Pareto plot (**Figure 8**).

The bi-plot representation (**Figure 9**) simultaneously shows the variables represented as vectors and the points corresponding to all samples in the data set projected in the PC1-PC2 space. The coordinates of each variable are proportional to its contribution (loading) in PC1 and PC2. The samples are displayed as points normalized in [−1, 1] interval, thus only the relative position in the graphical representation is relevant. The bi-plot allows visualization of the magnitude and sign of each variable contribution in the first two PCs. For instance, sucrose and inverted sugar have opposite signs loading, indicating that PC1 distinguishes between samples with low sucrose content and high inverted sugar content, and vice versa. As **Figure 9** shows, the loadings in the first PC have high values for sucrose and inverted sugar (about 0.6), signalling that these two variables account for the most variability in the data set. HMF and water content have very small loadings in PC1, but quite high ones in PC2, revealing a smaller contribution in samples variability.

In order to visualize a possible data clustering, the projection of samples in the first two principal components space is presented of **Figure 10**, for the data samples in the four honey types. The ellipses cover about 95% of each honey type population. As **Figure 10** shows, acacia and colza honey are clearly separated on PC1 direction, where sucrose and diastase activity present the highest loadings. These two characteristics are able to differentiate between these two botanic origins. Polyfloral honey is somehow separated from acacia and colza honey on PC2 direction, meaning that the water and HMF are responsible for

**Figure 8.** Principal component contribution in the data variability.

Principal component analysis, as unsupervised method, is generally first performed as it can lead to a data reduction and highlight the measured characteristics most responsible for data variability. As the original variables have different units, the dimensionless standardized data matrix was used in principal component analysis. All computing tasks were implemented in Matlab® [62]. Principal component analysis practically defines an orthogonal linear transformation of the original data set into a new set of coordinates, named principal components. The first PC encompasses the largest data variability, the second PC the second largest variance, and so on. According to principal component analysis, the first eigenvectors of the covariance matrix correspond to the 'directions' of highest variability in the data set. The first three eigenvalues are larger than 1 for the data investigated, meaning that the first three PCs explain more variability in the data set than the variables themselves. The first three principal components considered explain almost 70% of the variability (PC1 reflects 32.1%, PC2 20.7%, and PC3 15.8%) as represented by the Pareto

*F* test value 58.20 132.23 68.64 7.33 45.71 38.88 *F* critical value 2.63 2.63 2.63 2.63 2.63 2.63 *p* value 1.2E-30 1.5E-57 4.8E-35 8.8E-05 1.4E-19 9.4E-22 Relevance Yes Yes Yes Yes Yes Yes

**Sugar Inverted sugars Diastatic index HMF Acidity Water**

The bi-plot representation (**Figure 9**) simultaneously shows the variables represented as vectors and the points corresponding to all samples in the data set projected in the PC1-PC2 space. The coordinates of each variable are proportional to its contribution (loading) in PC1 and PC2. The samples are displayed as points normalized in [−1, 1] interval, thus only the relative position in the graphical representation is relevant. The bi-plot allows visualization of the magnitude and sign of each variable contribution in the first two PCs. For instance, sucrose and inverted sugar have opposite signs loading, indicating that PC1 distinguishes between samples with low sucrose content and high inverted sugar content, and vice versa. As **Figure 9** shows, the loadings in the first PC have high values for sucrose and inverted sugar (about 0.6), signalling that these two variables account for the most variability in the data set. HMF and water content have very small loadings in PC1, but quite high ones in

In order to visualize a possible data clustering, the projection of samples in the first two principal components space is presented of **Figure 10**, for the data samples in the four honey types. The ellipses cover about 95% of each honey type population. As **Figure 10** shows, acacia and colza honey are clearly separated on PC1 direction, where sucrose and diastase activity present the highest loadings. These two characteristics are able to differentiate between these two botanic origins. Polyfloral honey is somehow separated from acacia and colza honey on PC2 direction, meaning that the water and HMF are responsible for

PC2, revealing a smaller contribution in samples variability.

**Table 4.** One-way ANOVA considering as factor the honey type.

plot (**Figure 8**).

**Measured characteristic**

50 Honey Analysis

the differentiation. Principal component analysis could not achieve a good discrimination between the honey types: the polyfloral honey completely overlap linden, and the other honey types also partially overlap as shown in **Figure 10**. **Figure 11** presents the principal component analysis classification capability for the case when only unifloral honey (270 samples) is considered. **Figure 11** shows that the overlapping of acacia, linden, and colza samples is more or less similar to the case previously described (**Figure 10**).

**Figure 9.** Bi-plot representation in the frame of principal component analysis.

**Figure 10.** Data projection of four honey type samples in the principal components space.

As not always the directions of highest data variability are the same with those for better data discrimination, the classification efficiency of Fisher linear discriminant analysis was also investigated. Linear discriminant analysis considers from the beginning the data samples grouped in classes, and projects the data onto a lower-dimensional vector space, such that the ratio of the between-class distance to the within-class distance is maximized, thus attempting to achieve maximum discrimination. The optimal projection is computed by applying the eigendecomposition on the scatter matrices. The method is recommended for large data sets and for the case when the univariate distributions are relatively close to Gaussian repartition, which is the case for the current experimental data set. The discrimination between groups (honey types) is presented in **Figures 12** and **13**. **Figure 12** corresponds to the discrimination of the four honey types that includes the polyfloral honey, while **Figure 13** reflects the linear discriminant analysis classification capacity for unifloral honey.

When comparing the representations in **Figures 10** and **12**, the linear discriminant analysis proves to be a better classification method for the investigated unifloral honey samples. Analysing the samples graphical representation (**Figure 12**), it can be noticed that while

**Figure 11.** Data projection of unifloral honey samples in the PC1-PC2 space.

As not always the directions of highest data variability are the same with those for better data discrimination, the classification efficiency of Fisher linear discriminant analysis was also investigated. Linear discriminant analysis considers from the beginning the data samples grouped in classes, and projects the data onto a lower-dimensional vector space, such that the ratio of the between-class distance to the within-class distance is maximized, thus attempting to achieve maximum discrimination. The optimal projection is computed by applying the eigendecomposition on the scatter matrices. The method is recommended for large data sets and for the case when the univariate distributions are relatively close to Gaussian repartition, which is the case for the current experimental data set. The discrimination between groups (honey types) is presented in **Figures 12** and **13**. **Figure 12** corresponds to the discrimination of the four honey types that includes the polyfloral honey, while **Figure 13** reflects the linear

When comparing the representations in **Figures 10** and **12**, the linear discriminant analysis proves to be a better classification method for the investigated unifloral honey samples. Analysing the samples graphical representation (**Figure 12**), it can be noticed that while

discriminant analysis classification capacity for unifloral honey.

**Figure 10.** Data projection of four honey type samples in the principal components space.

52 Honey Analysis

colza and acacia samples form distinct groups, approximately 30–40% of linden and polyfloral samples are miss-classified. When only unifloral samples are subjected to classification (**Figure 13**), about 25% of the linden samples are represented in the acacia and colza region. Even if better results were obtained compared to principal component analysis, linear discriminant analysis does not seem accurate enough to achieve classification of unifloral honey samples based on physico-chemical properties.

The pattern recognition technique using artificial neural networks should be also tested as classification tool. A neural network with 6 input nodes (the 6 physico-chemical honey characteristics), 4 output nodes (each node corresponding to a given honey group), and 12 nodes in the hidden layer was defined in the frame of Matlab® neural network toolbox. The 360 samples were divided in 252 (70%) samples for training, 54 samples (15%) for testing, and 54 samples (15%) for validation. In this way, the results obtained are reliable, and the final fitted network would be capable to assign unknown samples to a given category. The selected training algorithm was the scaled conjugated gradient. The performance was appreciated based on mean squared error evaluation.

**Figure 12.** Data discrimination along the first and second linear discriminant analysis functions for the four honey type samples.

**Figure 13.** Data discrimination along the first and second linear discriminant analysis functions for unifloral honey samples.

The best results obtained after repeated training steps are represented with the aid of the confusion matrix in **Figure 14**. The number of samples correctly assigned is listed in the green boxes on the diagonal of this matrix, while the red boxes contain the number of incorrect prediction. The overall incorrect assignments represented 10.3%. For individual honey types, 96.7% of acacia honey samples, 81.2% of linden samples, 98.9% colza sets, and 82.2% polyfloral ones were correctly classified.

**Figure 12.** Data discrimination along the first and second linear discriminant analysis functions for the four honey type

**Figure 13.** Data discrimination along the first and second linear discriminant analysis functions for unifloral honey

samples.

54 Honey Analysis

samples.

**Figure 14.** Confusion matrix for unifloral and polyfloral samples classification (1–acacia, 2–linden, 3–colza, 4–polyfloral).

For unifloral honey samples classification, a similar pattern recognition artificial neural network was built, with 6 neurons in the input layer, 3 neurons in the outer layer, and 10 neurons in the hidden layer. A total of 70% of the 270 unifloral honey samples were used for training, 15% for testing, and 15% for validation. The best results obtained led to a correct group assignment with a total error of only 3.3%. For each honey type, the errors in the sample recognition were: 4.4% for acacia, 5.6% for linden, and 0% for colza (**Figure 15**).

**Figure 15.** Confusion matrix for unifloral samples classification (1–acacia, 2–linden, 3–colza).

This case study, as well as those published by other Romanian researchers point out the necessity to set up a comprehensive database containing parameters of honey samples from different regions and harvesting seasons, containing not only the standardized physico-chemical parameters but also details on volatile organic compounds, phenolics, flavonoids, and stable isotopic ratios. Supervised and unsupervised classification tools would benefit from such large statistic samples, allowing a higher degree of generalization for the conclusions drawn.
