**2. Methodology**

### **2.1. Ribosomal ribonucleic acid genes partial sequence data**

In a previous study [2], partial 26S rRNA gene sequences from 18 palm wine yeast isolates were deposited under accession numbers (HG452325-42). Sequences from three yeast species identified in that study, namely *S. cerevisiae*, *P. kudriavzevii*, and *C. ethanolica*, isolated from *Elaeis* sp. and *Raphia* sp. palm trees, were selected and used to carry out updated searches in this report. For *Elaeis* sp., the sequence accession numbers used were HG425336, HG425328, and HG425333, whereas HG425332, HG425338, and HG425335 were used for the *Raphia* sp. palm tree. The current versions of these six sequences were each used as a separate query in an updated search of the GenBank database. The searches were optimized for highly similar sequences, and for each query the first 100 sequences from relatives with the highest percent identity were marked, giving a shortlist of up to 600 sequences. The features listed for these sequences at the time of submission were examined, and the countries of origin and sources were noted. Sources were classified as beverage, food, or non-food.
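The shortlisting step described above can be illustrated with a minimal Python sketch. It assumes the search hits have already been exported as `(query_id, subject_id, percent_identity)` tuples; that layout, the function name `shortlist_hits`, and the placeholder identifiers are hypothetical and are not taken from the original workflow, which was carried out through the GenBank search interface.

```python
from collections import defaultdict


def shortlist_hits(hits, per_query=100):
    """Keep the `per_query` hits with the highest percent identity per query.

    `hits` is an iterable of (query_id, subject_id, percent_identity) tuples.
    This input layout is an assumption made for illustration only.
    """
    by_query = defaultdict(list)
    for query_id, subject_id, percent_identity in hits:
        by_query[query_id].append((subject_id, percent_identity))

    shortlist = {}
    for query_id, subjects in by_query.items():
        # Highest percent identity first; sorted() is stable, so tied hits
        # keep the order in which they were reported.
        ranked = sorted(subjects, key=lambda s: s[1], reverse=True)
        shortlist[query_id] = ranked[:per_query]
    return shortlist


# Placeholder identifiers, not real accession numbers.
example_hits = [
    ("query_1", "subject_A", 99.0),
    ("query_1", "subject_B", 100.0),
    ("query_1", "subject_C", 98.5),
    ("query_2", "subject_D", 97.0),
]
top_two = shortlist_hits(example_hits, per_query=2)
```

With six queries and `per_query=100`, the combined shortlist holds at most 600 entries, which matches the upper bound stated in the text.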

### **2.2. Construction of phylogenetic trees**

Phylogenetic trees were constructed from the shortlisted sequences using the Molecular Evolutionary Genetics Analysis (MEGA, version 7) software [14], which allowed the sequences to be transferred directly from GenBank. Multiple sequence alignments (MSA) were constructed within the software using the multiple sequence comparison by log-expectation (MUSCLE) method reported by Edgar [15]. The evolutionary history was inferred using the maximum likelihood method based on the Tamura-Nei model [16], and the tree with the highest log likelihood was chosen. Initial trees for the heuristic search were obtained using the maximum composite likelihood approach. Trees were drawn to scale, with branch lengths measured in the number of substitutions per site. All positions containing gaps and missing data were eliminated. The nucleotide composition of the sequences was calculated automatically by switching to the nucleic acids estimation mode of the software, after which the G+C content of each sequence was calculated manually from the displayed percentage distribution of adenine, guanine, cytosine, and thymine. The MSA tool MUSCLE used assumes an equality of substitution rates among sites and takes into account differences in transitional and transversional rates and G+C-content bias [17]. For brevity, only 20 sequences from the initial 100 relatives obtained are shown in the trees, together with the reference sequence.
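The manual G+C calculation mentioned above amounts to dividing the combined guanine and cytosine percentages by the total of the four base percentages. The following Python sketch illustrates that arithmetic; the function names are hypothetical, and the example sequence is invented for illustration rather than taken from the data set.

```python
def base_percentages(sequence):
    """Return the percentage of A, G, C, and T among the unambiguous bases.

    Gaps and ambiguity codes (for example '-' or 'N') are ignored.
    """
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "AGCT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, G, C, or T")
    return {base: 100.0 * n / total for base, n in counts.items()}


def gc_content(a, g, c, t):
    """G+C content (in percent) from the four base percentages or counts."""
    total = a + g + c + t
    if total == 0:
        raise ValueError("base composition sums to zero")
    return 100.0 * (g + c) / total


# Invented example sequence: two of each base, so G+C is half of the total.
pct = base_percentages("AAGGCCTT")
example_gc = gc_content(pct["A"], pct["G"], pct["C"], pct["T"])
```

Normalizing by the sum of the four values, rather than assuming they add up to exactly 100, keeps the result correct when the displayed percentages have been rounded.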

The complete list of the 600 sequences analyzed, showing their sources and countries of origin, is available in the public repository figshare [18].
