vary according to the characteristics of source and site of contamination [72]. A metagenomic analysis of heavy metal-contaminated groundwater revealed metagenomes of γ- and β-Proteobacteria dominated by *Rhodanobacter*-like γ-proteobacterial and *Burkholderia*-like β-proteobacterial species, from a habitat with extremely high levels of uranium, nitrate, technetium and various organic contaminants [73]. Moreover, multiple metagenome projects are under way around the world; we have compiled a list of environmental metagenome projects together with the top phylum, that is, the one with the highest percentage of presence in the metagenomic community (Table 2). Studies on microbial adaptation to toxic environments may help to trace new metagenomic communities useful for efficient bioremediation. Specific functions and interactions of microbial communities with respect to contamination-degrading capabilities can be a result of environment-based gene switching in the metagenomes.

| Top Phylum | Percentage of Presence in Community | Domain | Metagenome Projects | Source |
|---|---|---|---|---|
| Actinobacteria | 38.04 | Bacteria | BASE - Biomes of Australian Soil Environments | Soil |
| Actinobacteria | 38.21 | Bacteria | American Lake Mendota metagenome | Water |
| Proteobacteria | 31.62 | Bacteria | Swedish Lake Vattern metagenome | Water |
| Proteobacteria | 29.68 | Bacteria | Detoxification of arsenic mediated by microbial sulphate reduction in Mediterranean marine sediments | Environmental |
| Unassigned Bacteria | 34.8 | Bacteria | Functional metagenomic profiling of Tibetan Plateau soils affected by permafrost or seasonal freezing | Environmental |
| Euryarchaeota | 22.71 | Archaea | Lonar Lake Sediment prokaryotic metagenome | Environmental |
| Unassigned Bacteria | 53.84 | Bacteria | Metagenome of a microbial consortium obtained from the tuna oil field in the Gippsland Basin, Australia | Soil |
| Actinobacteria | 27.1 | Bacteria | Meta soil | Soil |
| Chlorobi | 56.04 | Bacteria | Antarctica Aquatic Microbial Metagenome | Water |
| Proteobacteria | 48.12 | Bacteria | Illumina and 454-based metatranscriptomic analyses of a diatom-induced bacterioplankton bloom in the North Sea | Environmental |

**Table 2.** List of multiple environmental metagenome projects with top microbe having the highest percentage of presence in the metagenomic community

80 Advances in Bioremediation of Wastewater and Polluted Soil

**6. Bioinformatic tools for metagenomic bioremediation**

Over the last two decades, bioinformatics has advanced and has simultaneously been adapted to many fields of science, from basic to advanced applied sciences [74]. Our previous study gave an overview of the basic applications of bioinformatics in bioremediation [75]. Bioinformatics performs multiple tasks in metagenomic bioremediation, chiefly during metagenomic data analysis [76, 77]. A special issue on bioinformatics approaches and tools for metagenomic analysis provides a comprehensive view of the tools and methodologies used in metagenomics [78].

Metagenomic projects are generating large volumes of sequence data, challenging bioinformatics to develop more robust tools for analyzing them. A recent study characterized soil microbial communities using metagenomic approaches [79]. The researchers used 33 publicly available metagenomes obtained from diverse soil sites and integrated state-of-the-art computational tools to explore the phylogenetic and functional characteristics of the soil microbial communities. Multiple recent advances in bioinformatics bear on metagenomic bioremediation. This section focuses on recent bioinformatic tools and datasets that are widely used to analyze metagenomic data in bioremediation. A comparative overview of the functions and suitability of the most commonly used tools for metagenomic analysis is given in Table 3.

#### **6.1. MEGAN**

MetaGenome Analyzer (MEGAN) is one of the most widely used software tools for efficiently analyzing large volumes of metagenomic sequence data [80, 81]. It is typically used to interactively analyze and compare metagenomic and metatranscriptomic data, both taxonomically and functionally. For taxonomic analysis, the program places reads onto the NCBI taxonomy; functional analysis is performed by mapping reads to the SEED, COG, and KEGG classifications. In addition, samples can be compared taxonomically and functionally using a wide range of charting and visualization techniques, such as co-occurrence plots. The software also performs PCoA (Principal Coordinates Analysis) and clustering, allowing high-level comparison of large numbers of samples [82]. Different attributes of the samples can be captured and used during analysis. Moreover, MEGAN supports several input data formats and can export the results of an analysis in various text-based and graphical formats. Multiple methods of analysis, the ability to accept and compare high-throughput data, robustness, and ease of use are some of the features that have made MEGAN one of the most used metagenome analyzers.
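To make the idea of placing reads onto a taxonomy more concrete, the following minimal sketch assigns each read to the lowest common ancestor (LCA) of the taxa it matches. The text above does not state which placement rule is applied, so the LCA rule is used here as an assumption; the small taxonomy, the read names, and the function names are all hypothetical and heavily simplified compared with the NCBI taxonomy.

```python
# Hypothetical, heavily simplified taxonomy (child -> parent).
# A real analysis would load the full NCBI taxonomy instead.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Betaproteobacteria": "Proteobacteria",
    "Rhodanobacter": "Gammaproteobacteria",
    "Burkholderia": "Betaproteobacteria",
    "Actinobacteria": "Bacteria",
}


def lineage(taxon):
    """Return the path from the root down to `taxon`."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]


def lowest_common_ancestor(taxa):
    """Return the deepest node shared by the lineages of all `taxa`."""
    lineages = [lineage(t) for t in taxa]
    lca = "root"
    # Walk down the lineages in parallel until they disagree.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def place_reads(hits_per_read):
    """Map each read to the LCA of the taxa it hit."""
    return {read: lowest_common_ancestor(hits)
            for read, hits in hits_per_read.items()}


# Hypothetical reads with the taxa of their database hits.
hits = {
    "read_1": ["Rhodanobacter"],
    "read_2": ["Rhodanobacter", "Burkholderia"],
    "read_3": ["Burkholderia", "Actinobacteria"],
}
placement = place_reads(hits)
```

A read with a single hit stays at that taxon, while a read whose hits disagree is moved up to the node where their lineages still agree, which is why ambiguous reads end up at higher ranks.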

#### **6.2. SmashCommunity**

Simple Metagenomics Analysis SHell for microbial communities (SmashCommunity) is a stand-alone metagenomic annotation and analysis pipeline that shares design principles and routines with SmashCell [83]. It is suitable for data from Sanger and 454 sequencing technologies and supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It also provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare the compositions of multiple metagenomes, and to produce intuitive visual representations of such analyses [84]. It provides optimized parameter sets for Arachne and Celera for metagenome assembly, and for GeneMark and MetaGene for predicting protein-coding genes in metagenomes. SmashCommunity also includes scripts for downstream analysis of datasets. These scripts can generate intuitive tree-based visualizations of results using the batch access API of the interactive Tree of Life (iTOL) web tool. SmashCommunity can also compare multiple metagenomes using these quantitative profiles, cluster them based on a relative entropy-based distance measure suited to comparing such profiles, perform bootstrap analysis of the clustering, and generate visual representations of the clustering results.
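As an illustration of a relative entropy-based distance between two quantitative profiles, the sketch below computes the Jensen-Shannon distance. The text does not name the exact measure used by SmashCommunity, so this choice is an assumption, and the function names and input format (a dictionary of non-negative counts per taxon, with a non-zero total) are hypothetical.

```python
import math


def normalize(counts):
    """Turn raw counts into relative abundances that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; terms with p = 0 contribute 0."""
    return sum(pv * math.log2(pv / q[key]) for key, pv in p.items() if pv > 0)


def jensen_shannon_distance(counts_a, counts_b):
    """Square root of the Jensen-Shannon divergence of two profiles."""
    p, q = normalize(counts_a), normalize(counts_b)
    keys = set(p) | set(q)
    # Give both profiles the same keys; absent taxa get abundance 0.
    p = {key: p.get(key, 0.0) for key in keys}
    q = {key: q.get(key, 0.0) for key in keys}
    # The mixture m is non-zero wherever p or q is, so the logs are defined.
    m = {key: 0.5 * (p[key] + q[key]) for key in keys}
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))
```

With base-2 logarithms the distance lies between 0 (identical profiles) and 1 (profiles that share no taxa), so the pairwise values can be passed directly to a clustering routine.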

#### **6.3. CAMERA**

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing, and sharing data about microbial biology through an advanced web-based analysis portal [85]. CAMERA holds a large body of data, including environmental metagenomic and genomic sequence data, associated environmental parameters, pre-computed search results, and software tools to support powerful cross-analysis of environmental samples. CAMERA collects metadata relevant to environmental metagenome datasets and links them with annotation in a semantically aware environment that allows users to write expressive semantic queries against the database. It also provides data submission tools that allow researchers to share and forward data to other metagenomic sites and community data archives. CAMERA is best regarded as a complete genome-analysis resource that allows users to query, analyze, annotate, and compare metagenome and genome data [86].

#### **6.4. MG-RAST**

Rapid Annotation using Subsystems Technology for Metagenomes (MG-RAST) is an automated analysis platform for metagenomes that provides quantitative insights into microbial populations based on sequence data [87]. The pipeline performs quality control, protein prediction, clustering, and similarity-based annotation on nucleic acid sequence datasets using a number of bioinformatic tools. Users can upload raw sequence data in FASTA format; the sequences are normalized and processed, and summaries are generated automatically. The MG-RAST server provides several methods of access to different data types, including phylogenetic and metabolic reconstructions, and can compare the metabolism and annotations of one or more metagenomes and genomes. In addition, the server offers a comprehensive search capability. The pipeline is implemented in Perl using a number of open-source components, including the SEED framework, NCBI BLAST, SQLite, and Sun Grid Engine.
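The kind of preprocessing applied to an uploaded FASTA file can be sketched as follows. This is not MG-RAST's actual pipeline: the function names, the thresholds, and the three filters (minimum length, fraction of ambiguous bases, exact-duplicate removal) are assumptions chosen only to illustrate a basic quality-control step.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Normalize case so that later comparisons are consistent.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def quality_filter(records, min_length=5, max_ambiguous_fraction=0.1):
    """Drop short, ambiguous, or exactly duplicated sequences."""
    seen = set()
    kept = []
    for header, seq in records:
        if not seq or len(seq) < min_length:
            continue
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if ambiguous / len(seq) > max_ambiguous_fraction:
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Hypothetical input: one good read, one short read, one read with many
# ambiguous bases, and one lower-case duplicate of the first read.
EXAMPLE = """>read_1
ACGTAC
GTAC
>read_2
ACG
>read_3
ACNNNNGTAC
>read_4
acgtacgtac
"""
clean = quality_filter(parse_fasta(EXAMPLE))
```

Only `read_1` survives in this toy input; real thresholds would depend on the sequencing technology and would normally be far less strict about length in absolute terms.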


**Table 3.** A comparative overview of the functions and suitability of the most commonly used tools for metagenomic analysis

#### **6.5. IMG/M**

The Integrated Microbial Genomes and Metagenomes (IMG/M) system supports annotation, analysis, and distribution of microbial genome and metagenome datasets. IMG/M provides comparative data analysis tools extended to handle metagenome data, together with metagenome-specific analysis [88, 89]. IMG/M consists of samples of microbial community aggregate genomes integrated with IMG's comprehensive set of genomes from all three domains of life, as well as plasmids, viruses, and genome fragments. Function-based comparison of metagenome samples and genomes is provided by analytical tools that allow examination of the relative abundance of protein families, functional families, or functional categories across metagenome samples and genomes. Registered users gain additional capabilities: tools for handling substantially larger metagenome datasets are available only to registered users as part of the 'My IMG' toolkit, which supports specifying, managing, and analyzing persistent sets of genes, functions, genomes or metagenome samples, and scaffolds.
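A function-based comparison of this kind can be illustrated with a small sketch that converts per-sample protein family counts into relative abundances and lines them up across samples. The sample names, family identifiers, counts, and function names below are all hypothetical, and the code does not reproduce IMG/M's own tools.

```python
def relative_abundance(counts):
    """Convert raw family counts into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {family: 0.0 for family in counts}
    return {family: n / total for family, n in counts.items()}


def abundance_table(samples):
    """Build {family: {sample: relative abundance}} across all samples."""
    profiles = {name: relative_abundance(counts)
                for name, counts in samples.items()}
    families = sorted({f for counts in samples.values() for f in counts})
    # Families absent from a sample are reported with abundance 0.
    return {
        family: {name: profiles[name].get(family, 0.0) for name in samples}
        for family in families
    }


# Hypothetical gene counts per protein family in two samples.
samples = {
    "contaminated_site": {"family_A": 30, "family_B": 10, "family_C": 60},
    "reference_site": {"family_A": 5, "family_B": 45},
}
table = abundance_table(samples)

# Difference in relative abundance (contaminated minus reference).
difference = {
    family: row["contaminated_site"] - row["reference_site"]
    for family, row in table.items()
}
```

Working with fractions rather than raw counts keeps samples of different sequencing depth comparable; whether a given difference is meaningful would still need a statistical test, which is outside this sketch.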

#### **7. Summary**

Metagenomics is a strategic approach for analyzing microbial communities at the genomic level, offering a glimpse of the community of "Uncultured Microbiota". Bioremediation has always adopted new advances in science and technology to establish better environments, and metagenomics can be considered one of the most valuable of these. Identification and screening of metagenomes from polluted environments are crucial in a metagenomic study. The second section highlights recent case studies that illustrate metagenomic approaches in bioremediation. The third section discusses metagenomic bioremediation in different contaminated environments such as soil and water. The fourth section explains sequence- and function-based metagenomic strategies and tools, giving a detailed view of metagenomic screening, FACS, and multiple advanced metagenomic sequencing strategies. The fifth section deals with the prevalent metagenomes in bioremediation, listing prevalent metagenomic organisms and their respective projects. The last section gives a detailed view of the major bioinformatic tools and datasets most commonly used in metagenomic data analysis and processing during metagenomic bioremediation.
