EasyCGTree: a pipeline for prokaryotic phylogenomic analysis based on core gene sets

Abstract

Background

Genome-scale phylogenetic analysis based on core gene sets is routinely used in microbiological research. However, the techniques are still not approachable for individuals with little bioinformatics experience. Here, we present EasyCGTree, a user-friendly and cross-platform pipeline to reconstruct genome-scale maximum-likelihood (ML) phylogenetic trees using supermatrix (SM) and supertree (ST) approaches.

Results

EasyCGTree was implemented in the Perl programming language and was built using a collection of published reputable programs. All the programs were precompiled as standalone executable files and contained in the EasyCGTree package. It can run once a Perl environment is installed. Several profile hidden Markov models (HMMs) of core gene sets were prepared in advance to construct a profile HMM database (PHD) that was enclosed in the package and available for homolog searching. Customized gene sets can also be used to build profile HMMs and added to the PHD via EasyCGTree. Taking 43 genomes of the genus Paracoccus as the testing data set, consensus (a variant of the typical SM), SM, and ST trees were inferred via EasyCGTree successfully, and the SM trees were compared with those inferred via the pipelines UBCG and bcgTree, using the metrics of cophenetic correlation coefficients (CCC) and Robinson–Foulds distance (topological distance). The results suggested that EasyCGTree can infer SM trees with nearly identical topology (distance < 0.1) and accuracy (CCC > 0.99) to those of trees inferred with the two pipelines.

Conclusions

EasyCGTree is an all-in-one automatic pipeline from input data to phylogenomic tree with guaranteed accuracy, and is much easier to install and use than the reference pipelines. In addition, ST is conveniently implemented in EasyCGTree and can be used to explore prokaryotic evolutionary signals from a different perspective. EasyCGTree version 4 is freely available for Linux and Windows users on GitHub (https://github.com/zdf1987/EasyCGTree4).

Background

Phylogenetic analysis increasingly relies on genome-based methods, rather than a small number of genes, for interpreting the evolutionary and genetic information of prokaryotes [1, 2]. Two main approaches are most frequently used in inferring phylogenies from large gene collections [2, 3]. The supermatrix (SM) approach uses gene concatenation to reduce stochastic errors and allows the combination of weak phylogenetic signals in different genes. The supertree (ST) approach derives the optimal tree from the analysis of individual genes of interest, which need not be present in every genome. This approach prevents the combination of genes with incompatible phylogenetic histories [4]. The ST can be easily parallelized in practice and does not require as much memory as the SM. Both approaches were originally developed to replace phylogenies based on a limited number of house-keeping genes, which have been widely used historically for characterizing the taxonomy, evolution, and genotypic characterization of prokaryotes and constitute the general framework [2]. In the genomic era, the number of house-keeping genes covered in phylogenies of prokaryotes has increased, such as the rps gene set (53 genes encoding the bacterial ribosome protein subunits) [5], bac120/ar122 gene set [6], and up-to-date bacterial core gene (UBCG) set [7]. Furthermore, core gene sets based on pan-genomes have been defined for phylogenetic analysis within different taxonomic ranks, particularly from species to family [8].

However, the considerable bioinformatics skills needed to analyze a large volume of genomic data and the complex data formats of different applications impede related analysis by beginners. Detecting the core genes of a customized genome data set is time consuming and requires a powerful machine. Several tools have been developed for phylogenomic analysis. The autoMLST pipeline scans conserved single-copy housekeeping genes, and builds a phylogeny by using SM (concatenated gene matrix) or ST (coalescent tree with ASTRAL-III) [9]. The UBCG pipeline uses a gene set, named UBCG, to build an SM phylogeny [7], and GToTree estimates genome completeness and redundancy, and infers an SM phylogenomic tree according to a gene set of custom hidden Markov models (HMMs) or one of its 13 enclosed HMMs [10]. The bcgTree pipeline extracts 107 core genes (included in the essential gene set) by using HMMs and performs a phylogenetic analysis with SM [11]. All these tools can be employed by Linux users only, and specific libraries and third-party software are required for their installation. These requirements are prohibitive for Windows users and biologists not focusing on bioinformatics.

In this study, we introduce EasyCGTree, a user-friendly and cross-platform Perl-language (https://www.perl.org/) tool for constructing genome-scale maximum-likelihood (ML) phylogenetic trees with SM and ST. It uses microbial genomic data (amino acid sequences) as input data, and the profile HMMs of core gene sets for homolog searching. It is an all-in-one automatic pipeline from input data to phylogenomic tree, is highly portable, and can be operated on a personal computer or powerful server running either Linux or Windows.

Implementation

EasyCGTree was implemented in the Perl programming language (https://www.perl.org/) and was built using a collection of published reputable tools, including Clustal Omega version 1.2.4 [12]; consense from PHYLIP version 3.698 [13]; FastTree version 2.1 [14]; hmmbuild and hmmsearch from HMMER version 3.0 (http://hmmer.org/); IQ-TREE version 2.1.1 [15]; trimAl version 1.2 [16]; and wASTRAL version 1.15.2.3 [17]. Most tools required by EasyCGTree were precompiled as standalone executable files and contained in a single package. It can run once a Perl environment is installed. It supports the Linux 64-bit architecture and Windows version 7 and above. Experienced users can easily replace the included tools with precompiled up-to-date versions to update EasyCGTree, and we will update these tools and the main scripts from time to time to ensure longevity. We will also try to develop a macOS version that is as portable as those for Windows and Linux.

Several profile HMMs of core gene sets were prepared in advance for the construction of a profile HMM database (PHD) enclosed in the package and used for homolog searching with HMMER (http://hmmer.org/). Customized gene sets (prepared as gene clusters) can be used to build profile HMMs by EasyCGTree and added to the PHD. Currently, the PHD comprises the following gene sets: bac120, 120 ubiquitous genes (corresponding to 120 protein domains) in the domain Bacteria [18]; ar122, 122 ubiquitous genes (122 protein domains) in the domain Archaea [18]; rp1, 16 ubiquitous ribosomal protein genes (18 protein domains) in prokaryotes [19]; rp2, 23 ubiquitous ribosomal protein genes (27 protein domains) in prokaryotes [20]; UBCG, 92 up-to-date bacterial core genes in Bacteria [7]; ery288, 288 core genes of the family Erythrobacteraceae [21]; and essential, 107 essential single-copy core genes in Bacteria [11]. EasyCGTree can facilitate core-gene based phylogeny inference and is portable. Furthermore, the intermediate data of an EasyCGTree run can be directly used as input data for many other applications. An overview of the workflow is shown in Fig. 1, and more details can be found in the EasyCGTree package manual.

Fig. 1

EasyCGTree flowchart. Note: The algorithm uses FASTA/multi-FASTA formatted amino acid sequences from prokaryotic genomes (i.e., proteome) as input data. EasyCGTree includes the supermatrix (SM) and supertree (ST) approaches to infer phylogeny. Several profile HMMs of core gene sets have been prepared in advance to construct a profile HMM database (PHD) that is enclosed in the package and used for homolog searching. Customized gene sets can also be used to build profile HMMs and added to the PHD

Input

EasyCGTree uses FASTA/multi-FASTA-formatted amino acid sequences from prokaryotic genomes (i.e., proteome) as input (option “-proteome”). The file names of the proteomes are formatted automatically so that the labels can be processed correctly. The command line is checked for validity before a run, including the number of files in the input directory (≥ 5); the sequence type (protein sequence) and format (FASTA-formatted); and consistency among options. Errors are reported to users if any option is set invalidly. Otherwise, a run starts, and a run log is printed on the screen and saved in a log file named according to the name of the input directory and the local starting time.
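The input checks described above can be illustrated with a minimal Python sketch. It is not the EasyCGTree code (which is written in Perl); the function names, the dictionary-based interface, and the heuristic for recognizing protein sequences are assumptions made for illustration only.

```python
# Amino acid one-letter codes plus common ambiguity, stop, and gap symbols.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*-")
# Letters used for amino acids that are not IUPAC nucleotide codes.
PROTEIN_ONLY = set("EFILPQXZJO")


def looks_like_protein_fasta(text):
    """Heuristic check: FASTA header first, residues look like amino acids."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    residues = "".join(ln for ln in lines if not ln.startswith(">")).upper()
    if not residues or not set(residues) <= PROTEIN_LETTERS:
        return False
    # A nucleotide file would contain none of the protein-only letters.
    return any(ch in PROTEIN_ONLY for ch in residues)


def check_proteomes(files, min_files=5):
    """files maps file name -> file content; returns a list of error messages."""
    errors = []
    if len(files) < min_files:
        errors.append(f"need at least {min_files} proteome files, found {len(files)}")
    for name, text in files.items():
        if not looks_like_protein_fasta(text):
            errors.append(f"{name}: not a FASTA-formatted protein file")
    return errors
```

An empty list returned by `check_proteomes` corresponds to a run that would be allowed to start; any message corresponds to an error report.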

Gene calling

An HMM file of a gene set can be specified by using the option “-hmm” (default bac120) and is used in searching homologs against each proteome with hmmsearch from the HMMER package (http://hmmer.org). An E value of 1e-10 is used as the default threshold for HMM searching, and it can be modified by using the option “-evalue”.
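As an illustration of this step, the sketch below parses the tabular output that hmmsearch writes with its `--tblout` option and keeps the best hit per profile under an E value threshold. Whether EasyCGTree reads this particular output format is an assumption; the function name and the sample records are hypothetical.

```python
# Toy excerpt in the whitespace-delimited layout of hmmsearch --tblout:
# target name, accession, query name, accession, E-value, score, bias, ...
SAMPLE = """\
# target name accession query name accession E-value score bias
seqA - geneX PF00001.1 3.0e-50 170.2 0.3
seqB - geneX PF00001.1 1.0e-12 48.9 0.1
seqC - geneY PF00002.1 2.0e-05 20.1 0.0
"""


def top_hits(tblout_text, evalue_cutoff=1e-10):
    """Map each profile (query name) to its best hit passing the cutoff."""
    best = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split()
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue > evalue_cutoff:
            continue
        # Keep the hit with the smallest full-sequence E value.
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue, score)
    return best
```

With the default threshold, only `geneX` retains a hit in the sample; relaxing the threshold lets the weaker `geneY` hit through as well.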

Filter hits and generate clusters

The top hit of each gene is screened according to the E value threshold. Genomes with few genes detected and genes with low prevalence are excluded by applying the options “-genome_cutoff” and “-gene_cutoff”, respectively. Subsequently, the homologs of the selected genes are retrieved from the selected proteomes and prepared as gene clusters. A gene prevalence report can be generated by using the Perl script “Gene_Prevelence.pl” from the EasyCGTree package.
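The filtering logic can be sketched as follows. The exact semantics of “-genome_cutoff” and “-gene_cutoff” are not spelled out here, so this sketch assumes that both are fractions between 0 and 1; the function name and the default values are hypothetical.

```python
def filter_hits(presence, genome_cutoff=0.8, gene_cutoff=0.8):
    """presence maps genome name -> set of detected gene names.

    Assumed semantics: a genome is kept if it carries at least
    genome_cutoff of all genes; a gene is kept if it occurs in at
    least gene_cutoff of the retained genomes.
    """
    all_genes = set().union(*presence.values())
    genomes = [g for g, genes in presence.items()
               if len(genes) >= genome_cutoff * len(all_genes)]
    if not genomes:
        return [], []
    kept_genes = sorted(
        gene for gene in all_genes
        if sum(gene in presence[g] for g in genomes) >= gene_cutoff * len(genomes)
    )
    return genomes, kept_genes
```

For example, with three genomes carrying four, three, and one of four genes, a genome cutoff of 0.5 drops the third genome, and a gene cutoff of 1.0 then keeps only the genes shared by the two remaining genomes.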

Build alignments

MUSCLE [22] is used for multiple sequence alignment in Windows to ensure accuracy, and Clustal Omega [12] is used in Linux. The reasons are that Clustal Omega is faster than MUSCLE for extremely large alignments, and Linux is preferred on powerful servers. Experienced users can employ other pre-compiled tools for alignment by modifying several lines of code in the EasyCGTree script.

Trimming

The tool trimAl [16] was used for alignment trimming and conserved segment selection, and three different automatic methods (i.e. gappyout, strict and strictplus) were implemented using trimAl for the selection of different thresholds on the basis of MSA features. A standard for trimming alignment used by trimAl can be set by using the option “-trim” (default strict). The strict method combines a gappyout trimming with a subsequent trimming based on an automatically selected similarity threshold. More information can be found in the trimAl manual (http://trimal.cgenomics.org/trimal).

Phylogeny inference

SM and ST [1, 2] can be set using the option “-tree” (default supermatrix; -tree sm). The two tree-inference programs, namely FastTree [14] and IQ-TREE [15], can be specified by using the option “-tree_app” (default FastTree). FastTree was chosen as the default mainly so that an initial phylogenomic analysis can be completed readily, given its higher speed and lower memory requirement. IQ-TREE is highly recommended for its accuracy when a powerful machine or a small input dataset is used [23]. If SM is used, the trimmed alignments are concatenated, and the concatenation is used to infer the phylogeny with the selected program. If a proteome is missing from a gene cluster, it is treated as gaps in the corresponding segment of the concatenation. If ST is used (-tree st), gene trees are inferred from each trimmed alignment of the gene clusters (not all taxa need to be present), and then the program wASTRAL [17] is employed to combine them into a single ST with the hybrid method, which considers phylogenetic uncertainty by integrating signals from branch length and branch support in gene trees.
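The gap-filling behaviour of the SM concatenation can be shown with a short Python sketch. The data layout (one dictionary per trimmed alignment) and the function name are illustrative assumptions rather than the pipeline's actual implementation.

```python
def concatenate(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping taxon -> aligned sequence.
    A taxon missing from an alignment is padded with gap characters
    over the full width of that alignment.
    """
    rows = {t: [] for t in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for t in taxa:
            rows[t].append(aln.get(t, "-" * width))
    return {t: "".join(parts) for t, parts in rows.items()}
```

For two genes where taxon B lacks the second one, `concatenate([{"A": "MK-L", "B": "MKQL", "C": "MRQL"}, {"A": "GG", "C": "GA"}], ["A", "B", "C"])` gives B the row `MKQL--`, so every row in the supermatrix has the same length.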

In addition, the classic consensus technique (e.g., majority rule consensus tree; -tree cs) [24] was included in EasyCGTree to explore consistency among core gene trees and can be regarded as a variant of the typical SM with concatenation [1]. For this approach, the option “-gene_cutoff” will be set as “1” to ensure that each taxon would be present in all the gene trees inferred from the trimmed alignment of each gene cluster. Then the program consense [13] will be used to generate a consensus tree of the type “Majority rule (extended)”. This type of consensus tree uses the following principles: any set of taxa will be included if it appears in more than 50% of the gene trees; the other sets of taxa are considered in the order of the frequency of their appearance, and added to the consensus tree compatible with it; all the taxa will be added to the consensus tree until the tree is fully resolved.
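The principles of the extended majority-rule consensus can be sketched in Python as follows. This is a simplified illustration and not the consense code: trees are written as nested tuples, clades are treated as rooted, and the function names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    found.discard(walk(tree))  # the clade of all taxa carries no information
    return found


def compatible(a, b):
    """Two clades fit in one tree if they are nested or disjoint."""
    return a <= b or b <= a or not (a & b)


def extended_majority_rule(trees):
    """Return accepted clades with the fraction of gene trees containing them."""
    counts = Counter(c for t in trees for c in clades(t))
    # Most frequent clades first; clades found in more than half of the trees
    # are always mutually compatible, so they are accepted before the rest.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    accepted = {}
    for clade, n in ordered:
        if all(compatible(clade, other) for other in accepted):
            accepted[clade] = n / len(trees)
    return accepted
```

With two gene trees grouping A with B and one grouping A with C, the clade (A, B) is accepted with a support of 2/3, and the conflicting clade (A, C) is rejected.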

Options within FastTree and IQ-TREE can be changed by editing the file “tree_app-options.txt” attached to the EasyCGTree package. Notably, the specified tree-inference program determines the maximum memory required by EasyCGTree. Users are encouraged to refer to the documentation of FastTree and IQ-TREE to evaluate whether their machines can run EasyCGTree successfully. If a run cannot be completed on a machine, a smaller input data set, a gene set with fewer genes, or a more powerful machine should be considered.

Outputs

All the data generated during a run were recorded, including HMM searching result, sequence of gene clusters, alignment, gene tree, and running log. The final phylogenetic tree was written in Newick format. Users can display it via FigTree (http://figtree-international.com/), MEGA [25], iTOL [26] or other tree viewers.

Profile HMM database (PHD)

For the ar122, bac120, essential, rp1, rp2, and UBCG gene sets, the accession numbers of the genes included in each gene set were retrieved from previous reports [7, 11, 18, 19, 20]. All the HMMs were downloaded (on July 7, 2022) from the Pfam-A (www.pfam.org/) and NCBI (https://ftp.ncbi.nih.gov/hmm/) HMM databases. Subsequently, the HMM of each gene was retrieved from the local HMM databases and merged into a single file according to the accession list of each gene set. For the ery288 gene set, the alignments of each gene cluster were retrieved from our previous study [27] and built with the Perl script “BuildHMM.pl” from the EasyCGTree package. Customized gene sets prepared as gene clusters can also be used in building profile HMMs and added to the local PHD via this script.
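Retrieving profiles by accession and merging them into one file can be illustrated with the Python sketch below. It relies only on two features of HMMER3 text profiles, namely that each record ends with a “//” line and may carry an “ACC” line; the function names and the toy records (which are not complete, usable profiles) are hypothetical, and the sketch is not the procedure used to build the PHD.

```python
# Two truncated toy records; real profiles contain many more lines.
DEMO = (
    "HMMER3/f [toy]\nNAME  geneA\nACC   PF00001.5\n//\n"
    "HMMER3/f [toy]\nNAME  geneB\nACC   TIGR00002\n//\n"
)


def split_hmm_records(text):
    """Yield one profile at a time; each record ends with a '//' line."""
    record = []
    for line in text.splitlines(keepends=True):
        record.append(line)
        if line.strip() == "//":
            yield "".join(record)
            record = []


def record_accession(record):
    """Return the value of the ACC line, or None if the record has none."""
    for line in record.splitlines():
        if line.startswith("ACC"):
            return line.split()[1]
    return None


def select_hmms(text, wanted):
    """Merge the records whose accession (version suffix ignored) is wanted."""
    wanted = set(wanted)
    kept = []
    for rec in split_hmm_records(text):
        acc = record_accession(rec)
        if acc and acc.split(".")[0] in wanted:
            kept.append(rec)
    return "".join(kept)
```

Calling `select_hmms(DEMO, ["PF00001"])` returns only the first toy record, because the version suffix “.5” is ignored when matching.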

Results and discussion

Examples in publications

EasyCGTree has been used and cited in some publications, most of which describe novel bacterial taxa. Phylogenies inferred with EasyCGTree were compared with those in reports on closely related taxa, and similar topologies were observed between them. The current taxonomic framework of the family Erythrobacteraceae was established mainly by using a phylogeny based on 288 core genes [21], a gene set named ery288 in this study. With ery288 and the substitution model LG + F + R9 for IQ-TREE, as in the previous report (with manually controlled workflow) [21], an SM phylogeny with an identical topology was recovered by using EasyCGTree [28]. Most genera of this family can be recovered as monophyletic groups in the phylogeny of ery288 inferred with FastTree employed by EasyCGTree, except the genera Alteriqipengyuania, Croceibacterium, and Qipengyuania [27]. Although lower accuracy was observed, EasyCGTree employing FastTree (< 40 min) had an advantage over employing IQ-TREE (> 12 h) in terms of elapsed time when a personal computer (Intel Core i7-9700 CPU and 16 GB RAM) running Windows 10 was used.

The consensus approach in EasyCGTree was successfully applied in clarifying the relationships between the genera Marmoricola and Nocardioides [29], and in the classification of strain HHU G3-2 as a new species of the genus Aestuariicella [30]. However, the ST approach has rarely been applied in prokaryotic studies, and we could not perform further validation of the ST approach implemented in EasyCGTree. In a recent study, we applied the EasyCGTree package to build the HMMs of the genes involved in the ammonification metabolism pathway, and the prevalence of related gene families in the genus Alteromonas was extensively characterized [31].

Performance compared with other tools

A comparison of the features of EasyCGTree and other tools with similar functions mentioned above is summarized in Table 1. EasyCGTree includes third-party software also used by others: trimAl [16] is also employed by autoMLST [9] and GToTree [10]; MUSCLE [22] also by GToTree [10] and bcgTree [11]; FastTree [14] by UBCG and GToTree [10]; and IQ-TREE [15] by autoMLST [9]. The main advantages of EasyCGTree are that it supports Windows and does not require preinstalled software. Thus, it is easier to use than other pipelines.

Table 1 A comparison of the features among EasyCGTree and other pipelines with similar functions

Subsequently, we conducted phylogenomic analysis of the genus Paracoccus by using EasyCGTree and two other easy pipelines (UBCG version 3.0 [7] and bcgTree version 1.2.0 [11]) to evaluate their performance on a Dell PowerEdge T430 server (2 × Xeon E5-2680 v4 CPU, 128 GB RAM) running Ubuntu 18.04.4 LTS. GToTree [10] was not included because the links for downloading the HMMs were broken, and autoMLST [9] was not included because it was not portable enough (with databases ~ 25 GB). A total of 43 genomic datasets of the genus Paracoccus and an outgroup were downloaded from the RefSeq database on the NCBI server (https://www.ncbi.nlm.nih.gov/; Additional file 1: Table S1).

As shown in Fig. 2, the SM tree inferred using EasyCGTree with default options (except that the UBCG gene set as the HMM was used) had a similar topology with stronger support (support values > 0.85) than that (support values > 0.59) obtained with the UBCG pipeline with default options (Additional file 1: Table S2). In particular, P. alcaliphilus formed a later branching lineage (support value 0.88) of the genus than Clade 2 in one tree (Fig. 2A) but closely neighboured Clade 2 with low-level confidence (support value 0.59) in the other tree (Fig. 2B). We checked the concatenations produced via the EasyCGTree and UBCG pipelines, and found that they included 26 089 and 30 232 amino acid positions, respectively. The appearance of differences in topology and confidence may depend on whether a trimming method was applied (Additional file 1: Table S2). This suggested that EasyCGTree can infer more reliable phylogeny than the UBCG pipeline. With regard to elapsed time, EasyCGTree took 6.5 min for the run, whereas the UBCG pipeline took 18 min. It was notable that the first step of the UBCG run (converting genome sequence to bcg files one by one manually) accounted for nearly 16 min. No convenient method was provided with the UBCG pipeline to help users at this laborious step [7]. Thus, the UBCG pipeline is unsuitable for handling hundreds of input datasets.

Fig. 2

Phylogenomic trees of the genus Paracoccus constructed using the pipelines EasyCGTree, UBCG, and bcgTree. Note: A EasyCGTree using 90 genes of the UBCG gene set (gene families TIGR03625 and TIGR01044 were excluded because of low prevalence) as the HMM and other default options. B UBCG using default options. C EasyCGTree using the essential gene set as the HMM and IQ-TREE for phylogeny inference with best-fit model Q.yeast + F + I + I + R5. D bcgTree using default options. Clades with identical topology among the four trees are collapsed and marked in the same color. Support confidence (A, 0–1; B, 0–1/≤ 0–92; C and D, 0–100) is indicated in the middle of branches or near the nodes. Support values in A and the former of those in B are calculated by using the Shimodaira-Hasegawa test with FastTree, and the latter of those in B are the gene support indices (GSI, the gene trees supporting the clade) generated by the UBCG pipeline. Support values in C and D are standard bootstraps generated by IQ-TREE and RAxML, respectively. Support values, > 0.7 for A and B, 64 for B (> 70% out of the 92 gene trees), and 70 for C and D, are considered as significant. All the trees were rooted at the outgroup Roseobacter litoralis Och 149 (GCF_000154785.2), which was omitted. Bar, 0.05 substitutions per amino acid position

We used IQ-TREE in EasyCGTree to build a phylogenomic tree of the genus Paracoccus on the basis of the essential gene set, and compared the tree with that made via bcgTree [11] using default options (Fig. 2C, D; Additional file 1: Table S2). The trees had nearly identical topologies, and most nodes were well supported (bootstrap > 90), except that Clades 6 and 7 clustered together in Fig. 2C, and Clade 7 clustered with Clade 8 and another two species and then with Clade 6 in Fig. 2D. In particular, P. alcaliphilus was placed in a position identical to that in Fig. 2A, but it was not well supported in either tree (bootstraps 57 and 30, respectively). EasyCGTree and bcgTree used different methods for alignment (Clustal Omega [12] vs. MUSCLE [22]), trimming (trimAl [16] vs. Gblocks [32]), and phylogeny inference (IQ-TREE [15] vs. RAxML [33]) (Additional file 1: Table S2). This result suggested that EasyCGTree can conduct phylogenomic analysis as reliably as bcgTree. EasyCGTree took 4 h and 44 min (including best-fit model selection), whereas the bcgTree pipeline took 12 h and 14 min. When 50 threads were used instead, EasyCGTree and bcgTree took only 31 min and 1 h and 53 min, respectively. EasyCGTree had considerably higher efficiency than bcgTree because IQ-TREE [15] is much faster than RAxML [33].

To further compare phylogenetic topologies among the four trees, we calculated pairwise cophenetic correlation coefficients (CCCs) and topological distance (i.e., Robinson-Foulds distance, RF) from the Newick files with the dendextend [34] and ape (http://ape-package.ird.fr/) packages in R v4.2.2 (https://github.com/rstudio/rstudio), respectively (Table 2). The CCC is the correlation between the cophenetic distance matrices of two trees, and the value can range from − 1 to 1. Values near 1 mean that the two trees are nearly identical. The CCC values of > 0.99 suggested that EasyCGTree can produce trees nearly identical to those produced with similar methods by UBCG and bcgTree. When trees based on the UBCG gene set were compared with those based on the essential gene set, the CCC decreased only slightly (0.957–0.982; Table 2). The RF distance is originally defined as twice the number of internal branches defining different bipartitions of the tips [35]. The branch length score used in this study is similar to the previous distance but considers branch lengths [36]. The results suggested limited topological distance among the four trees (< 0.1; Table 2), although no clear pattern emerged as it did in the CCC analysis. Overall, these results indicated that EasyCGTree can construct SM trees with topologies comparable to those of UBCG and bcgTree.
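The two metrics can be illustrated with a small self-contained Python sketch. It does not reproduce the dendextend or ape implementations used in this study: trees are written as nested tuples without branch lengths for the RF part, the CCC is computed as a Pearson correlation over user-supplied pairwise distances, and all function names are hypothetical.

```python
from math import sqrt


def tips(node):
    """Return the set of tip labels below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))


def bipartitions(tree):
    """Return the non-trivial tip bipartitions induced by internal branches."""
    everything = tips(tree)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        side = frozenset().union(*(walk(child) for child in node))
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        return side

    walk(tree)
    return splits


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))


def normalized_rf(t1, t2):
    """RF scaled to 0-1, assuming fully resolved unrooted trees on the same tips."""
    n = len(tips(t1))
    return rf_distance(t1, t2) / (2 * (n - 3))


def cophenetic_correlation(d1, d2):
    """Pearson correlation of two pairwise-distance tables with identical keys."""
    keys = sorted(d1)
    x = [d1[k] for k in keys]
    y = [d2[k] for k in keys]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
```

Two five-taxon trees that differ only in whether A pairs with B or with C share one of their two internal bipartitions, which gives an RF distance of 2 and a normalized value of 0.5. Two distance tables that differ only by a constant scaling factor give a correlation of 1.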

Table 2 Pairwise cophenetic correlation coefficients (CCC) and Robinson-Foulds distance (RF) among supermatrix (SM) trees in this study

In addition, EasyCGTree was used in constructing a consensus tree and an ST of the genus Paracoccus (Additional file 1: Figure S1). We could not assess the accuracy of the two approaches, because no portable tool that can perform similar analysis is available. The CCC and topological distance analyses could not be performed because the two trees lack conventional branch lengths. However, the consensus and ST trees (Additional file 1: Figure S1) had topologies similar to those of the four SM trees in Fig. 2, although Clades 6 and 7 were divided and P. alcaliphilus clustered with P. amoyensis and P. caeni in the ST tree, which was not observed in the other trees. This finding suggested that the ST constructed by EasyCGTree is also useful for prokaryotic phylogenomic analysis, as it may draw evolutionary signals from a different perspective, although the ST approach is not widely used in prokaryotes currently. Similar to SM, ST is regarded as a critical way to corroborate an evolutionary hypothesis and to infer a species tree. We expect ST to be beneficial for recent or rapidly diverging lineages of prokaryotes, as documented for eukaryotes [1, 9, 24, 37].

On the basis of results from the six trees (Fig. 2; Additional file 1: Figure S1), the evolutionary positions of most clades (Clades 1–5) and lineages were regarded as well resolved in the genus Paracoccus, because their positions were consistent among at least five trees (83%) and most of the support values were significant (> 0.7, > 70, or > 64 gene trees). Nevertheless, the positions of P. alcaliphilus and species in Clades 6–8 need further clarification with enhanced methods or more reasonable gene sets, because low support values and low levels of agreement (≤ three trees) were found among the trees in this study.

Conclusions

We presented the portable, flexible, and cross-platform tool EasyCGTree for genome-based phylogenetic tree reconstruction with SM and ST. Compared with other tools, EasyCGTree was much easier to install and use, and the robustness and accuracy were guaranteed. This tool will benefit microbiologists, especially individuals who use a computer running Windows or do not focus on bioinformatics.

Availability and requirements

Project name: EasyCGTree

Project home page: https://github.com/zdf1987/EasyCGTree4

Operating system(s): Linux, Windows

Programming language: Perl

Other requirements: Perl 5.0 or higher

License: GNU GPL

Any restrictions to use by non-academics: license needed

Availability of data and materials

EasyCGTree version 4 is freely available at GitHub (https://github.com/zdf1987/EasyCGTree4). All information regarding installation and application of the pipeline is provided.

Abbreviations

HMM: Hidden Markov model

ML: Maximum-likelihood

PHD: Profile HMM database

SM: Supermatrix

ST: Supertree

UBCG: Up-to-date bacterial core gene

References

  1. Snel B, Huynen MA, Dutilh BE. Genome trees and the nature of genome evolution. Annu Rev Microbiol. 2005;59:191–209.

  2. Hugenholtz P, Chuvochina M, Oren A, Parks DH, Soo RM. Prokaryotic taxonomy and nomenclature in the age of big sequence data. ISME J. 2021;15:1879–92.

  3. McInerney JO, Cotton JA, Pisani D. The prokaryotic tree of life: past, present... and future? Trends Ecol Evol. 2008;23:276–81.

  4. Pisani D, Cotton JA, McInerney JO. Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol Biol Evol. 2007;24:1752–60.

  5. Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, et al. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012;158:1005–15.

  6. Parks DH, Chuvochina M, Chaumeil PA, Rinke C, Mussig AJ, Hugenholtz P. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol. 2020;38:1079–86.

  7. Na SI, Kim YO, Yoon SH, Ha SM, Baek I, Chun J. UBCG: up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol. 2018;56:280–5.

  8. Vernikos G, Medini D, Riley DR, Tettelin H. Ten years of pan-genome analyses. Curr Opin Microbiol. 2015;23:148–54.

  9. Alanjary M, Steinke K, Ziemert N. AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential. Nucleic Acids Res. 2019;47:W276–82.

  10. Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019;35:4162–4.

  11. Ankenbrand MJ, Keller A. bcgTree: automatized phylogenetic tree building from bacterial core genomes. Genome. 2016;59:783–91.

  12. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li WZ, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7:539.

  13. Revell LJ, Chamberlain SA. Rphylip: an R interface for PHYLIP. Methods Ecol Evol. 2014;5:976–81.

  14. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol. 2009;26:1641–50.

  15. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37:1530–4.

  16. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3.

  17. Zhang C, Mirarab S. Weighting by gene tree uncertainty improves accuracy of Quartet-based species trees. Mol Biol Evol. 2022;39:msac215.

  18. Parks DH, Rinke C, Chuvochina M, Chaumeil PA, Woodcroft BJ, Evans PN, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol. 2017;2:1533–42.

  19. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature. 2015;523:208–11.

  20. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499:431–7.

  21. Xu L, Sun C, Fang C, Oren A, Xu XW. Genomic-based taxonomic classification of the family Erythrobacteraceae. Int J Syst Evol Microbiol. 2020;70:4470–95.

  22. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.

  23. Zhou XF, Shen XX, Hittinger CT, Rokas A. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol Biol Evol. 2018;35:486–503.

  24. Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA. Properties of consensus methods for inferring species trees from gene trees. Syst Biol. 2009;58:35–54.

  25. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35:1547–9.

  26. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:W293–6.

  27. Xue HP, Zhang DF, Xu L, Wang XN, Zhang AH, Huang JK, et al. Actirhodobacter atriluteus gen. nov., sp. nov., isolated from the surface water of the Yellow Sea. Antonie Van Leeuwenhoek. 2021;114:1059–68.

  28. Zhang XM, Zhang DF, Zhang YL. Altererythrobacter flava sp. nov., a new member of the family Erythrobacteraceae, isolated from a surface seawater sample. Antonie Van Leeuwenhoek. 2021;114:497–506.

  29. Wang L, Xue HP, Zhang DF, Huang JK, Liu C, Zhang AH. Description of Nocardioides jiangsuensis sp. nov., and proposal for reclassification of the genus Marmoricola as Nocardioides. Curr Microbiol. 2023;80:60.

  30. Xue HP, Li JY, Zhang DF, Zhang AH, Huang JK, Liu C, et al. Aestuariicella albida sp. nov., isolated from surface water of the Yellow Sea, and proposal of the genus Aestuariicella as a member of the family Cellvibrionaceae. Int J Syst Evol Microbiol. 2022;72:005493.

  31. He W, Xue HP, Liu C, Zhang AH, Huang JK, Zhang DF. Biomineralization of struvite induced by indigenous marine bacteria of the genus Alteromonas. Front Mar Sci. 2023;10:1085345.

  32. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–52.

  33. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3.

  34. Galili T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics. 2015;31:3718–20.

  35. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–47.

  36. Rzhetsky A, Nei M. A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol. 1992;9:945–67.

  37. Simmons MP, Gatesy J. Coalescence vs. concatenation: sophisticated analyses vs. first principles applied to rooting the angiosperms. Mol Phylogenet Evol. 2015;91:98–122.

Acknowledgements

Not applicable.

Funding

This work was supported by the National Natural Science Foundation of China (No. 31900001), the Innovation Project for Marine Science and Technology of Jiangsu Province (JSZRHYKJ202209), and the China Postdoctoral Science Foundation (2020M671312).

Author information

Contributions

DFZ, ZS and WJL conceptualized the idea for the EasyCGTree package. DFZ and WH wrote the code and developed the package and documentation. DFZ and IA wrote the first draft of the manuscript. YQZ and ZZ critically reviewed, revised, and approved the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Dao-Feng Zhang or Zongze Shao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

Table S1. Genomic information of strains used for phylogenomic analysis in this study; Table S2. Performance of EasyCGTree, UBCG, and bcgTree in phylogenomic analysis of the genus Paracoccus; Figure S1. Consensus tree and supertree (ST) of the genus Paracoccus inferred from the UBCG gene set with EasyCGTree.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhang, DF., He, W., Shao, Z. et al. EasyCGTree: a pipeline for prokaryotic phylogenomic analysis based on core gene sets. BMC Bioinformatics 24, 390 (2023). https://doi.org/10.1186/s12859-023-05527-2

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-023-05527-2

Keywords