PoSE: visualization of patterns of sequence evolution using PAML and MATLAB

Abstract

Background

Determining patterns of nucleotide and amino acid substitution is the first step in sequence evolution analysis. However, it is not easy to visualize the different phylogenetic signatures imprinted in aligned nucleotide and amino acid sequences.

Results

Here we present PoSE (Pattern of Sequence Evolution), a reliable resource for unveiling the evolutionary history of sequence alignments and for graphically displaying their contents. Substitutions are displayed by category (transitions and transversions), codon position, and phenotypic effect (synonymous and nonsynonymous). Visualization is accomplished using MATLAB scripts wrapped around PAML (Phylogenetic Analysis by Maximum Likelihood) and implemented in an easy-to-use graphical user interface. The application displays inferred substitutions estimated by baseml or codeml, two programs included in the PAML software package. PoSE organizes patterns of substitution in eleven plots, including estimated nonsynonymous/synonymous ratios (dN/dS) along the sequence alignment. In addition, PoSE provides visualization and annotation of patterns of amino acid substitutions along groups of related sequences, which can be graphically inspected in a phylogenetic tree window.

Conclusions

PoSE is a useful tool for identifying major patterns in the evolution of protein-coding sequences, such as hypervariable regions or changes in dN/dS ratios. PoSE is publicly available at https://github.com/CDCgov/PoSE

Background

Most molecular evolution analyses, such as estimating genetic distances or inferring a phylogenetic tree, depend on choosing a model of substitution. This initial step relies on determining patterns of substitution, that is, on a quantitative analysis of the mutations found in an alignment. Although this may be a relatively quick computational step, understanding how substitutions accumulated and visualizing substitution patterns along the alignment provide a wealth of useful information about the dynamics of nucleotide and amino acid change. There are several approaches to tracking unique changes along a phylogenetic path. Here, we present work using ancestral reconstruction as implemented in the software package Phylogenetic Analysis by Maximum Likelihood (PAML) [1] for visualizing evolution patterns. The main strength of PAML lies in its rich repertoire of evolutionary models, which are used to estimate parameters in models of sequence evolution or to test biological hypotheses. Inferred substitutions can be obtained as an optional output of the baseml and codeml programs within PAML, which generate an additional output file (the rst file). This file contains the inferred unique substitutions along the phylogenetic tree and is not straightforward to interpret, particularly for large data sets. To visualize the information stored in it, we used MATLAB for capturing, processing, and graphically displaying all changes inferred by PAML. We used this new resource to analyze sequence alignments of rapidly evolving RNA viruses. For example, we examined the pattern of nucleotide and amino acid substitutions to calibrate poliovirus molecular clocks [2] and to define the evolutionary dynamics of circulating vaccine-derived poliovirus (cVDPV) emergences [3].

Implementation

PoSE is an open-source application package with an easy-to-follow graphical user interface (GUI) built in MATLAB. PoSE benefits from MATLAB’s software environment and versatile language syntax for processing large data sets and rendering high-quality graphics. The script also benefits from MATLAB’s extensive development for scientific computing, including applications in bioinformatics and genetic data streamlining (http://www.mathworks.com/company/user_stories/centers-for-disease-control-and-prevention-automates-poliovirus-sequencing-and-tracking.html?by=product). Its GUI was coded using procedural programming, which facilitates the addition of future features to the script.

PoSE comprises over 8,000 lines of MATLAB source code stored in 5 folders and is optimized for MATLAB R2015b and later versions. PoSE processes the rst output file generated by baseml or codeml, produces eleven graphical results, and interactively displays inferred nucleotide and amino acid substitutions along the phylogenetic tree. The compiled version of PoSE includes all runtime libraries needed to run independently of the MATLAB environment, so users do not need MATLAB in order to run PoSE. The compiled version runs in Windows and Mac (10.10–10.13) environments.

The input file for PoSE is the rst file generated after running baseml or codeml in PAML. This file captures the unique nucleotide and amino acid changes along a phylogenetic tree. PoSE requires rst files generated from protein-coding sequence alignments free of gaps and ambiguous bases. Users should consult the PAML manual for questions about running baseml or codeml and about the treatment of gaps and ambiguities before running PAML.
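Because the rst file must come from a gap-free, unambiguous coding alignment, a quick pre-check of the alignment before running baseml or codeml may be helpful. The following is a minimal Python sketch under stated assumptions; it is not part of PoSE or PAML. The function name `check_coding_alignment`, the in-memory dictionary of sequences, and the restriction to the DNA alphabet A, C, G, T are all illustrative choices.

```python
def check_coding_alignment(seqs):
    """Return a list of problems found in a dict of name -> aligned sequence.

    Hypothetical helper (not part of PoSE or PAML). It flags unequal
    sequence lengths, lengths that are not a multiple of three, gap
    characters ('-' or '.'), and any other character outside A, C, G, T.
    """
    problems = []
    if len({len(s) for s in seqs.values()}) > 1:
        problems.append("sequences differ in length")
    for name, seq in seqs.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        for pos, base in enumerate(seq.upper(), start=1):
            if base not in "ACGT":
                kind = "gap" if base in "-." else "ambiguous base"
                problems.append(f"{name}: {kind} '{base}' at position {pos}")
    return problems


# Toy alignment, invented for illustration only.
alignment = {
    "seq1": "ATGGCTAAA",
    "seq2": "ATGGCNAAA",  # N is an ambiguity code
    "seq3": "ATG-CTAAA",  # '-' is a gap
}
for problem in check_coding_alignment(alignment):
    print(problem)
```

An empty list means the alignment passed these particular checks; it does not guarantee that the reading frame is correct.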

Each of the eleven plots can be printed or exported as an image in PDF format. In addition, all substitutions displayed in PoSE can be exported to an Excel spreadsheet that includes a Markov matrix of conditional probabilities of observing each type of nucleotide substitution [4]. After reading the rst file from codeml, PoSE annotates a phylogenetic tree by mapping all nucleotide and corresponding amino acid substitutions occurring on both external and internal branches of the tree. Displayed trees can be exported as an annotated Newick-format file for further inspection using specialized phylogenetic programs such as FigTree (http://tree.bio.ed.ac.uk/software/figtree/).

Results

Visualizations display nucleotide and amino acid substitutions occurring along a user-defined sequence interval (summary plots via baseml) and along the phylogenetic tree (via codeml). Transition (Ts: A↔G, C↔T) and transversion (Tv: A↔C, A↔T, G↔C, G↔T) substitutions are analyzed by codon position (Fig. 1). Frequency plots then summarize the overall accumulation of Ts and Tv and the accumulation of each substitution type within transitions and transversions (Fig. 2).
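The classification described above can be illustrated with a short Python sketch. It is not taken from the PoSE source code: the function names, the tuple layout of the input, and the toy substitutions are assumptions made for illustration, and sites are assumed to be numbered from 1 in an in-frame coding alignment.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(old, new):
    """Return 'Ts' for a transition (A<->G, C<->T), 'Tv' for a transversion."""
    if old not in "ACGT" or new not in "ACGT" or len(old) != 1 or len(new) != 1:
        raise ValueError("bases must be one of A, C, G, T")
    if old == new:
        raise ValueError("no substitution: bases are identical")
    pair = {old, new}
    return "Ts" if pair <= PURINES or pair <= PYRIMIDINES else "Tv"


def codon_position(site):
    """Map a 1-based nucleotide site to its codon position (1, 2, or 3)."""
    return (site - 1) % 3 + 1


# Hypothetical inferred substitutions: (site, ancestral base, derived base).
substitutions = [
    (1, "A", "G"),
    (3, "C", "T"),
    (5, "A", "C"),
    (6, "G", "A"),
    (9, "T", "G"),
]

# Tally substitutions by (class, codon position), as summarized in Figs. 1 and 2.
tally = Counter(
    (substitution_class(old, new), codon_position(site))
    for site, old, new in substitutions
)
for (kind, pos), count in sorted(tally.items()):
    print(f"{kind} at codon position {pos}: {count}")
```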

Fig. 1

Transition and transversion substitutions (y-axis) at each base position (x-axis), stratified by the three codon positions. Data from a survey of wild poliovirus sequences [2]

Fig. 2

Distribution of all types of transition and transversion substitutions at each codon position. Data from a survey of wild poliovirus sequences [2]

Phylogenetic signatures along the sequence interval are visualized by inspecting each type of substitution according to its phenotypic effect: 1) substitutions not leading to an amino acid change, namely synonymous transitions (As) and synonymous transversions (Bs), and 2) substitutions leading to an amino acid change, namely nonsynonymous transitions (Aa) and nonsynonymous transversions (Ba). The occurrence of these four signals can be visualized at each site along the sequence (Fig. 3). PAML estimates quantities for As, Bs, Aa, and Ba according to the nucleotide evolution model set in the control file. PoSE extracts the estimated substitutions and averages them over user-defined sliding sequence windows shifted by a user-defined step size. Total synonymous substitutions (dS = As + Bs) and total nonsynonymous substitutions (dN = Aa + Ba) are calculated to obtain the dN/dS ratio. The distribution pattern of As, Bs, Aa, Ba, and the dN/dS ratio is summarized in subsequent plots (Fig. 4). Window and step sizes can be changed dynamically in order to refine the plots or explore different parameters. Likewise, the dN/dS ratio is plotted to identify sequence regions under putative selection (Fig. 5). The last two plots displayed in PoSE show the cumulative number of As, Bs, Aa, Ba, and the total number of substitutions (Kt) along the sequence region in user-specified step sizes.
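The sliding-window calculation can be sketched in Python as follows. This is an illustration of the definitions given in the text (dS = As + Bs, dN = Aa + Ba) and not PoSE's actual code. The function name `sliding_dn_ds`, the per-site input lists, the choice to drop partial windows at the end of the sequence, and the choice to report an undefined ratio as `None` when dS is zero are all assumptions; the text does not say how PoSE handles these cases.

```python
def sliding_dn_ds(As, Bs, Aa, Ba, window, step):
    """Average As, Bs, Aa, Ba over sliding windows and compute dN/dS.

    Each input is a list with one value per site. Returns a list of
    (start_site, dS, dN, ratio) tuples, where start_site is 1-based,
    dS = mean(As) + mean(Bs) and dN = mean(Aa) + mean(Ba) in the window.
    ratio is None when dS is zero (assumption). Partial windows at the
    end of the sequence are dropped (assumption).
    """
    n = len(As)
    if not (len(Bs) == len(Aa) == len(Ba) == n):
        raise ValueError("all four series must have the same length")
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive integers")

    def window_mean(values, start):
        return sum(values[start:start + window]) / window

    results = []
    for start in range(0, n - window + 1, step):
        dS = window_mean(As, start) + window_mean(Bs, start)
        dN = window_mean(Aa, start) + window_mean(Ba, start)
        ratio = dN / dS if dS > 0 else None
        results.append((start + 1, dS, dN, ratio))
    return results


# Toy per-site values, invented for illustration only.
As = [1, 0, 2, 0, 0, 0]
Bs = [0, 0, 1, 0, 0, 0]
Aa = [0, 1, 0, 1, 0, 0]
Ba = [0, 0, 0, 0, 1, 0]

rows = sliding_dn_ds(As, Bs, Aa, Ba, window=3, step=3)
for start, dS, dN, ratio in rows:
    shown = "undefined" if ratio is None else f"{ratio:.2f}"
    print(f"window starting at site {start}: dS={dS:.2f} dN={dN:.2f} dN/dS={shown}")
```

Smaller step sizes give overlapping windows and smoother curves; larger windows trade positional resolution for less noisy ratios.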

Fig. 3

Count of synonymous and nonsynonymous substitutions at each base position. Transitions and transversions were plotted separately. Data from a survey of wild poliovirus sequences [2]

Fig. 4

Synonymous substitutions, nonsynonymous substitutions and dN/dS ratio over a sliding window. Upper plot: Dynamics of accumulation of synonymous and nonsynonymous substitutions along the sequence interval according to user-defined sliding windows. Lower plot: Estimated dN/dS ratios for each user-defined sliding window. Data from a survey of wild poliovirus sequences [2]

Fig. 5

dN/dS ratios estimated from different poliovirus data sets. Upper plot: Wild poliovirus (31 sequences) [2]. Middle: circulating vaccine-derived poliovirus (> 300 sequences) [3]. Lower plot: Immunodeficiency-related vaccine-derived polioviruses (8 sequences) [10]

Inferred nucleotide and amino acid substitutions along the path of the phylogenetic tree can be inspected by processing PAML’s codeml results (Fig. 6). To visualize the evolutionary pattern calculated by PAML, the tree is annotated with the inferred substitutions, and synonymous and nonsynonymous substitutions are displayed separately. The program offers the option of highlighting the substitution types inferred at the tips of the tree. The interactive menu allows further tracking of the substitution pattern along the tree; for example, tracking synonymous transitions in particular branches or exploring nonsynonymous substitutions for particular amino acid changes within a branch.
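To make the four categories used in the tree annotation concrete, the sketch below classifies a single-nucleotide codon change as As, Bs, Aa, or Ba. It is an illustration only and does not reproduce how PAML or PoSE assign categories: it assumes the standard genetic code, handles only codon pairs that differ at exactly one position, and uses hypothetical names (`classify_codon_change`, `branch_changes`) and invented branch labels.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def classify_codon_change(old, new):
    """Classify a one-nucleotide codon change as 'As', 'Bs', 'Aa', or 'Ba'.

    A = transition, B = transversion; s = synonymous, a = nonsynonymous.
    Raises ValueError if the codons do not differ at exactly one position.
    """
    diffs = [i for i in range(3) if old[i] != new[i]]
    if len(old) != 3 or len(new) != 3 or len(diffs) != 1:
        raise ValueError("expected two codons differing at exactly one position")
    pair = {old[diffs[0]], new[diffs[0]]}
    is_transition = pair in ({"A", "G"}, {"C", "T"})
    is_synonymous = CODON_TABLE[old] == CODON_TABLE[new]
    return ("A" if is_transition else "B") + ("s" if is_synonymous else "a")


# Hypothetical codon changes on two branches: (branch, ancestral, derived).
branch_changes = [
    ("branch 1", "CTA", "CTG"),  # Leu -> Leu, A<->G
    ("branch 1", "ATG", "ATA"),  # Met -> Ile, G<->A
    ("branch 2", "GGT", "GGA"),  # Gly -> Gly, T<->A
    ("branch 2", "AAA", "ACA"),  # Lys -> Thr, A<->C
]
for branch, old, new in branch_changes:
    aa_old, aa_new = CODON_TABLE[old], CODON_TABLE[new]
    print(f"{branch}: {old}->{new} ({aa_old}->{aa_new}) {classify_codon_change(old, new)}")
```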

Fig. 6

Annotation of a phylogenetic tree with inferred substitutions in external and internal branches. The sequences that have transitions at the second codon position are highlighted in red. For example, the nucleotide substitutions that lead to nonsynonymous mutations are reported as “18_1@Ts1_3@Ts2_2@Tv1”. This means that sequence #18 has one transition (Ts) at the first codon position, three Ts at the second codon position, and two transversions (Tv) at the first codon position
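Labels of this form can be decoded programmatically when an exported tree is processed outside PoSE. The Python sketch below infers the label grammar from the single example in the caption (a sequence identifier followed by underscore-separated `count@TypePosition` tokens); that grammar, the function name `parse_label`, and the assumption that sequence identifiers contain no underscores are not confirmed by the source.

```python
import re

# One token looks like "3@Ts2": count, '@', Ts or Tv, codon position 1-3.
TOKEN = re.compile(r"(\d+)@(Ts|Tv)([123])")


def parse_label(label):
    """Split a label such as '18_1@Ts1_3@Ts2_2@Tv1' into its parts.

    Returns (sequence_id, counts), where counts maps
    (substitution type, codon position) -> number of substitutions.
    The format is inferred from one example and may not cover all labels.
    """
    seq_id, *tokens = label.split("_")
    counts = {}
    for token in tokens:
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count, kind, position = match.groups()
        counts[(kind, int(position))] = int(count)
    return seq_id, counts


seq_id, counts = parse_label("18_1@Ts1_3@Ts2_2@Tv1")
print("sequence:", seq_id)
for (kind, position), count in sorted(counts.items()):
    print(f"  {count} {kind} at codon position {position}")
```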

PoSE exports two annotated phylogenetic trees in Newick format: one with inferred amino acid changes and another with the corresponding nonsynonymous nucleotide changes. In addition, PoSE generates two reports in Excel format documenting all the data displayed in the plots and in the phylogenetic tree, including a Markov matrix of relative frequencies of specific base changes.
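As a rough illustration of a matrix of relative frequencies of base changes, the Python sketch below counts inferred substitutions by ancestral and derived base and normalizes each row. It is an assumption-laden sketch and not PoSE's export routine: the function name `substitution_matrix` is hypothetical, normalizing by row (ancestral base) is a choice the source does not specify, and because only substitutions are counted the diagonal is zero, whereas a full Markov matrix in the sense of [4] would also account for unchanged sites.

```python
BASES = "ACGT"


def substitution_matrix(substitutions):
    """Build count and row-normalized frequency matrices of base changes.

    substitutions is an iterable of (ancestral base, derived base) pairs.
    Returns (counts, freqs) as nested dicts indexed [ancestral][derived].
    Each row of freqs sums to 1 when that ancestral base has at least one
    observed substitution, and is all zeros otherwise.
    """
    counts = {a: {b: 0 for b in BASES} for a in BASES}
    for old, new in substitutions:
        if old == new:
            raise ValueError("identical bases are not a substitution")
        counts[old][new] += 1
    freqs = {}
    for a in BASES:
        total = sum(counts[a].values())
        freqs[a] = {b: (counts[a][b] / total if total else 0.0) for b in BASES}
    return counts, freqs


# Toy substitutions, invented for illustration only.
observed = [("A", "G"), ("A", "G"), ("A", "C"), ("C", "T"), ("G", "A"), ("T", "C")]
counts, freqs = substitution_matrix(observed)

print("from\\to " + "     ".join(BASES))
for a in BASES:
    print(f"{a}       " + "  ".join(f"{freqs[a][b]:.2f}" for b in BASES))
```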

Discussion

PoSE is a new user-friendly MATLAB script that organizes and graphically displays results obtained from baseml and codeml, two programs included in the PAML software package. PAML can be run natively from the command line or through a GUI [5]. PoSE can quickly process large data sets (> 1,000 recorded substitutions). However, prior knowledge of the methods used in these two programs is highly recommended. For example, gaps or ambiguities in protein-coding sequence alignments may produce out-of-frame results. Future versions of PoSE will incorporate a single pipeline from PAML to PoSE, including scanning for potential gaps, ambiguities, or misalignments.

There are numerous software packages and bioinformatics resources for estimating genetic distances and inferring phylogenetic trees from sequence data. However, little is known about the nature and dynamics of accumulation of mutations over a sequence region. PoSE provides visualization of the actual changes occurring in phylogenetically related homologous sequences. For example, inspection of transition and transversion changes per site provides information about patterns in the mode of evolution [6, 7]. Graphical visualization of these changes also provides a finer level of granularity than that observed in sequence alignments, such as the detection of hypervariable regions with an increased number of nonsynonymous substitutions (Fig. 3).

The molecular clock model of evolution is of particular interest in studies of rapidly evolving viruses. The tempo and mode of virus transmission can be readily inferred from sequence data using well-known bioinformatics tools such as BEAST [8]. However, mutation saturation due to multiple substitutions per site can lead to underestimation of divergence dates [9]. By analyzing all unique nucleotide substitutions, PoSE scans for putative regions of sequence saturation and provides clues about potential substitutions involved in mutation saturation (Fig. 3).

Conclusions

PoSE is an ongoing bioinformatics project aimed at analyzing patterns of sequence evolution in protein-coding sequence alignments. It was developed from studies of large data sets of poliovirus genomes. Compatible with the rapid nature of RNA virus evolution, PoSE facilitates processing and visualization of very large data sets containing thousands of inferred nucleotide and amino acid substitutions. PoSE complements inference of genetic distances and phylogenetic trees by contributing detailed information about the nature, distribution, and dynamics of mutations in easy-to-grasp graphical representations.

Availability and requirements

Project name: PoSE.

Project home page: https://github.com/CDCgov/PoSE

Operating system(s): Windows and Mac (10.10–10.13).

Programming language: MATLAB.

Other requirements: MATLAB Runtime (available free of charge) for running the compiled version of PoSE.

License: GNU GPL.

Any restrictions to use by non-academics: none.

Abbreviations

Aa: nonsynonymous transitions

As: synonymous transitions

Ba: nonsynonymous transversions

BEAST: Bayesian Evolutionary Analysis by Sampling Trees

Bs: synonymous transversions

cVDPV: circulating vaccine-derived poliovirus

PAML: Phylogenetic Analysis by Maximum Likelihood

PoSE: Visualization of Patterns of Sequence Evolution

References

  1. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91.

  2. Jorba J, Campagnoli R, De L, Kew O. Calibration of multiple poliovirus molecular clocks covering an extended evolutionary range. J Virol. 2008;82(9):4429–40.

  3. Burns CC, Shaw J, Jorba J, Bukbuk D, Adu F, Gumede N, Pate MA, Abanida EA, Gasasira A, Iber J, et al. Multiple independent emergences of type 2 vaccine-derived polioviruses during a large outbreak in northern Nigeria. J Virol. 2013;87(9):4907–22.

  4. Allman ES, Rhodes JA. Mathematical models in biology: an introduction. New York: Cambridge University Press; 2004.

  5. Xu B, Yang Z. PAMLX: a graphical user interface for PAML. Mol Biol Evol. 2013;30(12):2723–4.

  6. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–26.

  7. Zhao K, Jorba J, Shaw J, Iber J, Chen Q, Bullard K, Kew OM, Burns CC. Are circulating type 2 vaccine-derived polioviruses (VDPVs) genetically distinguishable from immunodeficiency-associated VDPVs? Comput Struct Biotechnol J. 2017;15:456–62.

  8. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73.

  9. Duchene S, Di Giallonardo F, Holmes EC. Substitution model adequacy and assessing the reliability of estimates of virus evolutionary rates and time scales. Mol Biol Evol. 2016;33(1):255–67.

  10. Yang CF, Chen HY, Jorba J, Sun HC, Yang SJ, Lee HC, Huang YC, Lin TY, Chen PJ, Shimizu H, et al. Intratypic recombination among lineages of type 1 vaccine-derived poliovirus emerging during chronic infection of an immunodeficient patient. J Virol. 2005;79(20):12623–34.

Acknowledgments

The authors would like to thank Ms. Anita Gajjala (MathWorks) for her technical assistance with MATLAB coding. The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the Centers for Disease Control and Prevention. The use of trade names is for identification only and does not imply endorsement by the CDC or the U.S. government.

Funding

All work and publication costs were funded by the Centers for Disease Control and Prevention.

Availability of data and materials

PoSE installation files, example files and a user’s manual about PoSE can be found at https://github.com/CDCgov/PoSE

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 11, 2018: Proceedings from the 6th Workshop on Computational Advances in Molecular Epidemiology (CAME 2017). The full contents of the supplement are available online at https://0-bmcbioinformatics-biomedcentral-com.brum.beds.ac.uk/articles/supplements/volume-19-supplement-11.

Author information

Contributions

JJ conceived the software. KZ, EH, and KB designed algorithmic solutions and wrote the code. KZ and JJ wrote the manuscript. MSO, CCB, and JJ contributed to the experimental design and edited the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Kun Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Zhao, K., Henderson, E., Bullard, K. et al. PoSE: visualization of patterns of sequence evolution using PAML and MATLAB. BMC Bioinformatics 19 (Suppl 11), 364 (2018). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2335-7

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2335-7
