Skip to main content
  • Research article
  • Open access
  • Published:

Cluster oligonucleotide signatures for rapid identification by sequencing

Abstract

Background

Oligonucleotide signatures (signatures) have been widely used for studying microbial diversity and function in wet-lab settings, but using them for accurate in silico identification of organisms from high-throughput sequencing (HTS) data is only a proof of concept. Existing signature design programs for sequence signatures (signatures matching exactly one sequence) or clade signatures (signatures matching every sequence in a phylogenetic clade) are not able to identify all possible polymorphic sites for sequences with high similarity and perform poorly when handling large genome sequencing datasets.

Results

We introduce cluster signatures: subsequences that match perfectly and exclusively any group of sequences in a data set. Cluster signatures provide complete recall for primer/probe design and increased discrimination between sequences beyond that of clade signatures. Using cluster signatures for in silico identification of HTS targets achieves good precision/recall and running time performance. This method has been implemented into an open source tool, the Automated Oligonucleotide Design Pipeline (adop), included in supplementary material and available at: https://bitbucket.org/wenchen_aafc/aodp_v2.0_release.

Conclusions

Cluster signatures provide a rapid and universal analysis tool to identify all possible short diagnostic DNA markers and variants from any DNA sequencing dataset. They are particularly useful in discriminating genetic material from closely related organisms and in detecting deleterious mutations in highly or perfectly conserved genomic sites.

Background

Biodiversity research and survey require accurate identification of organisms from the environment, especially those of public concerns, e.g. quarantine species and select agents monitored by national biosafety and biosecurity programs. Identifying the sequences, e.g. DNA markers or genome regions, of concern in ecosystems is the fundamental strategy [1], especially in the metagenomics era which requires high-throughput processing without compromising accuracy and sensitivity.

A widely used strategy for taxonomic assignment of shotgun metagenomes or metabarcodes is to bin [2, 3] or cluster [4, 5] sequencing reads followed by comparing with reference databases using string search algorithms, which, as reviewed previously [6] either depend on alignment-based phylogenetic distances (homology-search), such as BLAST [7] and HMMER [813] or k-mer frequency profile (composition) comparison, such as USEARCH [4, 1420]. These algorithms are implemented in "off-the-shelf" suites for classification of HTS data, such as ShortBRED [21], PanPhlAn [22], MIDAS [23] and mOTU [24], developed for taxonomic classification or for identifying gene homologs from HTS data.

Aligners using BLAST to map reads are very precise, but with high computational cost, while composition-based programs and aligners using suffix-prefix tries are fast but can be imprecise, compounding errors present in most HTS techniques. For example, the average classification accuracy for all fragments of 16S rRNA genes longer than 100 bp was 70% using the Ribosomal Database Project (RDP) Classifier, a text-based Bayesian classifier [17]. A recent study using the same classifier could classify metabarcodes of the 16rRNA genes to family and genus levels with accuracy 75% or lower [25]. Sigma [6] and Pathoscope [10, 26] are systems developed for subspecies and strain-level inference of metagenomics data, but are not applicable to metabarcoding data, since DNA barcodes are known to lack discriminating power for many taxon lineages [2730]. For instance, the internal transcribed spacer 1 (ITS1) of Tilletia indica, a quarantine pathogen in many countries, and T. walkeri which is not regulated by most countries except for South Korea, differ only by two bases.

The Minimum Entropy Decomposition (MED) algorithm implemented in OligoTyping [31, 32] identifies information-rich polymorphic sites and iteratively partitions a set of metabarcodes to homogeneous operational taxonomic units (OTUs), until the Shannon entropy profile of a given node is converged or below a given threshold.

Eren et al. [31] stated that MED was able to discriminate taxa with less than 1% sequence variance and is computationally efficient. While MED is excellent in identifying distinct subgroups of a taxon adapted to specific environmental niches, it works best on abundant OTUs/taxa observed across diverse ecosystems, while many pathogens present as rare taxa in the environment. In addition, alignment is required prior to MED when differences in sequence length do not represent biologically meaningful variation, which can be a main constraint on efficiency when processing HTS reads not of the same length, e.g. quality trimmed Illumina data or 454 pyrosequencing data. While the discriminating positions identified by MED have the potential for strain-typing microoganisms, MED does not directly extract oligonucleotide signatures associated with these positions that may be used as primers or probes for the development of molecular diagnostic assays.

Oligonucleotide signatures (signatures) are short sequence strings (λ-mers) of fixed length (signature length λ), normally 18 to 100 bp, that match exactly and exclusively one or more targeted sequence(s) (targets) in a given genetic data set, usually from the same region of the genomes of targeted taxa. Most existing approaches only design sequence signatures, i.e. signatures for single sequencesFootnote 1 [3335] or a single group of sequences per run [36, 37] as reviewed previously [3840].

A few applications were developed to design signatures for pre-defined groups of genomes [41, 42], gene families [43] or clade signaturesFootnote 2, i.e. signatures for a single phylogenetic cladeFootnote 3 [38]. However, these applications either suffer from memory and runtime issues, or are part of larger, special purpose systems [39].

In addition, phylogenetic clades and other a priori groupings can be very restrictive to the identification of viable signatures, which may be caused by conflicting phylogenetic signals among loci shared by different taxonomic domains [44] as found in our own studies [45, 46]. This restrictiveness is further compounded by additional experimental constraints, such as primer/probe melting temperature [47] or Kane’s conditions [4850].

Signatures have wide applications in the biological field, such as being used as primers and probes in PCR and DNA-hybridization [40, 46, 5153] or lab-on-a-chip detection methods, as well as in targeted enrichment methods for focused high-throughput sequencing (HTS) [54, 55]. Kallisto [56] uses signatures (of length k: k-mers) from RNA-Seq reads to create k-compatibility classes, whose intersection represents the set of possible sequences matching a given read. Similarly, Salmon [57] builds equivalence classes over fragments of reads (in effect signatures), from which it infers statistically the relative abundance of transcripts. Kraken [58] infers the taxonomic classification for HTS reads by building a database of phylogenetic lowest common ancestors using clade signatures. We show further that a significant number of signatures cross clade boundaries. We also show that while signatures work very well on perfectly preserved reads, they are brittle to errors introduced by the HTS process.

Signatures have also been used to detect pathogenic microbes from metagenomics sequencing data. This theoretical approach, termed Electronic probe Diagnostic Nucleic acid Analysis (EDNA) [59, 60], shows promising research and diagnosis direction (75% precision on a mock database) using shotgun metagenomics data, but relies on a priori groupings of the training data set (reference genomes), and a priori differentiation against false positives identified using near neighbor comparisons in a reference database. EDNA also depends on an external program for signature design, the Tool for Oligonucleotide Fingerprint Identification (TOFI) [42], which introduces runtime efficiency constraints. A system that can streamline this process would be ideal as a regulatory tool in pest detection and management.

We present here a research tool for unrestricted design of signatures that can be used for the detection of any kind of groups (mutants, species/subspecies, or any a priori groupings) in a wide range of molecular biology assays or DNA sequence data for in silico probing.

Methods

The usefulness of signatures is based on the low probability of accidental match between a signature and unrelated genetic material. The probability p of an accidental match (collision) between two 4-base nucleotide strings of length λ (4λ possibilities) in a data set of size N can be modeled by the birthday formula [61]:

$$ p = \frac{4^{\lambda} !}{4^{\lambda N} \left(4^{\lambda}-N\right)!} \gtrapprox 1 - e^{-\frac{N^{2} }{ 2 \cdot 4^{\lambda}}} \textrm{ for} N \ll 4^{\lambda} $$
(1)

Accidental matches between portions of nucleotide strings occur in random genetic material when there is no taxonomic or functional relationship between the query and testing sequences. For example, it is likely to encounter the subsequence of length λ=4 “ACGT” multiple times in different sequences in any given large reference database. The birthday formula quantifies the probability of such accidental matches.

Assuming an uniform distribution of nucleotides and signature length λ=36, p<10−4 applies for data sets with N<2.94×1010 nucleotidesFootnote 4 (approximately 274 GiB of unaligned FASTA filesFootnote 5). In practice, data sets of taxonomically related DNA have a higher degree of similarity between sequences, which increases the probability that any two identical subsequences have a taxonomic or functional relationship between them and do not represent accidental matches. Unless explicitly specified, signatures of length λ=36 are used for analyses in this study.

Clusters

We introduce an extension of clade signatures: for a given set of sequences, a cluster is a group of sequences for which at least one signature (cluster signature) can be found, that matches all sequences in the group but does not match any sequences not in the group.

Notably, clusters are not required to represent the same groups as those in phylogenetic clades; they are any groups of sequences for which signatures can be found, as opposed to clade signatures for predefined phylogenetic groups. Any subsequence of length λ of any sequence is a signature for exactly one cluster, i.e. a cluster signature.

It is not obvious how to predict the number of clusters expected for a given data set of taxonomically related sequences. For example, S identical sequences of length Li>λ where 1≤iS, will generate exactly 1 cluster. Since any subsequence of length λ from any sequence can be found in every other sequence, the single cluster will contain all sequences. By contrast, a data set with size N4λ containing S sequences randomly generated using a uniform distribution of nucleotides will have S clusters of signatures of length λ. Each such cluster will contain one sequence, since it is very likely that every subsequence of length λ in any sequence is a sequence signature: it will not be found anywhere else in the data set p<10−4 (Eq. 1).

The automated oligonucleotide design pipeline

We have built an open source tool, aodp, the Automated Oligonucleotide Design Pipeline (aodp v.2.5 is included in the supplementary material), which generates efficiently signatures for sequences, clades and clusters by enumerating all λ-mers (signatures) for each sequence in a given data set. The list of originator sequences is collected for each enumerated signature. All distinct sets of originator sequences for all signatures form the list of clusters for the data set. Facilities for enumerating and cross-referencing signatures and clusters are provided.

Furthermore, aodp can be used to find the closest matching sequences from a training set to a query sequence, assumed to be an imperfectly recovered portion of an unknown sequence (such as an HTS read) by computing the union of all sequences in all clusters matching any portion of the HTS read and then heuristically eliminating all but sequences that explain the largest portion of the HTS read. All remaining sequences are then compared to the HTS read and only the ones with the highest overlap score are kept.

More formally, the matching algorithm 1 works as follows: first, compute a set of matching clusters Ω for each query sequence Q (loop 1): we observe later (Table 3) that the set of all training sequences contained in all matching clusters Ω has average size \(\overline {\Omega } < S\) smaller than the size of the training set; second, minimize a subset (kernel) Ψ of Ω (loop 2): we observe that the average size of the kernel \(\overline {\Psi } \ll S\) is much smaller than the size of the training set; and, finally, compute the sequence similarity of each training sequence in the kernel Ψ against the query sequence (loop 4) using a global alignment algorithm [62].

The result (the set of mapped HTS reads) is the subset of sequences of Ψ which maximize the alignment score to the query sequence.

The main objective of the algorithm is to minimize the number of computationally expensive global alignments (step 5). The complexity has no direct dependency on the size of the training set: loops 1 and 3 have complexity O(|Q|) linear in the size of the query sequence and loop 2 has complexity O(|Ω|) linear in the number of clusters matching the query sequence, which can be further reduced at the implementation level through the elimination of repeated set operations in loop 3.

A general limitation of algorithms for matching HTS reads (including our method) is that metabarcoding regions used in HTS do not always have sufficient discriminating power to differentiate very closely related species [28] represented by clades in a training dataset containing almost identical sequences, which, however, belong to multiple valid species. In this case, all matched reference sequences are given to a query sequence, and it should be the users’ decision if alternative DNA markers or wet lab molecular diagnostic assays should be used to confirm or validate the existence of targeted taxa of interest.

Data sets

Data sets for four important mycotoxin genera (Alternaria, Aspergillus, Claviceps and Penicillium) were built using the following methodology: internal transcribed spacer rDNA region (ITS) data sets were compiled from GenBank [63] using ex-type sequences as backbone when available and building up the database from additional trustworthy taxonomic reviews [64, 65, 6568]. The data sets were aligned in MAFFT v. 7.305b [69], using the G-INS-i algorithm and trimmed manually in Geneious v. 8.1.8. Neighbour-Joining trees were calculated in PAUP* v. 4.0b10 [70].

Reference ITS sequences for fungi (Anisogramma, Ceratorhiza, Ceratocystis, Colletotrichum, Coniella, Diaporthe, Fusarium, Elsinoe, Talaromyces, Tillletia), oomycetes (Peronospora), as well as the 16S rRNA genes of a bacterium (Pectobacterium) were downloaded from GenBank. The ITS dataset Phytophthora was obtained from [46]. The sequences for each dataset were aligned using the G-INS-i algorithm in MAFFT [69], and trimmed manually in BioEdit v.7.2.5 [71]. The approximate maximum likelihood trees were reconstructed using FastTree v.2.1.8 [72].

Each data set contains DNA sequences and a phylogenetic tree with the sequences as leaf nodes. The data sets were combined into a sequence database 17DataSets, provided as supplementary material. Sequences with more than five ambiguous bases were removed from each data set. The characteristics of each data set are summarized in Table 1.

Table 1 Data sets included in database 17DataSets

The distribution of cluster size and number of cluster signatures was also studied on a much larger dataset (Unite; included in the supplementary material) of 271,017 sequences fully identified down to the species level and which include an authoritative Latin binomial name for each species. The data set was extracted from the UNITE+INSD database released by the User-friendly Nordic ITS Ectomycorrhiza Database (UNITE, version 7.1Footnote 6), [73]. A phylogenetic tree was automatically built from the Unite taxonomy using tax2nwk, a companion utility of aodp.

The sequence matching functionality was evaluated using a training set of 1,338 mycotoxin sequences (4Mycotoxins; included in the supplementary material) by combining the data sets Alternaria, Aspergillus, Claviceps and Penicillium. Sequences from each data set not classified to the principal genus of the data set and/or with more than five ambiguous bases were eliminated.

The precision and recall of the matching algorithm were evaluated using a testing set 4MicotoxinsBootstrap bootstrapped from 4Mycotoxins: subsequences of exactly |Q|=100 bp starting at a random position were extracted from each sequence. In each subsequence, each nucleotide was modified to another nucleotide or a gap. Individual modifications were made at one of six error rates: ε{0.00, 0.01, 0.02, 0.03, 0.04, 0.05 }. For each sequence and each error rate, 10 subsequences were generated. A total of 80,280=1,338×6×10 query sequences were generated. All random choices were drawn from uniform distributions driven by a Mersenne twister [74], seeded with a high resolution timestamp.

The efficiency of the matching algorithm was evaluated on a testing set 97AerobiotaSamples containing 4,713,791 sequences (sequence length |Q|≈436bp±55; only sequences at least 325 bp long are selected) from a data set deposited in the Sequence Read Archive (SRA) under project accession number PRJNA358221. The error rate assigned to the data set was ε=0.01 [75].

Comparisons with other algorithms

We have compared the computational efficiency of our matching algorithm with BLAST+ v.2.6.0 [7] and USEARCH v10.0.240_i86linux32 [4] testing on the 97AerobiotaSamples data set and using the 4Mycotoxins reference data set. All test runs were conducted on a system with Intel Core i7-3632QM CPU 2.20GHz ×8 running Ubuntu 16.04.

The following parameters were used for BLAST: “-word_size 11 -outfmt 6 -num_threads 8 -evalue 10 -max_target_seqs 100”.

The following parameters were used for USEARCH: “-usearch_global -strand plus -id 0.98 -maxaccepts 256 -maxrejects 1024 -wordlength 8 -blast6out”.

In all instances, output was ignored (redirected to /dev/null) in order to eliminate I/O contention.

Separately, we compared precision and recall (Eqs. 2 and 3) of our matching algorithm with USEARCH, on the 4MycotoxinsBootstrap dataset using the 4Mycotoxins reference set.

For USEARCH we used the following parameters: “-usearch_global -strand plus -wordlength 8 -blast6out”. Additionally, the “-id” parameter was set to 1−2ε to correspond to the error rate of the data set, “maxaccepts” χ was varied for different runs χ{4,16,64,256,1024} and “maxrejects” was set to 32×χ.

For both USEARCH and aodp, the match between a query sequence and a training sequence was considered correct if it is returned by the tool, and it has the highest percentage overlap compared to all other matching training sequences.

Results

Large scale dependencies for the number of clusters were measured on the data set Unite. Most clusters have a relatively small number of sequences (Fig. 1): 85% have less than 100 sequences, 50% have less than 10 sequences and approximately 15% have one sequence (signable sequences). Clusters have a relatively small number of signatures (Fig. 2): 65% have less than 10 signatures and almost 30% have exactly one signature.

Fig. 1
figure 1

Distribution of number of sequences per cluster (cumulative percentage), data set Unite, λ=36

Fig. 2
figure 2

Distribution of number of signatures per cluster (cumulative percentage), data set Unite, λ=36

Other dependencies for the number of clusters are measured on the 17DataSets database. The number of clusters c is found to be comparable with the number of phylogenetic clades n=S+i in each of the data sets (0.7 ≤c/n≤ 3.6). Power law dependencies on the size of the data set N for the number of clusters c and number of signable clades n are indicated by a log-log plot (Fig. 3). A power law dependency of the number of clusters c on the number of signable clades n is indicated by regression lines.

Fig. 3
figure 3

Dependency of the number of clusters (groups of sequences for which at least one signature can be found) and number of signable clades (phylogenetic clades to which oligonucleotide signatures can be assigned) on number of sequences within each dataset (database 17DataSets, λ=36)

The dependency of the number of clusters and signable clades on signature size 12≤λ≤252 (increments of 4 nucleotides) is measured for the data set Penicillium (Fig. 4). The number of clusters c decreases rapidly with the signature length λ, because of the further reduction of the number of signatures in each cluster. The number of signable clades is relatively stable (slow initial increase).

Fig. 4
figure 4

Dependency of the number of clusters (groups of sequences for which at least one signature can be found) and number signable clades (phylogenetic clades to which oligonucleotide signatures can be assigned) on signature length, data set Penicillium

Clusters for probe design

Characteristics of clusters, signable clades and signable sequences were calculated in aggregate for all data sets and reported in Table 1. An incidence matrix for sequences (vertical axis) against clusters (horizontal axis) for the Ceratorhiza data set is shown in Fig. 5. Signed sequences and internal signable are grouped in regions to the left of the figure.

Fig. 5
figure 5

Sequence by cluster incidence matrix for the Ceratorhiza data set (λ=36). Each row contains cluster matches associated with the sequence with numeric identifier on the y-axis (a fingerprint of the sequence). Each column represents sequences contained in a given cluster. Signable sequences and signable internal clades are at the left. The remaining (bare) clusters are at the right

The number of signable clades is smaller than the number of clades n<n, in some cases substantially, for example n/n=17% for Alternaria.

The number of sequences s0 that are not contained in any signable clades can be substantial. This likely indicates a data set with high degree of similarity between sequences in different phylogenetic clades.

For example, 50% of sequences in the Fusarium data set are not contained in any signable clade. For the Ceratorhiza data set, seven sequences (0, 4, 6, 8, 14, 16 and 17) have no signable clades (18% of the total).

Since every subsequence of length λ of every sequence is a cluster signature, every sequence is a member of at least one cluster. In other words, clusters provide signatures for every sequence of a data set (complete recall).

Each sequence has an associated cluster pattern (fingerprint) in the sequence-to-cluster incidence matrix (Fig. 5). This pattern may be unique for the sequence or can be shared with other sequences. For example, sequence 36 has a unique cluster pattern, but sequences 22, 23 and 24 have identical patterns. We call the number sc of unique row patterns in the incidence matrix, unique cluster patterns.

If only sequence signatures are taken into account, all signable sequences can be uniquely and trivially identified by a sequence signature (a cluster of size 1). In other words, the number ss of sequences that can be uniquely identified using sequence signatures is the number of signable sequences.

If taking into account signable clades and signable sequences, additional sequences may be uniquely identified from their signable clade pattern. For example, sequence 27 can be uniquely identified using its clade signature pattern, although it could not be identified using its sequence signature (it does not have one). We call the number sn of such sequences, unique signable clade patterns.

For each data set, the quantities ss, sn and sc were computed and the percent change δc=scsn, which we call discrimination attributable to clusters. For all data sets except Talaromyces, δc>0. Substantial gains can be seen, for example, for Ceratocystis (35%), Diaporthe (29%) and Fusarium (26%).

For the two data sets with the highest ratio less than 100% of unique signable clade patterns sn: Anisogramma (sn=96%) and Coniella (77%), the ratio of unique cluster patterns sc increases to 100%.

Compared to using only phylogenetic clade signatures, where some sequences do not appear in any signable clade (s0>0 in most cases) recall (Eq. 3) is always 100% when using clusters (since every sequence is a member of at least one cluster). Selectivity is also increased since more sequences can be differentiated through unique cluster patterns vs. unique signable clade patterns (δc>0 in most cases).

Consequently, the design of wet lab probes based on cluster signatures can improve recall and selectivity compared to designing only sequence or clade signature probes.

Clusters for high-throughput sequencing

Precision (Eq. 2), recall (Eq. 3) and the F-measure (Eq. 4) for the matching algorithm 1 were evaluated using the 4Mycotoxins training set and the bootstrapped 4MicotoxinsBootstrap testing setFootnote 7 for signature lengths λ{8, 16, 24, 32, 40 }. Identification was considered correct if for a query sequence its originator sequence from the training set is reported among the sequences with maximum similarity α over a threshold (Eq. 5) dependent on the error rate ε.

$$\begin{array}{*{20}l} \text{precision} & = \frac{\textrm{correctly identified}}{\textrm{total identified}} \end{array} $$
(2)
$$\begin{array}{*{20}l} \text{recall} & = \frac{\textrm{correctly identified}}{\textrm{total number of terms}} \end{array} $$
(3)
$$\begin{array}{*{20}l} F & = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \end{array} $$
(4)
$$\begin{array}{*{20}l} \alpha \geq 1 - 2\epsilon \end{array} $$
(5)

Precision and recall was found to decrease with the increase of the error rate ε and signature length λ. Precision was consistently above 0.9 for λ≥16. Recall degraded below 0.5 for higher error rates.

Imperfect recall is due to two factors: “crowding” of defects in a query sequence to the point where there are no preserved subsequences of length λ, and to query sequences that have more errors, failing the similarity threshold (Eq. 5).

Imperfect precision is due to errors in the query sequence leading to accidental matches and higher similarity scores (α) for sequences in the training set other than the originator sequence. This is likely to happen in training sets with high degree of similarity between sequences, for example, when an error may coincide with a single nucleotide polymorphism site.

Precision and recall for the bootstrapped test sets can be applied to only matching sequences in real test sets, where a substantial portion of the data may be unrelated to the training set.

Precision and recall are driven by the size and nature of the training set and the statistical properties of the error-introducing mechanism.

Precision and recall were compared on the same data sets with USEARCH by varying the maxaccepts parameter χ{4,16,64,256,1024}.

We notice that USEARCH outperforms aodp for the highest value of χ=1024 on the combined F measure (Eq. 4), however aodp outperforms USEARCH for smaller values of χ and at lower error rates ε≤0.03. Moreover, the values of χ and the related USEARCH parameter “maxrejects” must be chosen a priori. The optimal value of this parameter likely depends on the degree of similarity of sequences within the training set. For example, for 4Mycotoxins, the optimal value is in range of the total number of sequences (1,338).

For aodp, the optimal value of λ can be chosen based on the error rate of the testing set. The set of matching sequences self-calibrates to the size of the matching clusters.

The computational efficiency of the matching algorithm was measured on a realistic test set 97AerobiotaSamples, for different values of the signature length λ{16,24,32,40}.

The number μ98 of matching query sequences (query sequences with similarity α≥1−2ε=0.98 to at least one training sequence) is relatively stable for different values of λ. The matching kernel ΨΘ is a close approximation of the result set.

The set of all training sequences contained in all matching clusters Ω has average size \(\overline {\Omega } < S = \textrm {1,338}\) smaller than the size of the training set. The average size of the kernel \(\overline {\Psi } \ll S\) (average number of alignments) is much smaller than the size of the training set, which shows that using clusters and reducing the matching sequence kernel are effective in reducing the number of alignments, and consequently running time compared to a brute force approach that would align every query sequence to every sequence in the training set.

Running time is dependent on the number of alignments (\(\overline {\Psi }\)), and to the set of training sequences in matching clusters (\(\overline {\Omega }\)): loop at line 2 in algorithm 1.

Running time degrades to impractical values for lower values of λ (estimated at over 200h for λ=8). This is due to very large values of \(\overline {\Omega }\) (overfitting) from large number of clusters in the training set at low signature length (Fig. 4) and high likelihood of accidental matches between short signatures and query sequences (Eq. 1).

Running time is also measured for BLAST and USEARCH on the same data sets. aodp outperforms BLAST by one order of magnitude. Running time is also faster, if comparable to USEARCH.

Discussion

The number of clusters is larger than the number of signable clades (Fig. 4), but comparable to the total number of clades. Within experimental constraints, it is feasible to design signatures for each cluster in a data set. Cluster signatures offer increased discrimination compared to sequences or clades signatures.

The number and composition of clusters is an objective property of a given data set. Conversely, phylogenies can be subjective when prepared by human taxonomists or inaccurate when automatically built using specific heuristics, in some cases with subjective parameters.

Most clusters have a small number of signatures (are brittle to additional experimental constraints) and a small number of sequences (have focused discrimination). To achieve optimal discrimination for clusters, signature length should be chosen as small as practical above the lower limit imposed by the birthday formula (Eq. 1).

Clusters provide signatures for every sequence in a data set (complete recall).

This makes it practical to design probes that identify DNA sequences from data sets with very closely related material, where some of the sequences may not be represented in any of the signable clades. Unique cluster patterns associated with sequences (Fig. 5) can help uniquely identify sequences from a data set, beyond the ability of unique signature clade patterns, in some cases for 100% of the sequences.

Cluster signatures can be used as clues for identifying partial, imperfectly copied query sequences (such as produced by HTS) against a training set of reference sequences. Combined with a global alignment algorithm for comparing candidate sequences from the matching sequence kernel of a set of matching clusters, a matching algorithm (algorithm 1) achieves good matching precision and recall for test sets of different quality (Table 2).

Table 2 Precision and recall of our matching algorithm (aodp) and USEARCH using the 4Mycotoxins training set and the 4MicotoxinsBootstrap testing set

Using a set of matching clusters Ω to the query sequence significantly reduces the number of pairwise comparisons (\(\overline {\Omega }\) Table 3) compared to the brute force approach. Reducing to a kernel of matching sequences \(\overline {\Psi }\) further decreases the number of alignments and provides good running time performance, with dependence on the number and size of clusters in the training database, but not on the actual size of the training database.

Table 3 Performance of the matching algorithm using the 4Mycotoxins training set (1,338 sequences) and the 97AerobiotaSamples testing set by signature length λ

Increasing the signature length λ generally increases the precision and decreases the running time of the matching algorithm, but decreases the recall, even to unsatisfactory values (Table 2) for testing sets with high error rates ε. However, lower recall values (e.g. at or below 50%) may be acceptable when the assertion of existence of the target and not the accuracy in abundance was the objective of the investigation. Also, sequencing read accuracy at or above 98% (ε≤0.02) is provided by the majority of HTS techniques, although sometimes through building consensus [76].

Choosing very small values for the sequence length (λ<16) leads to overfitting.

Additional thermodynamic constraints such as the elimination of homopolymer regions and filtering on melting temperature [47, 77] also apply to the design of signatures for assay development. Because of the variability of study objectives and experimental conditions, thermodynamic constraints have not been taken into account in analyses in this study, although support is built into aodp.

Results reported in the current study were drawn primarily from a wide variety of fungal groups, with a focus on plant pathogen and mycotoxin producers. We are confident that these can be generalized to include sequences from other organisms.

Since clusters do not rely on phylogenetic assumptions, but may only coincide with phylogenetic clades, there is no direct dependency of cluster parameters on a specific phylogeny. The phylogeny independent clusters can be particularly useful when it it important to follow some specific DNA sequences such as resistance point mutations or horizontally acquired genes.

A comprehensive study on a variety of data sets with different length distributions and systematically varied completeness and diversity may provide further insights (future research). Power law goodness-of-fit tests such as the Kolmogorov-Smirnoff statistic [78] for the dependencies of the number clusters and signable clades on number of sequences using a larger number of data sets may help quantify results.

Precision and recall were evaluated on a data set (4MicotoxinsBootstrap) of 100bp sequences with randomly-introduced errors at given error rates from a source data set (4Mycotoxins). While this does not account for variable read length generated by different HTS methods or for non-random defects, such as homopolymer errors or issues related to palindromic sequences, precision and recall targeted to specific methods can be modeled into the defect-introducing mechanism.

On precision and recall (F measure), our matching algorithm outperforms USEARCH for lower values of “maxaccepts” (χ≤256) and lower error rates (ε≤0.03). This likely happens in situations of closely related portions of training sequences, of which a large number (possibly overlapping a signature cluster) are equally similar to the training sequence. By imposing a limit on the size of the result set, the source sequence may or may not be included in the first χ USEARCH matches. Similar behaviour can be expected when varying the “max_target_seqs” in BLAST.

USEARCH outperforms our algorithm for higher values of χ (in range of the number of training sequences), but this parameter is dependent on the degree of similarity of the training set and must be chosen a priori. Always choosing large values may be impractical, since the size of the result sets increase dramatically with higher values of χ.

Precision and recall were only evaluated on the bootstrapped data set, where introduced errors could be traced back to the originator sequence (ground truth) for comparison with the reported matches. Such a source of ground truth could not be easily derived for the larger 97AerobiotaSamples data set since the query sequences are, by definition, unknown and the number of matching terms may be too small to draw statistically sound conclusions: est. 1,300 vs. 80,280 terms for 4MycotoxinsBootstrap. Future research may look at comparisons on a real data set with a substantial number of matches with matches reported by BLAST as a source of ground truth. An additional complication for such a study may be the need to choose a high “max_target_seq” parameter to cover all possible matching training sequences, resulting in very high running times.

Our matching algorithm outperforms BLAST on running time by one order of magnitude on a realistic data set, with a "max_target_seqs" χ=100. BLAST performance degrades much further for higher values of χ.

aodp also outperforms USEARCH on running time. This may be due to the nature of the testing set: approximately 1,300 matches in about 4.5 million reads, which may be reasonable for targeted studies of environmental samples, but may not hold for other types of investigations.

On a highly parallel system (80 hardware threads; results unreported), the difference increases to two orders of magnitude for BLAST and increses further for USEARCH because of good multithreading scalability of aodp.

The number and size of clusters are fixed parameters of a given database and represent the main drivers for the running time of the matching algorithm. In situations where short running time is essential without access to large computational resources, running time may be shortened by increasing the signature length (λ) at the expense of recall, e.g. for a preliminary “quick” run, or the size of the source database can be reduced (which will lead to a reduction of the number and size of clusters).

Further improvement of the running time for aodp may be achieved by a more efficient implementation of the global alignment algorithm (step 5 in algorithm 1), such as using nucleotide k-mers, or alignment clues from the positioning of the matching signatures.

Conversely, cluster signatures could be used in a preprocessing step to quickly eliminate or identify candidate matching sequences, to be further validated using a matching algorithm with different objectives.

The study of the signability of other groupings such as gene function may be useful.

Another promising avenue of future research may be the study of cluster signatures for genetic variants in guiding the detection of mutations relevant to evolution, genetic diseases or rapid comparisons of genomes between tumors and healthy cells.

Conclusions

In this study, we evaluated the statistical properties of cluster signatures (oligonucleotide signatures for groups of sequences in data sets of DNA sequences) and their use for mass identification by sequencing.

Our method is universal as it can find oligonucleotide signatures for unique strains, species, higher level phylogenetic clades or mutations linked to genetic diseases or genetic abnormalities. Once diagnostic cluster signatures are known, rapid analysis tools for detection of high risk species, strains or mutations can be developed.

Our matching algorithm using signature clusters increases the efficiency of matching HTS reads against data sets of reference genetic material compared to string alignment methods (orders of magnitude faster than BLAST) and even outperforms high performance k-mer string search algorithms (such as USEARCH) for realistic environmental sample studies.

The matching algorithm also maintains good precision and recall compared to less sensitive string search methods and even outperforms USEARCH for reasonably high settings of the “maxaccepts” value on data sets with lower error rates (ε≤0.03).

The matching algorithm does not rely on a-priory selection of a parameter limiting the result setlength, such as “max_target_seqs” for BLAST or “maxaccepts” for USEARCH, but self-calibrates to the size of the set of matching clusters of each query sequence.

Using cluster signatures improves recall and accuracy of existing in vitro methods of identification, especially for data sets containing closely related genetic material, without needing to rely on a priori hierarchical phylogenetic grouping.

Cluster signatures and the aodp utility can increase the sensitivity and accuracy of PCR-based and DNA hybridization-based experiments compared to traditional methods based on sequence or phylogenetic clade signatures. Cluster signatures can also be used for targeted enrichment-based HTS, developing accurate, sensitive and efficient diagnostic tools for in vivo or in silico detection of high-risk pathogens or mutation of genes linked to genetic disorders or tumors, using genomics, genetics and metagenomics sequencing data.

Notes

  1. Signable sequence: sequence with at least one sequence signature.

  2. Signable clade: phylogenetic clade with at least one clade signature.

  3. Following phylogenetic tree terminology, we call clades with one sequence or with only identical sequences leaf clades and clades with more sequences internal clades.

  4. Since λ is generally much smaller than the size of each sequence, the total number of λ-mers can be approximated by the number of nucleotides in the database.

  5. This may represent the union of reference and query databases covering a single genome, a set of genomes, or a set of single DNA markers or multiple loci shared by different taxonomic lineages.

  6. https://unite.ut.ee/sh_files/UNITE_public_20.11.2016.fasta.zip

  7. The bootstrapped test set 4MicotoxinsBootstrap does not provide true negative examples.

References

  1. National Research Council. Sequence-based Classification of Select Agents: a Brighter Line. Washington: National Academies Press; 2010.

    Google Scholar 

  2. Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J. 2017; 15:48–55.

    Article  CAS  Google Scholar 

  3. Lin H-H, Liao Y-C. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci Rep. 2016; 6:24175.

    Article  CAS  Google Scholar 

  4. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010; 26(19):2460–1.

    Article  CAS  Google Scholar 

  5. Fu L, Niu B, Zhu Z, Wu W. Sitao amd Li: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2.

    Article  CAS  Google Scholar 

  6. Ahn T-H, Chai J, Pan C. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics. 2015; 31(2):170–7.

    Article  CAS  Google Scholar 

  7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.

    Article  CAS  Google Scholar 

  8. Eddy SR. Multiple alignment using hidden Markov models. In: ISMB, vol. 3: 1995. p. 114–20.

  9. Poulsen TM, Frith M. Variable-order sequence modeling improves bacterial strain discrimination for ion torrent dna reads. BMC Bioinformatics. 2017; 18(1):299.

    Article  Google Scholar 

  10. Hong C, Manimaran S, Shen Y, Perez-Rogers JF, Byrd AL, Castro-Nallar E, Crandall KA, Johnson WE. Pathoscope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2014; 2(1):33.

    Article  Google Scholar 

  11. Haque M, Ghosh TS, Komanduri D, Mande SS. Sort-items: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009; 25(14):1722–30.

    Article  Google Scholar 

  12. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. Genome Biol. 2011; 12(1):11.

    Article  CAS  Google Scholar 

  13. Nguyen N-p, Mirarab S, Liu B, Pop M, Warnow T. Tipp: taxonomic identification and phylogenetic profiling. Bioinformatics. 2014; 30(24):3548–55.

    Article  CAS  Google Scholar 

  14. Rosen G, Garbarine E, Caseiro D, Polikar R, Sokhansanj B. Metagenome fragment classification using n-mer frequency profiles. Adv Bioinformatics. 2008; 2008.

  15. Rosen GL, Reichenberger ER, Rosenfeld AM. Nbc: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2010; 27(1):127–9.

    Article  Google Scholar 

  16. Lan Y, Wang Q, Cole JR, Rosen GL. Using the rdp classifier to predict taxonomic novelty and reduce the search space for finding novel organisms. PLoS ONE. 2012; 7(3):32491.

    Article  Google Scholar 

  17. Wang Q, Garrity GM, Tiedje JM, Cole JR. Na ive bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007; 73(16):5261–7.

    Article  CAS  Google Scholar 

  18. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009; 75(23):7537–41.

    Article  CAS  Google Scholar 

  19. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010; 6(2):1553–7358.

    Article  Google Scholar 

  20. MacDonald NJ, Parks DH, Beiko RG. Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Res. 2012; 40(14):1362–4962.

    Article  Google Scholar 

  21. Kaminski J, Gibson MK, Franzosa EA, Segata N, Dantas G, Huttenhower C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol. 2015; 11(12):1–22.

    Article  Google Scholar 

  22. Scholz M, Ward DV, Pasolli E, Tolio T, Zolfo M, Asnicar F, Truong DT, Tett A, Morrow AL, Segata N. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods. 2016; 13:435–8.

    Article  CAS  Google Scholar 

  23. Nayfach S, Rodriguez-Mueller B, Garud N, Pollard KS. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 2016; 26:1612–25.

    Article  CAS  Google Scholar 

  24. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, Rasmussen S, Brunak S, Pedersen O, Guarner F, de Vos WM, Wang J, Li J, Doré J, Ehrlich SD, Stamatakis A, Bork P. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013; 10:1196–9.

    Article  CAS  Google Scholar 

  25. Bacci G, Bani A, Bazzicalupo M, Ceccherini MT, Galardini M, Nannipieri P, Pietramellara G, Mengoni A. Evaluation of the performances of ribosomal database project (RDP) classifier for taxonomic assignment of 16S rRNA metabarcoding sequences generated from Illumina-Solexa NGS. Journal of genomics; 3:36–39.

  26. Francis OE, Bendall M, Manimaran S, Hong C, Clement NL, Castro-Nallar E, Snell Q, Schaalje GB, Clement MJ, Crandall KA. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res. 2013; 23(10):1721–9.

    Article  CAS  Google Scholar 

  27. Raja HA, Miller AN, Pearce CJ, Oberlies NH. Fungal identification using molecular tools: a primer for the natural products research community. J Nat Prod. 2017; 80(3):756–70.

    Article  CAS  Google Scholar 

  28. Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Lévesque CA, Chen W. Fungal Barcoding Consortium: Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci USA. 2012; 109:6241–6.

    Article  CAS  Google Scholar 

  29. Somervuo P, Douglas WY, Xu C, Ji Y, Hultman J, Wirta H, Ovaskainen O. Quantifying uncertainty of taxonomic placement in dna barcoding and metabarcoding. Methods Ecol Evol. 2017; 8:398–407.

    Article  Google Scholar 

  30. Xu J. Fungal dna barcoding. Genome. 2016; 59(11):913–32.

    Article  CAS  Google Scholar 

  31. Eren AM, Maignien L, Sul WJ, Murphy LG, Grim SL, Morrison HG, Sogin ML. Oligotyping: differentiating between closely related microbial taxa using 16s rrna gene data. Methods Ecol Evol. 2013; 4(12):1111–9.

    Article  Google Scholar 

  32. Eren AM, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML. Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J. 2015; 9(4):968–79.

    Article  CAS  Google Scholar 

  33. Nordberg EK. YODA: selecting signature oligonucleotides. Bioinformatics. 2005; 21(8):1365–70.

    Article  CAS  Google Scholar 

  34. Wernersson R, Nielsen HB. Oligowiz 2.0-integrating sequence feature annotation into the design of microarray probes. Nucleic Acids Res. 2005; 33(suppl 2):611–615.

    Article  Google Scholar 

  35. Lee HP, Sheu T-F, Tang CY. A parallel and incremental algorithm for efficient unique signature discovery on DNA databases. BMC Bioinformatics. 2010; 11(1):132.

    Article  Google Scholar 

  36. Ashelford KE, Weightman AJ, Fry JC. PRIMROSE: a computer program for generating and estimating the phylogenetic range of 16S rRNA oligonucleotide probes and primers in conjunction with the RDP-II database. Nucleic Acids Res. 2002; 30(15):3481–9.

    Article  CAS  Google Scholar 

  37. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Buchner A, Lai T, Steppi S, Jobb G, F orster W. ARB: a software environment for sequence data. Nucleic Acids Res. 2004; 32(4):1363–71.

    Article  CAS  Google Scholar 

  38. Chung W-H, Rhee S-K, Wan X-F, Bae J-W, Quan Z-X, Park Y-H. Design of long oligonucleotide probes for functional gene detection in a microbial community. Bioinformatics. 2005; 21(22):4092–100.

    Article  CAS  Google Scholar 

  39. Bader KC, Grothoff C, Meier H. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics. 2011; 27(11):1546–54.

    Article  CAS  Google Scholar 

  40. Lemoine S, Combes F, Le Crom S. An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucleic Acids Res. 2009; 37(6):1726–39.

    Article  CAS  Google Scholar 

  41. Phillippy AM, Ayanbule K, Edwards NJ, Salzberg SL. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 2009:286.

  42. Satya RV, Zavaljevski N, Kumar K, Reifman J. A high-throughput pipeline for designing microarray-based pathogen diagnostic assays. BMC Bioinformatics. 2008; 9(1):185.

    Article  Google Scholar 

  43. Feng S, Tillier ER. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics. 2007; 23(10):1195–202.

    Article  CAS  Google Scholar 

  44. Susko E, Leigh J, Doolittle W, Bapteste E. Visualizing and assessing phylogenetic congruence of core gene sets: a case study of the γ-Proteobacteria. Mol Biol Evol. 2006; 23:1019–30.

    Article  CAS  Google Scholar 

  45. Zahariev M, Dahl V, Chen W, Lévesque CA. Efficient algorithms for the discovery of DNA oligonucleotide barcodes from sequence databases. Mol Ecol Resour. 2009; 9(s1):58–64.

    Article  CAS  Google Scholar 

  46. Chen W, Djama ZR, Coffey MD, Martin FN, Bilodeau GJ, Radmer L, Denton G, Lévesque CA. Membrane-based oligonucleotide array developed from multiple markers for the detection of many Phytophthora species. Phytopathology. 2013; 103(1):43–54.

    Article  CAS  Google Scholar 

  47. SantaLucia Jr. J, Hicks D. The thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct. 2004; 33:415–40.

    Article  Google Scholar 

  48. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000; 28(22):4552–7.

    Article  CAS  Google Scholar 

  49. Ilie L, Ilie S, Khoshraftar S, Bigvand AM. Seeds for effective oligonucleotide design. BMC Genomics. 2011; 12(1):280.

    Article  Google Scholar 

  50. Ilie L, Mohamadi H, Golding GB, Smyth WF. Bond: Basic oligonucleotide design. BMC Bioinformatics. 2013; 14(69).

  51. Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet. 2002; 32(supplement):490–5.

    Article  CAS  Google Scholar 

  52. Tambong J, de Cock A, Tinker N, Lévesque CA. Oligonucleotide array for identification and detection of Pythium species. Appl Environ Microbiol. 2006; 72(4):2691–706.

    Article  CAS  Google Scholar 

  53. Tsui CK, Woodhall J, Chen W, Lévesque CA, Lau A, Schoen CD, Baschien C, Najafzadeh MJ, de Hoog GS. Molecular techniques for pathogen identification and fungus detection in the environment. IMA Fungus: Glob Mycol J. 2011; 2(2):177.

    Article  Google Scholar 

  54. Mertes F, ElSharawy A, Sauer S, van Helvoort JM, Van Der Zaag P, Franke A, Nilsson M, Lehrach H, Brookes AJ. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genom. 2011:033.

  55. Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, Nord AS, Mandell JB, Swisher EM, Kinga M-C. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc Natl Acad Sci USA. 2010; 107(28):12629–33.

    Article  CAS  Google Scholar 

  56. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-Seq quantification. Nat Biotechnol. 2016; 34(5):525.

    Article  CAS  Google Scholar 

  57. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017; 14(4):417.

    Article  CAS  Google Scholar 

  58. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15(3):46.

    Article  Google Scholar 

  59. Stobbe AH, Daniels J, Espindola AS, Verma R, Melcher U, Ochoa-Corona F, Garzon C, Fletcher J, Schneider W. E-probe diagnostic nucleic acid analysis (EDNA): A theoretical approach for handling of next generation sequencing data for diagnostics. J Microbiol Meth. 2013; 94:356–66.

    Article  CAS  Google Scholar 

  60. Espindola A, Schneider W, Hoyt PR, Marek SM, Garzon C. A new approach for detecting fungal and oomycete plant pathogens in next generation sequencing metagenome data utilising electronic probes. Int J Data Min Bioinforma. 2015; 12(2):115–28.

    Article  Google Scholar 

  61. Sayrafiezadeh M. The birthday problem revisited. Math Mag. 1994; 67(3):220–3.

    Article  Google Scholar 

  62. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48:443–53.

    Article  CAS  Google Scholar 

  63. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013; 41:36–42.

    Article  Google Scholar 

  64. Lawrence DP, Gannibal PB, Peever TL, Pryor BM. The sections of Alternaria: formalizing species-group concepts. Mycologia. 2013; 105(3):530–46.

    Article  Google Scholar 

  65. Woudenberg J, Groenewald J, Binder M, Crous P. Alternaria redefined. Stud Mycol. 2013; 75:171–212.

    Article  CAS  Google Scholar 

  66. Samson RA, Visagie CM, Houbraken J, Hong S-B, Hubka V, Klaassen CH, Perrone G, Seifert KA, Susca A, Tanney JB. Phylogeny, identification and nomenclature of the genus Aspergillus. Stud Mycol. 2014; 78:141–73.

    Article  CAS  Google Scholar 

  67. Visagie C, Houbraken J, Frisvad JC, Hong S-B, Klaassen C, Perrone G, Seifert K, Varga J, Yaguchi T, Samson R. Identification and nomenclature of the genus Penicillium. Stud Mycol. 2014; 78:343–71.

    Article  CAS  Google Scholar 

  68. Woudenberg J, Seidl M, Groenewald J, de Vries M, Stielow J, Thomma B, Crous P. Alternaria section Alternaria: Species, formae speciales or pathotypes?Stud Mycol. 2015; 82:1–21.

    Article  CAS  Google Scholar 

  69. Katoh K, Standley DM. Mafft multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013; 30(4):772–80.

    Article  CAS  Google Scholar 

  70. Swofford DL. Paup*: Phylogenetic analysis using parsimony (and other methods) 4.0. b5. 2001.

  71. Hall TA. Bioedit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/nt. Nucleic Acids Symp Ser. 1999; 41:97–9.

    Google Scholar 

  72. Price MN, Dehal PS, Arkin AP. FastTree 2 approximately maximum-likelihood trees for large alignments. PLOS ONE. 2010; 5(3):9490.

    Article  Google Scholar 

  73. Kõljalg U, Nilsson R, Abarenkov K, Tedersoo L, Taylor A, Bahram M, Bates S, Bruns T, Bengtsson-Palme J, Callaghan T, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith G, Hartmann M, Kirk P, Kohout P, Larsson E, Lindahl B, Lücking R, Martín M, Matheny P, Nguyen N, Niskanen T, Oja J, Peay K, Peintner U, Peterson M, Pöldmaa K, Saag L, Saar I, Schüßler A, Scott J, Senés C, Smith M, Suija A, Taylor D, Telleria M, Weiß M, Larsson K-H. Towards a unified paradigm for sequence-based identification of fungi. Mol Ecol. 2013; 22:5271–7.

    Article  Google Scholar 

  74. Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul. 1998; 8:3–30.

    Article  Google Scholar 

  75. Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, Martin J-F. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics. 2011; 12(1):245.

    Article  Google Scholar 

  76. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012; 13(1):341.

    Article  CAS  Google Scholar 

  77. Markham NR, Zuker M. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. 2005; 33:577–81.

    Article  Google Scholar 

  78. Clauset A, Shalizi CR, Newman MEJ. Power-law distributions in empirical data. SIAM Rev. 2009; 51(4):661–703.

    Article  Google Scholar 

Download references

Acknowledgements

We thank Christine Lowe for testing aodp, Chunfang Zheng for reviewing the manuscript and Anatoly Belov for his help in reviewing existing HTS tools and in evaluating USEARCH and BLAST. Our deepest gratitude goes to the anonymous reviewers for their thoughtful and insightful comments.

Funding

This work is financially supported by Agriculture & Agri-Food Canada (AAFC) funded project #97 and Canadian Safety and Security Program (CSSP) funded project CRTI 09-462RD/CSSP 30vv01.

Availability of data and materials

The Automated Oligonucleotide Design Pipeline (aodp) used in analyses in this study is included in the supplementary material and is also available at: https://bitbucket.org/wenchen_aafc/aodp_v2.0_release.

The 97AerobiotaSamples data set is available as samples in the Sequence Read Archive, https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/PRJNA358221; all other datasets are available in the supplementary material.

Author information

Authors and Affiliations

Authors

Contributions

MZ, WC and AL contributed the concept. MZ designed the algorithms and provided the data analysis. WC designed the workflow of aodp and supervised all testing. WC, CV, AL and MZ provided the evaluation datasets. All authors provided input in the interpretation and presentation of data analyses, edited and reviewed the manuscript. All authors read and approved the final version of this manuscript.

Corresponding authors

Correspondence to Manuel Zahariev or Wen Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zahariev, M., Chen, W., Visagie, C. et al. Cluster oligonucleotide signatures for rapid identification by sequencing. BMC Bioinformatics 19, 395 (2018). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2363-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2363-3

Keywords