De novo assembly of bacterial genomes with repetitive DNA regions by dnaasm application

Kuśmirek, Wiktor; Nowak, Robert

doi:10.1186/s12859-018-2281-4

Software
Open access
Published: 18 July 2018

BMC Bioinformatics volume 19, Article number: 273 (2018)

Abstract

Background

Many organisms, in particular bacteria, contain repetitive DNA fragments called tandem repeats. DNA assemblers restore these structures by mapping paired-end tags to unitigs, estimating the distance between them, and filling the gap with the specified DNA motif, which may be repeated many times. However, some tandem repeats are longer than the distance between the paired-end tags.

Results

We present a new algorithm for de novo DNA assembly that uses the relative frequency of reads to properly restore tandem repeats. Its main advantage is that long tandem repeats, which are much longer than the maximum read length and the insert size of paired-end tags, can be properly restored. Moreover, repetitive DNA regions covered only by single-read sequencing data can also be restored. Other existing de novo DNA assemblers fail in such cases.

The presented application is composed of several steps, including: (i) building the de Bruijn graph, (ii) correcting the de Bruijn graph, (iii) normalizing edge weights, and (iv) generating the output set of DNA sequences.

We tested our approach on real data sets of bacterial organisms.

Conclusions

A software library, a console application and a web application were developed. The web application uses a client-server architecture in which a web browser communicates with the end user, and the algorithms are implemented in C++ and Python. The presented approach enables proper reconstruction of tandem repeats that are longer than the insert size of paired-end tags. The application is freely available to all users under the GNU Library or Lesser General Public License version 3.0 (LGPLv3).

Background

Next-generation sequencing (NGS) has dramatically reduced the time and cost of producing genome sequences using massively parallel technologies [1]; as a result, the volume of sequencing data is growing exponentially [2]. The reduction in sequencing cost and time has enabled many applications, such as biosurveillance, bioforensics, and infectious disease epidemiology [3]. Moreover, genome-scale metabolic modeling and metagenomic sequencing of patient samples could improve the efficiency of diagnosis and treatment of diseases in the near future. All of the practical applications listed above are based largely on the genome sequencing of bacterial organisms.

The sequencing procedure for bacterial organisms has changed considerably over the last 20 years. In 1995, the first two bacterial genome sequences were published. Sequencing technology has evolved since then, and bacterial sequencing has become a standard procedure. However, many of the sequenced bacterial genomes are incomplete: for example, 90% of bacterial genomes in GenBank [3] are incomplete. In many cases the incompleteness results from repetitive sequences in bacterial genomes, which cannot always be reconstructed from the short DNA reads of second-generation sequencing.

Some of the repetitive DNA regions form a structure called a tandem repeat: a sequence built from several identical DNA fragments lying one after another, caused mainly by strand-slippage replication [4]. Bacterial genomes contain up to several dozen tandem repeats, divided into two groups: intragenic and intergenic. Nevertheless, only a small number of tandem repeats have been functionally studied to date; for example, some of the functions of specific genes can be modulated by the instability of tandem repeats. This process allows bacteria to adapt to a new environment in the short term without complicated mutations [5].

Current DNA assemblers, such as ABySS [6], Velvet [7] or SPAdes [8], reconstruct tandem repeats using the information contained in paired-end tags. However, some repetitive regions may be much longer than the maximum read length and the insert size of paired-end tags. Such regions cannot be reconstructed by modern DNA assemblers.

Here, we present a new algorithm for DNA assembly that uses the relative frequency of DNA reads to properly reconstruct tandem repeats. The main advantage of our approach is that tandem repeats longer than the insert size of paired-end tags can also be properly reconstructed, while other de novo genome assemblers fail in such cases. Moreover, long tandem repeats can be restored even if only single-read sequencing data are available. The presented approach requires high sequencing coverage, which is currently easily achievable for bacterial genomes; in return, the tandem repeat reconstruction process can significantly improve contiguity over previous approaches, as this study also indicates.

Implementation

In this section, we present the main data processing pipeline implemented in a new DNA assembler named 'dnaasm'. We use the de Bruijn graph because of its efficiency for next-generation sequencing data. We focus mainly on the process of estimating the length of tandem repeats and the process of reconstructing repetitive DNA fragments. We also present the main implementation aspects that make our application efficient in memory and computation.

Assembly workflow

Building and correcting de Bruijn graph

The first stage of de novo assembly in 'dnaasm' is de Bruijn graph construction. As in a typical de novo DNA assembler, dnaasm builds the de Bruijn graph from the input set of DNA reads by splitting each read into a set of k-mers. Each k-mer is a substring of length k of an input DNA read; a single DNA read of length L yields L−k+1 k-mers. Error correction algorithms, similar to those implemented previously [7], are then applied to the constructed de Bruijn graph. In particular, dnaasm uses algorithms for removing tips, bubbles and edges of low weight. At this stage, all edges representing DNA sequencing errors should be removed from the de Bruijn graph. The edges of the de Bruijn graph represent substrings of length k, and in the presented approach each edge has an additional property, an integer named the edge weight, which gives the number of occurrences of the corresponding DNA fragment of length k in the input set of DNA reads, as in the A-Bruijn graph [9].
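The k-mer splitting and edge weighting described above can be illustrated with a minimal Python sketch. It is not the dnaasm implementation (which is written in C++); the function name `kmer_edge_weights` is hypothetical, and the sketch ignores reverse complements and error correction.

```python
from collections import Counter


def kmer_edge_weights(reads, k):
    """Count how often each length-k substring occurs in the reads.

    Hypothetical helper: each distinct k-mer stands for one de Bruijn
    graph edge, and its count plays the role of the edge weight.
    """
    weights = Counter()
    for read in reads:
        # A read of length L contributes L - k + 1 k-mers.
        for i in range(len(read) - k + 1):
            weights[read[i:i + k]] += 1
    return weights


reads = ["ACGTAC", "CGTACG"]
weights = kmer_edge_weights(reads, k=4)
```

With two reads of length 6 and k = 4, each read contributes 6 − 4 + 1 = 3 k-mers, so the weights sum to 6.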

The edge weight w is equal to the exact k-mer count, i.e. the number of occurrences in the set of reads of the DNA substring of length k that the edge represents. Consider an ideal assembler input R, called a k-spectrum, where reads are generated without errors from a circular bacterial genome with sequence $s_{0}s_{1} \ldots s_{G-1}$ of length G, all reads r∈R have identical length L, and R is the set of all substrings $s_{i} \ldots s_{i+k-1}$ for 0≤i≤G. In this case the edge weight is $w = \frac {N(L-k+1)}{G}$ for non-repetitive k-mers, where N is the number of reads, and $w = \frac {\Delta }{d} \frac {N(L-k+1)}{G}$ for k-mers inside a tandem repeat of length n, where a repetitive motif of length d is repeated ⌊n/d⌋ times (integer division), the tandem repeat is longer than the graph dimension (n>k), and Δ=n−k+1.

We prove in [10] that the edge weight w of a de Bruijn graph of dimension k, for an error-free set of N reads of identical length L, assuming a uniform distribution of read positions over the input circular bacterial genome of length G, is a random variable with a Poisson distribution (probability mass function $P\left (x \right) = \frac {{e^{- \lambda } \lambda ^{x} }}{{x!}}$), as given in Eq. 1.

$$ W \sim \mathrm{Poisson}(\lambda)~\text{where}~\lambda = \frac{NL(L-k+1)\Delta}{Gkd},~\Delta = n-k+1 \tag{1} $$

Estimating a number of repeats

After the de Bruijn graph has been constructed and corrected, dnaasm estimates how many times the DNA fragment represented by each edge of the de Bruijn graph occurs in the investigated genome. This process consists of two stages. First, the normalization factor is calculated according to the equation:

$$ p=\frac{G}{N(L-k+1)} \tag{2} $$

The presented normalization factor results from modeling the edge weight by the Poisson distribution described in Eq. 1. Then, the edge weights are normalized: the input edge weight (the number of occurrences of the DNA fragment represented by the edge in the input set of DNA reads) is multiplied by the previously calculated normalization factor. The result is rounded to the nearest integer, which represents the number of occurrences of the DNA fragment represented by the edge in the investigated genome. This step can be briefly described by the following equation:

$$ w^{\prime} = \mathrm{round}(p \cdot w) = \lfloor p \cdot w + 0.5 \rfloor \tag{3} $$
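As an illustration of Eqs. 2 and 3, the following Python sketch computes the normalization factor and the normalized edge weight. The function names and the example values of G, N, L and k are assumptions chosen for illustration (they give a coverage of NL/G = 100), not parameters taken from dnaasm.

```python
import math


def normalization_factor(genome_length, num_reads, read_length, k):
    """Eq. 2: p = G / (N * (L - k + 1))."""
    return genome_length / (num_reads * (read_length - k + 1))


def normalize_weight(weight, p):
    """Eq. 3: round p * w to the nearest integer via floor(p * w + 0.5)."""
    return math.floor(p * weight + 0.5)


# Hypothetical example values: G = 1 Mbp, N = 1e6 reads, L = 100, k = 55.
G, N, L, k = 1_000_000, 1_000_000, 100, 55
p = normalization_factor(G, N, L, k)

single_copy = normalize_weight(46, p)   # k-mer seen 46 times in the reads
three_copies = normalize_weight(140, p)  # k-mer seen 140 times in the reads
```

Here a non-repetitive k-mer is expected to occur N(L−k+1)/G = 46 times in the reads, so a raw weight of 46 normalizes to one genomic copy and a raw weight of 140 to three.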

Proper reconstruction of repetitive sequences requires high coverage, $c = \frac {NL}{G} \ge 100$. When c≥10, the Poisson distribution of the edge weight can be approximated by the normal distribution $\mathcal {N}(\mu, \sigma)$:

$$ W' \sim \mathcal{N}(\mu, \sigma)~\text{where}~\mu = \frac{\Delta}{d},~\sigma = \sqrt{\frac{\Delta}{d}},~\Delta = n-k+1 \tag{4} $$

For a given confidence level q, 0≤q≤1, we can calculate the read coverage c required for proper reconstruction of the repetitive motif using Eq. 5, where $\Phi _{N}^{-1}(q)$ is the inverse cumulative distribution function of the standard normal distribution (μ=0, σ=1), d is the length of the repetitive motif, n is the length of the tandem repeat (n>k), k is the de Bruijn graph dimension, and L is the read length.

$$ c = \frac{k}{L-k+1}\left(2 \Phi_{N}^{-1}\left(\frac{1+q}{2}\right)\right)^{2} \frac{\Delta}{d}~\text{where}~\Delta = n-k+1 \tag{5} $$
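Eq. 5 can be evaluated directly with the Python standard library. The sketch below is only a numerical illustration; the function name `required_coverage` and the example parameters are assumptions, and q must be strictly below 1 because the inverse normal CDF is undefined at 1.

```python
from statistics import NormalDist


def required_coverage(q, n, d, k, read_length):
    """Evaluate Eq. 5 for confidence q, repeat length n, motif length d,
    graph dimension k and read length L (hypothetical helper)."""
    delta = n - k + 1
    # Inverse CDF of the standard normal distribution at (1 + q) / 2.
    z = NormalDist().inv_cdf((1 + q) / 2)
    return k / (read_length - k + 1) * (2 * z) ** 2 * delta / d


# Example: a 100 bp motif repeated 5 times (n = 500), k = 55, L = 100.
c95 = required_coverage(0.95, n=500, d=100, k=55, read_length=100)
c99 = required_coverage(0.99, n=500, d=100, k=55, read_length=100)
```

For these example values the 95% confidence level gives a required coverage of roughly 82x, and raising the confidence level raises the required coverage.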

The process of estimating the number of occurrences of a given DNA fragment in the investigated genome is presented in Fig. 1.

Detecting tandem repeats

The next step of the tandem repeat reconstruction process is the detection of structures in the de Bruijn graph that represent tandem repeats in the investigated genome. These structures appear as loops in the de Bruijn graph that are connected with the rest of the graph by only one in-edge and only one out-edge. In other words, tandem repeats are represented by a sub-graph in which exactly one vertex has two in-edges and one out-edge, exactly one vertex has one in-edge and two out-edges, and all other vertices have one in-edge and one out-edge. Such a structure consists of two parts:

a branch from a vertex which represents an entry to the loop to a vertex which represents an exit of the loop;
a branch from a vertex which represents an exit of the loop to a vertex which represents an entry to the loop.

An example of a structure representing tandem repeat in the de Bruijn graph is presented in Fig. 2.
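The degree pattern described above can be checked with a short graph walk. The following Python sketch is a simplified illustration and not the detection code used in dnaasm: it assumes the graph is given as a list of (source, target) vertex pairs without parallel edges, and the names `find_tandem_loops` and `follow_simple` are hypothetical.

```python
from collections import defaultdict


def find_tandem_loops(edges):
    """Return (entry, exit) vertex pairs of loops with the degree pattern
    2-in/1-out at the entry, 1-in/2-out at the exit, 1-in/1-out elsewhere."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    vertices = set(succ) | set(pred)

    def degrees(v):
        return len(pred.get(v, ())), len(succ.get(v, ()))

    def follow_simple(v, target):
        # Skip over 1-in/1-out vertices; stop at target or at a branching vertex.
        for _ in range(len(vertices)):
            if v == target or degrees(v) != (1, 1):
                break
            v = succ[v][0]
        return v

    loops = []
    for entry in sorted(vertices):
        if degrees(entry) != (2, 1):
            continue
        # Branch from the entry of the loop to its exit.
        exit_ = follow_simple(succ[entry][0], entry)
        if degrees(exit_) != (1, 2):
            continue
        # Branch from the exit back to the entry must close the loop.
        if any(follow_simple(nxt, entry) == entry for nxt in succ[exit_]):
            loops.append((entry, exit_))
    return loops


loop_graph = [("pre", "E"), ("E", "m"), ("m", "X"),
              ("X", "post"), ("X", "b"), ("b", "E")]
found = find_tandem_loops(loop_graph)
```

In the example, vertex `E` is the entry of the loop and `X` its exit; a purely linear path contains no such structure.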

Correcting weights in tandem repeats

The next step of the tandem repeat reconstruction process is the correction of the edge weights in the previously detected de Bruijn graph loops. First, the weights within each branch are corrected so that all edges of the branch have the same weight. Second, the number of vertices in each of the two parts of the loop is counted. Then, the edge weights in the part with fewer vertices are adapted to the edge weights of the part with more vertices, so that every vertex in the loop has degree 0. Here, the degree of a vertex is the sum of the weights of its edges, where the weights of in-edges are counted as positive and the weights of out-edges as negative. An example of the correction of normalized edge weights in de Bruijn graph loops is presented in Fig. 3.
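A possible reading of this correction step is sketched below in Python. Several details are assumptions not stated in the text: the common weight of a branch is taken as the rounded median of its edge weights, the number of edges in a branch is used as a proxy for its number of vertices, ties favor the entry-to-exit branch, and the single in-edge and out-edge connecting the loop to the rest of the graph are assumed to carry the same weight `external`.

```python
from statistics import median


def correct_loop_weights(external, forward, back):
    """Return corrected (forward, back) branch weights for one loop.

    external: weight of the edge entering (and leaving) the loop.
    forward:  normalized weights on the entry-to-exit branch.
    back:     normalized weights on the exit-to-entry branch.
    """
    # Step 1 (assumption): unify each branch to its rounded median weight.
    fwd_w = round(median(forward))
    back_w = round(median(back))
    # Steps 2-3: trust the longer branch and adapt the shorter one so that
    # in-weights minus out-weights is 0 at the entry and exit vertices:
    # external + back_w == fwd_w.
    if len(forward) >= len(back):
        back_w = fwd_w - external
    else:
        fwd_w = back_w + external
    return [fwd_w] * len(forward), [back_w] * len(back)


fwd, bck = correct_loop_weights(1, [3, 3, 4, 3, 3], [2, 3])
```

In the example the longer forward branch settles on weight 3, so the back branch is set to 3 − 1 = 2, which balances both the entry vertex (1 + 2 in, 3 out) and the exit vertex (3 in, 2 + 1 out).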

Resolving tandem repeats in DNA sequence

The last step of reconstructing the repetitive DNA sequence from next-generation sequencing reads is to generate a DNA sequence from the de Bruijn graph. This process involves traversing the vertices of the de Bruijn graph until an ambiguous vertex is encountered.

A vertex is treated as unambiguous if it has zero, one or two input (output) edges and, in the case of exactly two input (output) edges, a simple return path exists for one of them, i.e. a path from the target vertex to the source vertex that has at least one vertex with more than one input edge and at least one vertex with more than one output edge. This condition makes the number of ambiguous vertices in our approach smaller than in other existing assemblers, where a vertex is considered ambiguous if it has more than one input edge or more than one output edge.

The process of resolving tandem repeats consists of two steps: (1) finding vertices without any input edges and with at least one output edge; such a vertex starts a new contig and becomes the current vertex; (2) iteratively processing directly connected vertices, i.e. adding them to the current contig and decrementing the weights of visited edges; if an edge weight reaches 0, the edge is removed from the graph. If the current vertex v is unambiguous, it extends the current contig; otherwise, it starts a new one. Moreover, if the current vertex v is unambiguous and has two output edges, the edge for which the previously defined simple return path exists is chosen.

This process is repeated until all ambiguous vertices are resolved. An example of generating DNA sequences from de Bruijn graph is presented in Fig. 4.
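The traversal can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the dnaasm algorithm itself: the graph is a dictionary `graph[u][v] = weight` with corrected integer weights, the return-path test only checks reachability (it omits the additional conditions on the vertices along the path), and all function names are hypothetical.

```python
def start_vertices(graph):
    """Vertices with at least one output edge and no input edges."""
    targets = {v for outs in graph.values() for v in outs}
    return [u for u, outs in graph.items() if outs and u not in targets]


def has_return_path(graph, start, target):
    """True if target can be reached from start over the remaining edges."""
    stack, seen = [start], set()
    while stack:
        v = stack.pop()
        if v == target:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(graph.get(v, {}))
    return False


def walk_contig(graph, start):
    """Follow edges from start, decrementing weights, until no edge is left
    or an ambiguous vertex is reached. Returns the list of visited vertices."""
    graph = {u: dict(outs) for u, outs in graph.items()}  # work on a copy
    path, current = [start], start
    while graph.get(current):
        successors = list(graph[current])
        if len(successors) == 1:
            nxt = successors[0]
        elif len(successors) == 2:
            # Prefer the edge that leads back to the current vertex (the loop).
            looping = [v for v in successors if has_return_path(graph, v, current)]
            if len(looping) != 1:
                break  # ambiguous vertex: stop this contig
            nxt = looping[0]
        else:
            break
        graph[current][nxt] -= 1
        if graph[current][nxt] == 0:
            del graph[current][nxt]  # exhausted edges leave the graph
        path.append(nxt)
        current = nxt
    return path


weighted = {"pre": {"E": 1}, "E": {"m": 3}, "m": {"X": 3},
            "X": {"b": 2, "post": 1}, "b": {"E": 2}}
path = walk_contig(weighted, start_vertices(weighted)[0])
```

In the example the loop through `E`, `m` and `X` is walked three times before the traversal leaves through `post`, matching the edge weight of 3 on the entry-to-exit branch.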

Final assembly steps

All of the previously described steps of de novo assembly in dnaasm lead to the generation of a set of DNA sequences called unitigs. The unitigs can then be extended to contigs and scaffolds using paired-end tags and mate-pairs; both algorithms are also implemented in dnaasm.

Implementation

The web application was developed in a client-server architecture, in which a web browser communicates with the end user, Python is used for the application server, and the algorithms are implemented in C++. The architecture is based on the bioweb framework [11]; the main modules of the application are presented in Fig. 5.

To achieve high performance in the calculation module, we used several memory-efficient structures, e.g. the Compressed Sparse Row Graph from the Boost library to represent the de Bruijn graph and Google Sparse Hash to implement the hash map. These memory optimizations enable building and processing graphs of up to 7×10⁹ vertices (e.g. for the human genome) in 256 GB of RAM. We deploy the module as a shared C++ library.
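The idea behind a compressed sparse row (CSR) graph can be shown with a small Python sketch. It illustrates the general data layout only and is not the Boost implementation used in dnaasm; vertices are assumed to be numbered from 0, and the function names are hypothetical.

```python
def to_csr(num_vertices, edges):
    """Pack (source, target) pairs into two flat arrays.

    offsets[u]..offsets[u + 1] delimits the slice of `targets` that holds
    the successors of vertex u, so no per-vertex container is needed.
    """
    edges = sorted(edges)
    offsets = [0] * (num_vertices + 1)
    for u, _ in edges:
        offsets[u + 1] += 1
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]  # prefix sums of out-degrees
    targets = [v for _, v in edges]
    return offsets, targets


def successors(offsets, targets, u):
    """Successors of vertex u as a slice of the flat target array."""
    return targets[offsets[u]:offsets[u + 1]]


offsets, targets = to_csr(3, [(0, 1), (2, 0), (0, 2), (1, 2)])
```

Storing the graph as two flat arrays costs one offset per vertex and one entry per edge, which is why this layout suits very large, static graphs.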

Results

In this section, we present the results of tests on real data sets of bacterial organisms. We compare the results obtained by our approach with tandem repeats detected by algorithms based on paired-end tags. We also briefly describe a new real assembly case from a whole genome sequencing project in which our approach gives an advantage. Moreover, we carried out several experiments on simulated datasets to compare the efficiency of tandem repeat reconstruction.

Comparison to other applications

We compared the dnaasm application with the three popular de novo DNA assemblers: ABySS [6] ver. 2.0.1, Velvet [7] ver. 1.2.10 and SPAdes [8] ver. 3.11.0. Applications were compared on four sets of bacterial DNA reads obtained from the National Center for Biotechnology Information. The benchmark dataset contains DNA reads from four samples - ERR351243 for Helicobacter pylori PeCan4, SRR5431732 for Mycobacterium bovis, SRR1981622 and SRR1981619 for Helicobacter pylori J99. The description of benchmark data sets is presented in Table 1.

Table 1 Sets of benchmark data

De novo assembly of the mentioned DNA reads was carried out in two modes: with and without paired-end tags. The results were compared in terms of the number of contigs longer than 1000 bp, the N50 contig length, the length of the longest contig, and two parameters describing the quality of the resulting sequences: the average number of mismatches and the average number of indels per 100,000 aligned bases. These parameters were calculated with the quality assessment tool QUAST [12] ver. 4.1; the results are presented in Table 2.
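For readers unfamiliar with the N50 metric, the following Python sketch shows its usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. It is a plain illustration of the definition and does not reproduce any filtering that QUAST applies; the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Contig length at which the cumulative sum of the sorted lengths
    (longest first) first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


example_n50 = n50([2, 3, 4, 5, 6])
```

For contig lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs already cover 11 ≥ 10 bases, so the N50 is 5.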

Table 2 Evaluation of dnaasm in comparison to ABySS, Velvet and SPAdes assembler

Table 3 shows the improvement in results obtained by tandem repeat resolution. Furthermore, we counted the number of places in the investigated samples where our approach works properly and other assemblers fail. To compare the number of detected tandem repeats we used the Tandem repeats finder application [13]; its results are presented in Table 4.

Table 3 Evaluation of tandem repeats reconstruction algorithm in dnaasm

Table 4 Detected tandem repeats in bacterial test datasets

Simulated reference genome

The next two experiments were carried out on simulated data generated from an artificial reference genome. This sequence consists of 20 tandem repeats separated from each other by sections of 1000 random symbols over the {A, C, G, T} alphabet. The repetitive structures include: a motif of length 100 bp repeated 2, 3, 4 and 5 times; a motif of length 200 bp repeated 2, 3, 4 and 5 times; a motif of length 300 bp repeated 2, 3, 4 and 5 times; a motif of length 400 bp repeated 2, 3, 4 and 5 times; and a motif of length 500 bp repeated 2, 3, 4 and 5 times. The motifs were sequences of random symbols.
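A reference genome with this layout could be generated along the following lines. This Python sketch is not the authors' generator: the random seed, the use of a fresh random motif for each repeat, and the placement of a random spacer before the first and after the last repeat are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"


def random_dna(rng, length):
    """Random sequence over {A, C, G, T}."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def simulated_reference(seed=0, spacer_length=1000):
    """20 tandem repeats (motifs of 100..500 bp, 2..5 copies each),
    separated and flanked by random spacers (flanks are an assumption)."""
    rng = random.Random(seed)
    parts = [random_dna(rng, spacer_length)]
    for motif_length in (100, 200, 300, 400, 500):
        for copies in (2, 3, 4, 5):
            motif = random_dna(rng, motif_length)
            parts.append(motif * copies)
            parts.append(random_dna(rng, spacer_length))
    return "".join(parts)


genome = simulated_reference()
```

With these assumptions the sequence contains 21 spacers of 1000 bp and 21,000 bp of repeats, i.e. 42,000 bp in total.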

Simulated dataset for different value of insert size

In this experiment we investigated how insert size affects the accuracy of tandem repeats detection. We generated sets of reads from simulated reference genome using the profile-based Illumina pair-end Read Simulator pIRS [14]. Three sets were generated:

mean insert size: 250 bp, standard deviation of insert sizes: 25;
mean insert size: 750 bp, standard deviation of insert sizes: 75;
mean insert size: 1250 bp, standard deviation of insert sizes: 125.

The read length and depth of coverage for all simulated sets of reads were 100 bp and 150x, respectively. The substitution-error rate was 0.01, and the simulation of indel errors in reads was switched on. To compare the number of detected tandem repeats we used the Tandem repeats finder application [13]. The results are shown in Table 5.

Table 5 The efficiency of tandem repeats reconstruction from simulated data

Simulated dataset for different depth of coverage

In this experiment we checked how read coverage affects tandem repeat detection for different types of repetitive sequences: we compared the efficiency of reconstructing tandem repeats by our approach and by methods based on paired-end tags on simulated datasets generated with different depths of coverage. As in the previous experiment, we used datasets generated by the read simulator from our reference genome. The read length, mean insert size and standard deviation of insert sizes were 100 bp, 250 bp and 25, respectively; the error simulation parameters were the same as in the previous experiment. We generated three sets of input paired-end tags with depths of coverage of 50x, 100x and 150x. The results are shown in Table 6.

Table 6 The efficiency of tandem repeats reconstruction from simulated data

PCR confirmation

To demonstrate the correctness and usefulness of our approach, we used our application in a project managed by the Witold Stefański Institute of Parasitology of the Polish Academy of Sciences that deals with, inter alia, sequencing and assembling the mitochondrial DNA of the rat tapeworm Hymenolepis diminuta. Despite the small size of this sequence (only 13,900 bp), it contains a large repetitive DNA region (tandem repeats) with 13 repeats of the same 31-nt sequence [15]. To assemble this sequence, we obtained paired reads (2x100 bp) from an Illumina sequencer, with an average insert size of 300 bp. The insert size of the paired-end tags was therefore smaller than the length of the investigated repetitive region. For this reason, our application was the only DNA assembler able to reconstruct this repetitive region. Moreover, the depth of coverage for this sequencing project was high, i.e. above 1000x for mitochondrial DNA, so we were able to run our application several times with different coverage depths (from 300x to 1000x). The results of all these runs were the same; in particular, the DNA fragment with the tandem repeat was always reconstructed. In addition, ultra-deep sequencing of PCR amplicons for this DNA region confirmed the results obtained by our approach.

Discussion

In this paper we describe an application used to reconstruct some repetitive DNA regions based on the normalised read depth. The presented approach was thoroughly tested, and the experiments carried out on the simulated data described in this paper confirmed our concept. Moreover, the reconstruction of a repetitive DNA region was confirmed by biological experiments.

The read coverage of a genome region is key to the correct reconstruction of the repetitive fragment in our approach. However, the read depth of a specific DNA region varies depending on its GC content [16]. There are many methods for correcting the GC bias [17]; most of them are implemented in copy number variation (CNV) detection tools based on read depth. Implementing and testing a GC bias correction algorithm in our approach is one of the most important tasks for the near future.

Nanopore sequencers are now very popular. They make it possible to obtain DNA reads longer than 10 kbp. The main disadvantage of nanopore sequencing is that the obtained data contain more errors than second-generation sequencing reads. However, long reads can improve the assembly results obtained from short reads [18]. The presented algorithm does not currently use long reads, but we plan to integrate such sequencing data in the next version of the software.

In the future we also plan to add the possibility of running the application on a computer cluster. The de novo assembler will be divided into a set of containers, which will be managed and run by Apache Spark. The new architecture will make it possible to distribute the calculation, which will significantly reduce the time of de novo assembly. Furthermore, we plan to create a virtual machine [19] image and an Amazon machine image.

The demo application with a web interface, as well as the source code of the application, is available at the project homepage (see Notes). In addition, there is a public Docker container [20] with the dnaasm de novo assembler. The presented application is freely available to both academic and commercial users under the GNU Library or Lesser General Public License version 3.0 (LGPLv3).

Conclusions

As more and more bacterial genomes are sequenced, it becomes desirable to analyze their tandem repeats. Here we have presented dnaasm, a de novo DNA assembler that uses the relative frequency of reads to properly reconstruct repetitive sequences, especially, in bacterial genomes.

Notes

http://dnaasm.sourceforge.net

References

1. Henson J, Tischler G, Ning Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics. 2012; 13(8):901–15.
2. Koboldt D, Steinberg K, Larson D, Wilson R, Mardis ER. The Next-Generation Sequencing Revolution and Its Impact on Genomics. Cell. 2013; 155(1):27–38.
3. Land M, Hauser L, Jun S-R, Nookaew I, Leuze M, Ahn T-H, Karpinets T, Lund O, Kora G, Wassenaar T, Poudel S, Ussery D. Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics. 2015; 15:141–161.
4. Fan H, Chu J-Y. A Brief Review of Short Tandem Repeat Mutation. Genomics Proteomics Bioinform. 2007; 5:7–14.
5. Zhou K, Aertsen A, W Michiels C. The Role of Variable DNA Tandem Repeats in Bacterial Adaptation. FEMS Microbiol Rev. 2013; 38:119–141.
6. D Jackman S, Vandervalk B, Mohamadi H, Chu J, Yeo S, Hammond S, Jahesh G, Khan H, Coombe L, Warren R, Birol I. ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017; 27:214346–116.
7. R Zerbino D, Birney E. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 2008; 18:821–9.
8. Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov A, M Lesin V, Nikolenko S, Pham S, D Prjibelski A, V Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, A Alekseyev M, A Pevzner P. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol J Comput Mol Cell Biol. 2012; 19:455–77.
9. Pevzner PA, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Res. 2004; 14(9):1786–96.
10. Nowak RM. Assembly of repetitive regions using next-generation sequencing data. Biocybernetics Biomed Eng. 2015; 35:276–83.
11. Nowak RM. Polyglot Programming in Applications Used for Genetic Data Analysis. BioMed Res Int. 2014; 2014:253013.
12. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013; 29(8):1072–5.
13. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acid Res. 1999; 27(2):573–80.
14. Galaxy Y, Yuan J, Shi Y, lu J, Binghang L, Li Z, Chen Y, Mu D, Zhang H, Li N, Yue Z, Bai F, Li H, Fan W. pIRS: Profile based Illumina pair-end Reads Simulator. Bioinformatics (Oxford, England). 2012; 28:1533–5.
15. von Nickisch-Rosenegk M, Brown WM, Boore JL. Complete Sequence of the Mitochondrial Genome of the Tapeworm Hymenolepis diminuta: Gene Arrangements Indicate that Platyhelminths Are Eutrochozoans. Mol Biol Evol. 2001; 18(5):721–30.
16. D Smith S, K Kawash J, Grigoriev A. GROM-RD: Resolving genomic biases to improve read depth detection of copy number variants. PeerJ. 2015; 3:836.
17. Benjamini Y STP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acid Res. 2012; 40(10):72.
18. Tan MH, Austin CM, Hammer MP, Lee YP, Croft LJ, Gan HM. Finding Nemo: Hybrid assembly with Oxford Nanopore and Illumina reads greatly improves the Clownfish (Amphiprion ocellaris) genome assembly. GigaScience. 2018; 7:137.
19. Nocq J, Celton M, Gendron P, Lemieux S, T Wilhelm B. Harnessing Virtual Machines to simplify next generation DNA sequencing analysis. Bioinformatics (Oxford, England). 2013; 29:2075–2083.
20. Merkel D. Docker: lightweight linux containers for consistent development and deployment. Linux J. 2014; 2014.

Acknowledgements

This work was supported by Polish National Science Centre grant No 2014/13/B/NZ6/00881. We would like to thank Prof. Daniel Młocicki from the Witold Stefański Institute of Parasitology of the Polish Academy of Sciences for sharing the next-generation sequencing data that allowed us to test our software.

This research was partially presented as a poster titled “dnaasm – new tool to assemble repetitive DNA regions” (doi: 10.7490/f1000research.1114626.1) at the Joint 25th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 16th European Conference on Computational Biology (ECCB) in Prague, Czech Republic (July 21 to 25, 2017).

We would like to thank the editor and anonymous reviewers for their constructive comments.

Funding

This work was supported by Polish National Science Centre grant No 2014/13/B/NZ6/00881 and by the statutory research of Institute of Computer Science of Warsaw University of Technology.

The funders had no role in study design, data analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
Wiktor Kuśmirek & Robert Nowak

Contributions

RN identified the problem, RN and WK designed the approach. WK implemented the software. WK worked on testing and validation, WK and RN wrote the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Wiktor Kuśmirek.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional information

Availability of data and materials

dnaasm is an open-source project available at http://dnaasm.sourceforge.net under the GNU Library or Lesser General Public License version 3.0 (LGPLv3). It was implemented in C++ and was run and tested on Windows 7 and Linux 16.04.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Kuśmirek, W., Nowak, R. De novo assembly of bacterial genomes with repetitive DNA regions by dnaasm application. BMC Bioinformatics 19, 273 (2018). https://doi.org/10.1186/s12859-018-2281-4

Received: 17 May 2018
Accepted: 09 July 2018
Published: 18 July 2018
DOI: https://doi.org/10.1186/s12859-018-2281-4
