Scaffolding algorithm using second- and third-generation reads
Wiktor Franus, Wiktor Kuśmirek, Robert Marek Nowak
Abstract
Second-generation sequencing methods produce high-quality short reads, which DNA assemblers combine into contigs. Because the length of a single read is limited to 500 bp, it is very hard to assemble full genomes or full chromosomes. Generating longer contigs at a low sequencing cost is a major goal of computer scientists working in this area. We propose to link contigs created from second-generation reads using reads from third-generation sequencers. Such reads are 10-20 kbp long. An existing implementation of this approach appears to be time- and memory-demanding for larger genomes. We developed an algorithm based on a Bloom filter and an extremely memory-efficient associative array. Our implementation substantially outperforms the previous one in both time and memory consumption. The presented algorithm, provided as a shared library, is part of the dnaasm de novo assembler. The library was created using the C++ programming language and the Boost and Google Sparse Hash libraries. Both a web browser-based graphical user interface and a command-line interface are provided. The source code, a demo web application, and a Docker image are available at the dnaasm project web page: http://dnaasm.sourceforge.net. Our application has been tested on real data from bacterial, yeast, and plant genomes.
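The abstract does not describe the algorithm in detail, so the following is only a minimal sketch of one way a Bloom filter and an associative array can cooperate when long reads are matched against contigs. It is not the dnaasm implementation. All names (KmerBloomFilter, ContigIndex, buildIndex, contigsAlongRead), the filter size, and the number of hash functions are hypothetical, and std::unordered_map stands in for a memory-efficient sparse hash table.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical Bloom filter over k-mers: compact set membership that may
// report false positives but never false negatives.
class KmerBloomFilter {
public:
    KmerBloomFilter(std::size_t numBits, unsigned numHashes)
        : bits_(numBits, false), numHashes_(numHashes) {}

    void insert(const std::string& kmer) {
        for (unsigned i = 0; i < numHashes_; ++i) bits_[index(kmer, i)] = true;
    }

    bool mayContain(const std::string& kmer) const {
        for (unsigned i = 0; i < numHashes_; ++i)
            if (!bits_[index(kmer, i)]) return false;
        return true;
    }

private:
    // Double hashing: the i-th bit position is derived from two base hashes.
    std::size_t index(const std::string& kmer, unsigned i) const {
        std::size_t h1 = std::hash<std::string>{}(kmer);
        std::size_t h2 = std::hash<std::string>{}(kmer + '#') | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned numHashes_;
};

// The filter answers "was this k-mer seen in any contig?" cheaply; the map
// (an assumption standing in for a sparse hash table) records the contig.
struct ContigIndex {
    KmerBloomFilter filter;
    std::unordered_map<std::string, int> kmerToContig;
    std::size_t k;
};

ContigIndex buildIndex(const std::vector<std::string>& contigs, std::size_t k) {
    ContigIndex idx{KmerBloomFilter(1 << 16, 3), {}, k};  // assumed sizes
    for (std::size_t c = 0; c < contigs.size(); ++c) {
        for (std::size_t p = 0; p + k <= contigs[c].size(); ++p) {
            std::string kmer = contigs[c].substr(p, k);
            idx.filter.insert(kmer);
            // For a k-mer shared by several contigs, the first contig wins.
            idx.kmerToContig.emplace(kmer, static_cast<int>(c));
        }
    }
    return idx;
}

// Walk a long read and list the contigs it touches, in order of appearance.
// Consecutive hits on the same contig are collapsed into one entry.
std::vector<int> contigsAlongRead(const ContigIndex& idx, const std::string& read) {
    std::vector<int> order;
    for (std::size_t p = 0; p + idx.k <= read.size(); ++p) {
        std::string kmer = read.substr(p, idx.k);
        if (!idx.filter.mayContain(kmer)) continue;   // cheap rejection
        auto it = idx.kmerToContig.find(kmer);
        if (it == idx.kmerToContig.end()) continue;   // Bloom false positive
        if (order.empty() || order.back() != it->second) order.push_back(it->second);
    }
    return order;
}
```

In this sketch, a read that yields the contig sequence 0, 1 would be evidence that contig 1 follows contig 0 in the genome. A real scaffolder would also need to handle reverse complements, sequencing errors in long reads, and gap-size estimation, none of which are modeled here.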