Scaffolding algorithm using second- and third-generation reads
Wiktor Franus , Wiktor Kuśmirek , Robert Marek Nowak
AbstractThe second generation sequencing methods produce high-quality short reads, which are assembled into contigs by DNA assemblers. Due to the fact that length of a single read is limited to 500bp it is really hard to assembly full genomes or full chromosomes. Generating longer contigs with low cost of sequencing is a main effort of computer scientists in this area. We propose to link contings created from second-generation reads using reads from third-generation sequencers. Such reads have length 10-20kbp. An existing implementation of this approach appears to be time and memory demanding for larger genomes. We developed an algorithm based on Bloom filter and extremely memory-efficient associative array. Our implementation remarkably exceeds the previous one in terms of time and memory consumption. Presented algorithm, provided as a shared library, is a part of the dnaasm de-novo assembler. The library has been created using C++ programming language, Boost and Google Sparse Hash libraries. Both web browser-based graphical user interface and command line interface are provided. Source code as well as a demo web application and a docker image are available at the dnaasm project web-page: http://dnaasm.sourceforge.net. Our application has been tested on real data of bacteria, yeast and plant genomes.
|Publication size in sheets||0.5|
|Book||Romaniuk Ryszard, Linczuk Maciej Grzegorz (eds.): Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, vol. 10808, 2018, SPIE - the International Society for Optics and Photonics, ISBN 9781510622036, 2086 p.|
|Keywords in English||scaffold, next generation sequencing, de-novo, DNA assemblers, hybrid DNA assembly, third-generation sequencing|
|Score|| = 15.0, 16-10-2018, BookChapterMatConf|
= 15.0, 16-10-2018, BookChapterMatConf
* presented citation count is obtained through Internet information analysis and it is close to the number calculated by the Publish or Perish system.