Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance

Wiktor Kuśmirek, Agnieszka Szmurło, Marek Wiewiórka, Robert Marek Nowak, Tomasz Gambin


Background: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) from Whole Exome Sequencing (WES) data based on read depth analysis. These tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential part of the process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance.

Methods: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on the CNV calling performance of three state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as the reference set, and random selection) and two clustering methods (k-means and k nearest neighbours (kNN), with a variable number of clusters or group sizes) were evaluated to find the best-performing sample selection method.

Results and Conclusions: The experiments showed that appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that a smart reduction of the reference sample set size may significantly increase the algorithms' precision while having a negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution.
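The two clustering-based selection strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which features or distance metric were used, so the per-target read-depth vectors, the Euclidean distance, the toy data, and the function names `knn_reference_sets` and `kmeans_reference_sets` are all assumptions made for illustration.

```python
import math
import random

def euclidean(a, b):
    # Assumed similarity measure between two coverage profiles.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_reference_sets(profiles, k):
    """kNN-style selection: each sample gets its k most similar other samples."""
    refs = {}
    for name, vec in profiles.items():
        ranked = sorted(
            (euclidean(vec, other_vec), other)
            for other, other_vec in profiles.items()
            if other != name
        )
        refs[name] = [other for _, other in ranked[:k]]
    return refs

def kmeans_reference_sets(profiles, n_clusters, n_iter=50, seed=0):
    """k-means-style selection: each sample's reference set is the rest of its cluster."""
    rng = random.Random(seed)
    names = sorted(profiles)
    # Initialise centroids from randomly chosen samples.
    centroids = [list(profiles[n]) for n in rng.sample(names, n_clusters)]
    assignment = {}
    for _ in range(n_iter):
        new_assignment = {
            n: min(range(n_clusters),
                   key=lambda c: euclidean(profiles[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # cluster membership no longer changes
        assignment = new_assignment
        for c in range(n_clusters):
            members = [profiles[n] for n in names if assignment[n] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return {
        n: [m for m in names if m != n and assignment[m] == assignment[n]]
        for n in names
    }

# Hypothetical normalized read depth per target for six samples (two clear groups).
profiles = {
    "s1": [1.0, 1.1, 0.9], "s2": [1.1, 1.0, 1.0], "s3": [0.9, 1.0, 1.1],
    "s4": [5.0, 5.1, 4.9], "s5": [5.1, 5.0, 5.0], "s6": [4.9, 5.0, 5.1],
}
knn_refs = knn_reference_sets(profiles, k=2)
kmeans_refs = kmeans_reference_sets(profiles, n_clusters=2)
```

In this sketch the kNN variant ranks all other samples separately for every sample, whereas the k-means variant clusters the cohort once and reuses each cluster as a shared reference set. How this relates to the runtime difference reported in the abstract depends on details of the full CNV calling pipeline that the abstract does not give.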
Authors
- Wiktor Kuśmirek (FEIT / IN), The Institute of Computer Science
- Agnieszka Szmurło (FEIT / IN), The Institute of Computer Science
- Marek Wiewiórka (FEIT / IN), The Institute of Computer Science
- Robert Marek Nowak (FEIT / IN), The Institute of Computer Science
- Tomasz Gambin (FEIT / IN), The Institute of Computer Science
Journal series: BMC Bioinformatics, ISSN 1471-2105
Issue year: 2019
Publication size in sheets: 0.5
ASJC Classification: 1303 Biochemistry; 1312 Molecular Biology; 1315 Structural Biology; 1706 Computer Science Applications; 2604 Applied Mathematics
Project: Simultaneous analysis of single nucleotide and structural variants from whole exome or targeted sequencing. Project leader: Gambin Tomasz, phone: +48 22 234 7148, application date 26-10-2015, start date 21-10-2016, planned end date 21-10-2019, end date 20-10-2019, II/2016/IP/1, Completed
WEiTI projects financed by MNiSW
Language: English (en)
Score (nominal): 100
Score source: journalList
Score: Ministerial score = 100.0, 17-06-2020, ArticleFromJournal
Publication indicators: Scopus Citations = 1; WoS Citations = 0; GS Citations = 1.0; Scopus SNIP (Source Normalised Impact per Paper): 2018 = 0.855; WoS Impact Factor: 2018 = 2.511 (2) - 2018 = 2.97 (5)
Citation count*: 1 (2020-06-23)
* The presented citation count is obtained through analysis of information available on the Internet and is close to the number calculated by the Publish or Perish system.