Complexity curve: A graphical measure of data complexity and classifier performance

Julian Zubek , Dariusz Plewczyński


We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without compromising classification accuracy. The associated code is available to download at: (open source Python implementation).

Author Julian Zubek - [University of Warsaw, Centre of New Technologies]
Julian Zubek,,
, Dariusz Plewczyński (FMIS / DIPS)
Dariusz Plewczyński,,
- Department of Information Processing Systems
Journal seriesPeerJ Computer Science, ISSN 2376-5992
Issue year2016
Languageen angielski
Score (nominal)0
Score sourcejournalList
ScoreMinisterial score = 0.0, 04-06-2020, ArticleFromJournal
Ministerial score (2013-2016) = 0.0, 04-06-2020, ArticleFromJournal
Publication indicators Scopus Citations = 2
Citation count*
Share Share

Get link to the record

* presented citation count is obtained through Internet information analysis and it is close to the number calculated by the Publish or Perish system.
Are you sure?