Different Strategies of Fitting Logistic Regression for Positive and Unlabelled Data
- Paweł Roman Teisseyre,
- Jan Mielniczuk,
- Małgorzata Łazęcka
In this paper we revisit the problem of fitting logistic regression to positive and unlabelled data. There are two key contributions. First, we shed new light on the properties of the frequently used naive method, in which unlabelled examples are treated as negative. In particular, we show that the naive method is related to incorrect specification of the logistic model and that, consequently, its parameters are shrunk towards zero. An interesting relationship between the shrinkage parameter and the label frequency is established. Second, we introduce a novel method of fitting the logistic model based on simultaneous estimation of the vector of coefficients and the label frequency. Importantly, the proposed method does not require prior estimation, which is a major obstacle in positive unlabelled learning. The method is superior to both the naive method and the weighted likelihood method in predicting posterior probability for several benchmark data sets. Moreover, it yields a consistently better estimator of label frequency than the other two known methods. We also introduce a simple but powerful representation of positive and unlabelled data under the Selected Completely at Random assumption, from which most properties of such a model follow straightforwardly.
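The two fitting strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual Selected Completely at Random formulation, in which a positive example is labelled with a constant probability c (the label frequency), so that P(s = 1 | x) = c · σ(βᵀx). The function names (`simulate_pu`, `fit_naive`, `fit_joint`), the simulated data, the starting values, the step size and the plain gradient-ascent optimiser are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_pu(n, beta, c, rng):
    """Simulate SCAR data: a positive example is labelled with probability c."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
    y = rng.random(n) < sigmoid(X @ beta)      # true class, unobserved in practice
    s = y & (rng.random(n) < c)                # observed label indicator
    return X, y.astype(float), s.astype(float)


def fit_naive(X, s, steps=25):
    """Naive method: logistic regression of s on X, fitted by Newton's method."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ b)
        grad = X.T @ (s - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        b = b + np.linalg.solve(hess, grad)
    return b


def fit_joint(X, s, steps=2000, lr=0.5):
    """Gradient ascent on the PU log-likelihood over (b, c), with c = sigmoid(a)."""
    b, a = np.zeros(X.shape[1]), 0.0           # assumed start: b = 0, c = 0.5
    for _ in range(steps):
        c = sigmoid(a)
        sig = sigmoid(X @ b)
        q = c * sig                            # model for P(s = 1 | x)
        r = (s - q) / (1.0 - q)
        b = b + lr * (X.T @ (r * (1.0 - sig))) / len(s)   # d logL / d b
        a = a + lr * np.mean(r * (1.0 - c))               # d logL / d a
    return b, sigmoid(a)


# Hypothetical example: intercept 0.5, slopes (1, -2), label frequency 0.5.
rng = np.random.default_rng(0)
beta_true = np.array([0.5, 1.0, -2.0])
X, y, s = simulate_pu(5000, beta_true, c=0.5, rng=rng)
b_naive = fit_naive(X, s)
b_joint, c_hat = fit_joint(X, s)
print("naive coefficients:", b_naive)
print("joint coefficients:", b_joint, "estimated label frequency:", c_hat)
```

On simulated data of this kind the naive slope estimates should come out smaller in norm than the true slopes, which is the shrinkage effect the abstract refers to. The joint fit returns an estimate of c alongside the coefficients, so no separate estimation step is needed in this sketch.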
- Krzhizhanovskaya Valeria V., Závodszky Gábor, Lees Michael H. [et al.] (eds.): Computational Science – ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV, Lecture Notes in Computer Science, vol. 12140, 2020, Cham, Springer International Publishing, ISBN 978-3-030-50422-9. DOI:10.1007/978-3-030-50423-6
- Keywords in Polish
- uczenie z danych PU, regresja logistyczna, minimalizacja ryzyka empirycznego, zła specyfikacja
- Keywords in English
- Positive unlabelled learning, Logistic regression, Empirical risk minimization, Misspecification
- Abstract in Polish (translated)
- The work concerned methods of estimating the probability of label 1 from partially labelled (PU) data using a modified loss function.
- DOI:10.1007/978-3-030-50423-6_1
- https://link.springer.com/chapter/10.1007%2F978-3-030-50423-6_1
- Language: English (en)
- Score (nominal)
- Score source
- = 140.0, 08-06-2021, ChapterFromConference