Anomaly detection in discussion forum posts using global vectors

Paweł Cichosz


Anomaly detection can be seen as an unsupervised learning task in which a predictive model built on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. A Polish Internet discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serves as a text source that is both realistic and possibly interesting in its own right, due to potential associations with drug-related crime. Forum posts are preprocessed by stopword removal, spelling correction, stemming, and frequency-based term filtering. The Global Vectors (GloVe) text representation, an example of the increasingly popular word embedding approach, is combined with two unsupervised anomaly detection algorithms: one based on one-class SVM classification and one based on dissimilarity to k-medoids clusters. The cluster dissimilarity approach combined with the GloVe representation outperforms one-class SVM with respect to detection quality and appears to be a more promising approach to anomaly detection in text data.
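The cluster dissimilarity idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several details are assumptions that the abstract does not state: the toy two-dimensional vectors stand in for trained GloVe embeddings, a post is represented by the average of its word vectors, distances are Euclidean, and the anomaly score is the distance to the nearest medoid. All function and variable names are hypothetical.

```python
import math
import random

# Hypothetical toy word vectors standing in for trained GloVe embeddings.
TOY_VECTORS = {
    "plant": [1.0, 0.0], "seed": [0.9, 0.1], "grow": [0.8, 0.2],
    "light": [0.0, 1.0], "lamp": [0.1, 0.9], "police": [5.0, 5.0],
}

def post_vector(tokens, vectors):
    """Average the vectors of known tokens (an assumed aggregation scheme)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("post contains no known tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(points, k, iterations=20, seed=0):
    """Simple alternating k-medoids: assign points, then re-pick each medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iterations):
        # Assign every point to its nearest current medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: euclidean(p, points[m]))
            clusters[nearest].append(i)
        # Within each cluster, pick the member with the smallest total distance.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can only happen with duplicate points
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(euclidean(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

def anomaly_score(vec, medoids):
    """Dissimilarity to clusters, taken here as distance to the nearest medoid."""
    return min(euclidean(vec, m) for m in medoids)

# "Historical" posts, already tokenized and preprocessed.
train_posts = [["plant", "seed"], ["seed", "grow"], ["plant", "grow"],
               ["light"], ["lamp"], ["light", "lamp"]]
train_vecs = [post_vector(p, TOY_VECTORS) for p in train_posts]
medoids = k_medoids(train_vecs, k=2)

normal_score = anomaly_score(post_vector(["plant", "grow"], TOY_VECTORS), medoids)
odd_score = anomaly_score(post_vector(["police"], TOY_VECTORS), medoids)
print(normal_score, odd_score)
```

In this toy setting the post that lies far from both clusters receives the larger score. For the one-class SVM alternative, the same post vectors could be passed to an off-the-shelf implementation such as scikit-learn's `OneClassSVM`; whether the paper used that library is not stated in the abstract.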
Author: Paweł Cichosz (FEIT / IN), The Institute of Computer Science
Publication size in sheets: 0.55
Book: Romaniuk Ryszard, Linczuk Maciej Grzegorz (eds.): Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, vol. 10808, 2018, SPIE - the International Society for Optics and Photonics, ISBN 9781510622036, 2086 p.
Keywords in English: anomaly detection, text classification, text clustering, word embeddings
Language: English
Score (nominal): 15
Score: Ministerial score = 15.0, 16-10-2018, BookChapterMatConf
Ministerial score (2013-2016) = 15.0, 16-10-2018, BookChapterMatConf