Anomaly detection in discussion forum posts using global vectors

Paweł Cichosz


Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. A Polish Internet discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serves as a text source that is both realistic and possibly interesting in its own right, due to potential associations with drug-related crime. Forum posts are preprocessed by stopword removal, spelling correction, stemming, and frequency-based term filtering. The Global Vectors (GloVe) text representation, an example of the increasingly popular word embedding approach, is combined with two unsupervised anomaly detection algorithms: one based on one-class SVM classification and one based on dissimilarity to k-medoids clusters. The cluster dissimilarity approach combined with the GloVe representation outperforms one-class SVM with respect to detection quality and appears to be the more promising approach to anomaly detection in text data.
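The cluster dissimilarity idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the two-dimensional word vectors below are made up (real GloVe vectors would be trained on the forum corpus), averaging word vectors into a post vector is an assumed aggregation scheme, and `post_vector`, `k_medoids`, and `anomaly_score` are hypothetical helper names. A post's anomaly score is taken here as its distance to the nearest medoid.

```python
import math
import random

# Hypothetical toy word vectors; stand-ins for trained GloVe embeddings.
TOY_VECTORS = {
    "plant": (1.0, 0.1), "grow": (0.9, 0.2), "seed": (1.1, 0.0),
    "light": (0.8, 0.3), "soil": (1.0, 0.2),
    "price": (0.1, 1.0), "sell": (0.0, 1.1), "ship": (0.2, 0.9),
    "bitcoin": (5.0, 5.0),
}


def post_vector(tokens):
    """Average the vectors of known tokens (an assumed aggregation scheme)."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        raise ValueError("post contains no known tokens")
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def k_medoids(points, k, n_iter=20, seed=0):
    """Simple alternating k-medoids: assign points, then re-pick each medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, medoids[j]))
            clusters[nearest].append(p)
        # The new medoid is the member with the smallest total distance
        # to the other members of its cluster.
        new_medoids = []
        for j, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[j])
                continue
            best = min(members,
                       key=lambda c: sum(math.dist(c, m) for m in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def anomaly_score(point, medoids):
    """Dissimilarity to the closest cluster, measured at its medoid."""
    return min(math.dist(point, m) for m in medoids)


# "Historical" posts (already tokenized and preprocessed in this toy setup).
train_posts = [
    ["plant", "grow"], ["seed", "soil"], ["grow", "light"],
    ["price", "sell"], ["sell", "ship"], ["price", "ship"],
]
train = [post_vector(p) for p in train_posts]
medoids = k_medoids(train, k=2)

typical = anomaly_score(post_vector(["plant", "soil"]), medoids)
unusual = anomaly_score(post_vector(["bitcoin"]), medoids)
print(f"typical post: {typical:.3f}, unusual post: {unusual:.3f}")
```

In this toy setup the post built from a far-off vector receives a higher score than one resembling the training posts. A threshold on the score would then separate anomalies from normal posts; how that threshold is chosen is not stated in the abstract.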
Author: Paweł Cichosz (FEIT / IN), The Institute of Computer Science
Publication size in sheets: 0.55
Book: Romaniuk Ryszard, Linczuk Maciej Grzegorz (eds.): Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, Proceedings of SPIE: The International Society for Optical Engineering, vol. 10808, 2018, SPIE - the International Society for Optics and Photonics, ISBN 9781510622036, 2086 p., DOI:10.1117/12.2504983
Keywords in English: anomaly detection, text classification, text clustering, word embeddings
Project: Creating a system for forecasting the development of crime, as an element of building a security strategy and public policy. Project leader: Wawrzyniak Zbigniew M., Phone: +48 22 234 7738, application date 15-09-2015, start date 21-12-2015, end date 20-12-2018, ISE/2015/4/NCBiR/7-2015/Prokrym, Completed
WEiTI Projects financed by NCRD [Projekty finansowane przez NCBiR (NCBR)]
Language: en (English)
Score (nominal): 15
Ministerial score = 15.0, BookChapterSeriesAndMatConf
Ministerial score (2013-2016) = 15.0, BookChapterSeriesAndMatConf