Anomaly detection in discussion forum posts using global vectors
Abstract
Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. A Polish Internet discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serves as a text source that is both realistic and possibly interesting in its own right, due to potential associations with drug-related crime. Forum posts are preprocessed by stopword removal, spelling correction, stemming, and frequency-based term filtering. The Global Vectors (GloVe) text representation, an example of the increasingly popular word embedding approach, is combined with two unsupervised anomaly detection algorithms: one based on one-class SVM classification and one based on dissimilarity to k-medoids clusters. The cluster dissimilarity approach combined with the GloVe representation outperforms one-class SVM with respect to detection quality and appears to be the more promising approach to anomaly detection in text data.
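The cluster dissimilarity idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: the two-dimensional `embeddings` table is a hypothetical stand-in for trained GloVe vectors, a post is represented by the plain average of its word vectors, k-medoids is a simple alternating variant with a deterministic start, and the anomaly score is the Euclidean distance to the nearest medoid. The one-class SVM variant is not shown.

```python
import math

# Hypothetical toy word vectors; a real pipeline would load trained GloVe vectors.
embeddings = {
    "plant": [1.0, 0.0], "seed": [0.9, 0.1], "soil": [0.8, 0.0],
    "light": [0.0, 1.0], "lamp": [0.1, 0.9],
    "bank": [-1.0, -1.0], "loan": [-0.9, -1.1],
}

def doc_vector(tokens, embeddings):
    """Represent a post as the average of the vectors of its known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("post contains no known tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(points, k, n_iter=20):
    """Alternating k-medoids; the first k points serve as initial medoids."""
    medoids = list(range(k))
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: euclidean(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the old medoid for an empty cluster
                continue
            best = min(
                members,
                key=lambda c: sum(euclidean(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

def anomaly_score(vec, medoids):
    """Dissimilarity to the closest cluster medoid; larger means more anomalous."""
    return min(euclidean(vec, m) for m in medoids)

# "Historical" posts (already tokenized and preprocessed) used to build clusters.
train_posts = [["plant", "seed"], ["seed", "soil"],
               ["light", "lamp"], ["lamp", "light", "lamp"]]
train_vectors = [doc_vector(p, embeddings) for p in train_posts]
medoids = k_medoids(train_vectors, k=2)

normal_score = anomaly_score(doc_vector(["plant", "soil"], embeddings), medoids)
odd_score = anomaly_score(doc_vector(["bank", "loan"], embeddings), medoids)
print(normal_score, odd_score)
```

In this toy setting the off-topic post receives a larger score than the on-topic one. In practice a threshold on the score, or a ranking of posts by score, would be needed to flag anomalies, and how that threshold is chosen is outside this sketch.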
|Publication size in sheets||0.55|
|Book||Romaniuk Ryszard, Linczuk Maciej Grzegorz (eds.): Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, Proceedings of SPIE: The International Society for Optical Engineering, vol. 10808, 2018, SPIE - the International Society for Optics and Photonics, ISBN 9781510622036, 2086 p., DOI:10.1117/12.2504983|
|Keywords in English||anomaly detection, text classification, text clustering, word embeddings|
|Score|| = 15.0, 16-10-2018, BookChapterMatConf|