Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance, Chapter 17

Marzena Kryszkiewicz

Abstract

The cosine similarity measure is often applied in information retrieval, text classification, clustering, and ranking, where documents are usually represented as term frequency vectors or their variants, such as tf-idf vectors. In these tasks, the most time-consuming operation is finding the most similar vectors or, alternatively, the least dissimilar vectors. This operation has commonly been believed to be inefficient for large high-dimensional datasets. However, a recently proposed approach that uses the triangle inequality to determine neighborhoods based on a distance metric makes this operation feasible for such datasets. Although the cosine similarity measure is not a distance metric and, in particular, violates the triangle inequality, in this chapter we present how to determine cosine similarity neighborhoods of vectors by means of the Euclidean distance applied to (α-)normalized forms of these vectors and by using the triangle inequality. We address three types of sets of cosine similar vectors: all vectors whose similarity to a given vector is not less than a threshold value ε, and two variants of the k-nearest neighbors of a given vector.
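
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the chapter's algorithm. It relies only on the standard identity that, for vectors u' and v' scaled to Euclidean length α, ‖u' − v'‖² = 2α²(1 − cos(u, v)), so that cos(u, v) ≥ ε holds exactly when ‖u' − v'‖ ≤ α·sqrt(2 − 2ε). The function names, the choice of the first vector as the reference point, and the floating-point tolerance are assumptions made for this illustration.

```python
import math

def normalize(v, alpha=1.0):
    """Scale a non-zero vector v to Euclidean length alpha."""
    norm = math.sqrt(sum(x * x for x in v))
    return [alpha * x / norm for x in v]

def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Plain cosine similarity, used here only as a brute-force check."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_eps_neighborhood(vectors, query_index, eps, alpha=1.0, tol=1e-12):
    """Indices i != query_index with cos(vectors[query_index], vectors[i]) >= eps.

    Works on alpha-normalized copies of the vectors and uses the Euclidean
    distance plus a triangle-inequality filter (illustrative sketch only).
    """
    normed = [normalize(v, alpha) for v in vectors]
    # cos(u, v) >= eps  <=>  ||u' - v'|| <= alpha * sqrt(2 - 2 * eps)
    radius = alpha * math.sqrt(2.0 - 2.0 * eps)
    ref = normed[0]  # assumption: any fixed reference vector would do
    dist_to_ref = [euclidean(v, ref) for v in normed]
    q = normed[query_index]
    result = []
    for i, v in enumerate(normed):
        if i == query_index:
            continue
        # Triangle inequality: |d(q, ref) - d(v, ref)| <= d(q, v),
        # so v can be skipped without computing d(q, v).
        if abs(dist_to_ref[query_index] - dist_to_ref[i]) > radius + tol:
            continue
        if euclidean(q, v) <= radius + tol:
            result.append(i)
    return result
```

For example, `cosine_eps_neighborhood([[1, 0], [1, 1], [0, 1], [2, 0.1], [-1, 0]], 1, 0.7)` returns the indices of the vectors whose cosine similarity to `[1, 1]` is at least 0.7. In this sketch the filter only avoids some distance computations; sorting the vectors by their distance to the reference vector would additionally allow whole ranges of candidates to be skipped.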
Author Marzena Kryszkiewicz (FEIT / IN), The Institute of Computer Science
Pages 323-345
Book Skowron Andrzej, Suraj Zbigniew (eds.): Rough Sets and Intelligent Systems – Professor Zdzisław Pawlak in Memoriam. Volume 2, Intelligent Systems Reference Library, vol. 43, 2013, Heidelberg New York Dordrecht London, Springer-Verlag, ISBN 978-3-642-30340-1, 604 p., DOI:10.1007/978-3-642-30341-8
Keywords in English k-nearest neighbors – ε-neighborhood – the cosine similarity measure – the Euclidean distance – the triangle inequality – normalized vector – data clustering – text clustering – high-dimensional data
ASJC Classification 3309 Library and Information Sciences; 1802 Information Systems and Management; 1700 General Computer Science
DOI 10.1007/978-3-642-30341-8_17
URL http://link.springer.com/chapter/10.1007/978-3-642-30341-8_17
Project Establishment of the universal, open, hosting and communication, repository platform for network resources of knowledge to be used by science, education and open knowledge society. Project leader: Kryszkiewicz Marzena, Phone: +48 22 234 7701, start date 16-08-2010, planned end date 16-08-2013, end date 31-10-2013, WEiTI/2012/PS/1, Completed
BG PW Projects financed by NCRD [Projekty finansowane przez NCBiR (NCBR)]
Language English (en)
Score (nominal) 5
Score Ministerial score = 5.0, 19-05-2020, MonographChapterAuthor
Publication indicators Scopus SNIP (Source Normalised Impact per Paper): 2013 = 0.269