Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance, Chapter 17
Marzena Kryszkiewicz
Abstract
The cosine similarity measure is often applied in information retrieval, text classification, clustering, and ranking, where documents are usually represented as term frequency vectors or their variants, such as tf-idf vectors. In these tasks, the most time-consuming operation is the calculation of most similar vectors or, alternatively, least dissimilar vectors. This operation has been commonly believed to be inefficient for large high-dimensional datasets. However, the recently proposed use of the triangle inequality to determine neighborhoods based on a distance metric makes this operation feasible for such datasets. Although the cosine similarity measure is not a distance metric and, in particular, violates the triangle inequality, in this chapter we present how to determine cosine similarity neighborhoods of vectors by means of the Euclidean distance applied to (α-)normalized forms of these vectors and by using the triangle inequality. We address three types of sets of cosine similar vectors: all vectors whose similarity to a given vector is not less than an ε threshold value, and two variants of the k-nearest neighbors of a given vector.
Author  Marzena Kryszkiewicz (FEIT / IN), The Institute of Computer Science
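
The idea summarized in the abstract can be illustrated with a minimal sketch. It relies on a standard identity: for vectors scaled to length α, the squared Euclidean distance equals 2α²(1 − cosine similarity), so a cosine threshold ε corresponds to a Euclidean radius α·√(2 − 2ε). The function names, the choice of reference point, and the linear scan below are assumptions made for illustration only; this record does not reproduce the chapter's actual algorithms.

```python
import math

def normalize(v, alpha=1.0):
    """Scale a non-zero vector v to Euclidean length alpha."""
    norm = math.sqrt(sum(x * x for x in v))
    return [alpha * x / norm for x in v]

def cosine(u, v):
    """Plain cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def eps_to_radius(eps, alpha=1.0):
    """Euclidean radius on alpha-normalized vectors equivalent to cosine >= eps."""
    return alpha * math.sqrt(2.0 - 2.0 * eps)

def cosine_neighborhood(query, data, eps, alpha=1.0):
    """Indices i with cosine(query, data[i]) >= eps, found via Euclidean distance."""
    radius = eps_to_radius(eps, alpha)
    nq = normalize(query, alpha)
    # Hypothetical reference point: the first coordinate axis scaled to length alpha.
    ref = [alpha] + [0.0] * (len(query) - 1)
    dq = euclidean(nq, ref)
    result = []
    for i, v in enumerate(data):
        nv = normalize(v, alpha)
        dv = euclidean(nv, ref)  # in practice this would be precomputed once
        # Triangle inequality: |d(q, ref) - d(v, ref)| <= d(q, v),
        # so a large gap rules v out without computing d(q, v).
        if abs(dq - dv) > radius:
            continue
        if euclidean(nq, nv) <= radius:
            result.append(i)
    return result

data = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [2.0, 0.1]]
neighbors = cosine_neighborhood([1.0, 0.0], data, eps=0.9)
```

In this toy example, the second and third vectors are discarded by the triangle-inequality check alone, so their exact distance to the query is never computed.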

Pages  323-345
Book 
Skowron Andrzej, Suraj Zbigniew (eds.): Rough Sets and Intelligent Systems – Professor Zdzisław Pawlak in Memoriam. Volume 2, Intelligent Systems Reference Library, vol. 43, 2013, Heidelberg New York Dordrecht London, Springer-Verlag, ISBN 978-3-642-30340-1, 604 p., DOI:10.1007/978-3-642-30341-8

Keywords in English  k-nearest neighbors – ε-neighborhood – the cosine similarity measure – the Euclidean distance – the triangle inequality – normalized vector – data clustering – text clustering – high-dimensional data
ASJC Classification  3309 Library and Information Sciences; 1802 Information Systems and Management; 1700 General Computer Science 
DOI  DOI:10.1007/978-3-642-30341-8_17
URL 
http://link.springer.com/chapter/10.1007/978-3-642-30341-8_17
Project  Establishment of the universal, open, hosting and communication, repository platform for network resources of knowledge to be used by science, education and open knowledge society. Project leader: Kryszkiewicz Marzena, Phone: +48 22 234 7701, start date 16-08-2010, planned end date 16-08-2013, end date 31-10-2013, WEiTI/2012/PS/1, Completed
BG PW
Projects financed by NCRD [Projekty finansowane przez NCBiR (NCBR)]

Language  en (English)
Score (nominal)  5 
Score  Ministerial score = 5.0, 19-05-2020, MonographChapterAuthor
Publication indicators 
Scopus SNIP (Source Normalised Impact per Paper): 2013 = 0.269 