On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number
Abstract
The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry and bioinformatics for finding similar objects and for clustering and classifying them. Recently, a few very efficient methods were proposed for the lossless determination of such objects, especially in large and very high-dimensional data sets. These methods typically apply to objects that can be represented by (weighted) binary vectors. In this paper, we offer methods suitable for searching vectors with domains consisting of zero, a positive number and a negative number, which generalize weighted binary vectors. Our results are not worse than their existing analogs offered for (weighted) binary vectors.
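For orientation, the two similarity measures named in the abstract can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the search methods of the paper. The function names and the sample vectors, drawn from the hypothetical domain {-1, 0, 1}, are assumptions made for this example.

```python
import math

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: u.v / (|u| * |v|). Assumes non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def tanimoto(u, v):
    """Tanimoto similarity: u.v / (|u|^2 + |v|^2 - u.v). Assumes non-zero vectors."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)

# Hypothetical vectors whose coordinates are zero, a positive number (1)
# or a negative number (-1).
u = [1, 0, -1, 1]
v = [1, -1, -1, 0]

print(cosine(u, v))
print(tanimoto(u, v))
```

Here the inner product is 2 and both squared lengths are 3, so the cosine similarity is 2/3 and the Tanimoto similarity is 2/(3 + 3 - 2) = 0.5. A near-duplicate search would report the pair whenever the chosen measure meets a user-given threshold.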