Word Sense Induction with Closed Frequent Termsets
Marek Kozłowski, Henryk Rybiński
Abstract
The article is devoted to the problem of word sense induction. We propose a method for inducing senses from a raw text corpus. The proposed sense induction algorithm (called SenseSearcher, or SnS) is based on closed frequent sets, and as a result, it provides a multilevel sense representation. To a large extent, it is a knowledge-poor approach, as it does not need any kind of structured knowledge base about senses and there is no deep language knowledge embedded. By discovering a hierarchy of senses, the algorithm enables identifying subsenses (fine-grained senses). SnS discovers not only frequent (dominating) senses but also infrequent (dominated) ones. The method was evaluated in two main areas: lexicography and information retrieval. With the use of the SnS algorithm, we provide a tool able to induce from a textual corpus a structure of senses, with a varying number of granularity levels. In the area of information retrieval, SnS can be used for clustering search results according to the discovered senses. The experiments have shown that SnS performs better than the methods participating in the SemEval-2013 WSI Task 11 competition, and most of the known search result clustering methods.
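To illustrate the notion of closed frequent termsets that the abstract refers to, here is a minimal brute-force sketch. It is not the SnS algorithm: the function name `closed_frequent_termsets`, the toy contexts for the ambiguous word "apple", and the `min_support` threshold are all assumptions made for this illustration. A termset counts as frequent when it occurs in at least `min_support` contexts, and as closed when no proper superset has the same support.

```python
from itertools import combinations


def closed_frequent_termsets(contexts, min_support):
    """Return {termset: support} for every closed frequent termset.

    Hypothetical helper for illustration only (exhaustive search,
    suitable for tiny inputs; not the authors' implementation).
    """
    contexts = [frozenset(c) for c in contexts]
    vocabulary = sorted(set().union(*contexts))

    # Enumerate candidates level by level and keep the frequent ones.
    frequent = {}
    for size in range(1, len(vocabulary) + 1):
        found_any = False
        for candidate in combinations(vocabulary, size):
            termset = frozenset(candidate)
            support = sum(1 for c in contexts if termset <= c)
            if support >= min_support:
                frequent[termset] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: no larger termset can be frequent.
            break

    # Keep a termset only if no proper superset has the same support.
    return {
        ts: sup
        for ts, sup in frequent.items()
        if not any(ts < other and sup == other_sup
                   for other, other_sup in frequent.items())
    }


# Toy contexts (assumed data) around the ambiguous word "apple".
contexts = [
    {"fruit", "tree", "juice"},
    {"fruit", "juice", "pie"},
    {"fruit", "tree"},
    {"iphone", "company", "store"},
    {"iphone", "company"},
    {"company", "store", "iphone"},
]

closed = closed_frequent_termsets(contexts, min_support=2)
for termset, support in sorted(closed.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(termset), support)
```

In this toy data the closed termsets nest under set inclusion: `{fruit}` sits above `{fruit, juice}` and `{fruit, tree}`, and `{company, iphone}` sits above `{company, iphone, store}`. This nesting is one way to picture how a multilevel structure of senses and subsenses can arise from closed frequent sets.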
|Journal series||Computational Intelligence, ISSN 0824-7935, e-ISSN 1467-8640|
|No||online: 30 May 2016|
|Publication size in sheets||1.6|
|Keywords in English||word sense induction, information retrieval, search result clustering, semantic processing|
|project||Development of new algorithms in the areas of software and computer architecture, artificial intelligence and information systems and computer graphics. Project leader: Rybiński Henryk, Phone: +48 22 234 7731, start date 18-05-2015, end date 30-11-2016, II/2015/DS/1, Completed|
|Score|| = 20.0, 27-03-2017, ArticleFromJournal|
|Publication indicators||: 2016 = 0.964 (2) - 2016=1.378 (5)|