Data Acquisition and Information Extraction for Scientific Knowledge Base Building
Piotr Andruszkiewicz, Henryk Rybiński
We present the process of data acquisition and information extraction for building a comprehensive and accurate scientific knowledge base that covers conferences, publications, and scientists. We use two kinds of data sources. First, we gather data from structured and reliable, but incomplete and not always up-to-date, sources such as digital libraries. Second, we enrich the information extracted from those sources with unstructured data obtained from the Internet, filtering websites with an SVM classifier to identify potentially useful web pages. There are two potential sources of error in the enrichment process: the unstructured origin of the data, and the limited accuracy of the machine learning methods used for data acquisition and information extraction. We address both problems by proposing a new information extraction method and by using crowdsourcing to correct the extracted information. Our methods are currently used in a scientific platform, the Omega-Psir university knowledge base, which contains lists of researchers, publications, events, etc.
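The web page filtering step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the features, kernel, or training data used, so the example below assumes bag-of-words features and a linear SVM trained by plain subgradient descent on the hinge loss, using only the Python standard library. The names `featurize`, `train_linear_svm`, `is_useful`, and the toy pages in `TRAIN_PAGES` are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page's text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def featurize(text):
    """Return L2-normalized bag-of-words counts as a sparse dict."""
    counts = Counter(tokenize(text))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(weights, bias, features):
    """Linear decision function w . x + b over sparse features."""
    return sum(weights.get(t, 0.0) * v for t, v in features.items()) + bias


def train_linear_svm(samples, labels, epochs=200, lam=0.01, lr=0.1):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    Labels must be +1 (potentially useful page) or -1 (not useful).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * score(weights, bias, x)
            # L2 regularization shrinks every weight slightly on each step.
            for t in weights:
                weights[t] *= 1.0 - lr * lam
            # The hinge loss contributes a gradient only when margin < 1.
            if margin < 1.0:
                for t, v in x.items():
                    weights[t] = weights.get(t, 0.0) + lr * y * v
                bias += lr * y
    return weights, bias


def is_useful(weights, bias, text):
    """Classify a page as potentially useful when its score is positive."""
    return score(weights, bias, featurize(text)) > 0.0


# Hypothetical toy training set: (page text, label). Assumed for illustration.
TRAIN_PAGES = [
    ("call for papers conference submission deadline program committee", 1),
    ("professor publications research interests homepage university", 1),
    ("workshop proceedings accepted papers keynote speakers", 1),
    ("cheap flights hotel deals book your holiday now", -1),
    ("online shop discount shoes free shipping sale", -1),
    ("football match results league table scores", -1),
]

WEIGHTS, BIAS = train_linear_svm(
    [featurize(text) for text, _ in TRAIN_PAGES],
    [label for _, label in TRAIN_PAGES],
)

if __name__ == "__main__":
    for page in ("conference call for papers and submission deadline",
                 "discount hotel deals and free shipping"):
        print(page, "->", is_useful(WEIGHTS, BIAS, page))
```

In practice such a filter would be trained on a much larger labeled set of pages, and a library implementation of the SVM would normally replace the hand-written training loop.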
|Publication size in sheets||0.5|
|Book||O’Conner Lisa (eds.): 12th IEEE International Conference (ICSC). Proceedings, 2018, IEEE, ISBN 978-1-5386-4409-6, [978-1-5386-4408-9], 419 p.|
|Score|| = 15.0, 16-12-2018, BookChapterMatConf|
|Publication indicators||= 0; = 0|
* The presented citation count is obtained through analysis of information available on the Internet and is close to the number calculated by the Publish or Perish system.