Paper presented at the 6th International Conference on Web Information Systems and Technologies (WEBIST 2010), 7-10 April 2010, Valencia
Sybille Peters, Claus-Peter Rückemann and Wolfgang Sander-Beuermann
Search engines typically consist of a crawler, which traverses the web and retrieves documents, and a search frontend, which provides the user interface to the acquired information. Focused crawlers refine this approach by intelligently directing the crawler to predefined topic areas. Search engines are currently evolving by offering additional search capabilities, such as searching metadata in addition to searching the content text. Semantic web standards supply methods for augmenting webpages with metadata, and machine learning techniques are used where necessary to extract further metadata from unstructured webpages. This paper analyzes the effectiveness of techniques for vertical search engines with respect to focused crawling and metadata integration, using the field of educational research as an example. A search engine implemented for this purpose within the EERQI project is described and tested. The enhancement of focused crawling through link analysis and anchor text classification is implemented and verified. A new heuristic score calculation formula has been developed for focusing the crawler. Full texts and metadata from various multilingual sources are collected and combined into a common format.
(C) W. Sander-Beuermann, Leibniz Universität Hannover, RRZN, SearchEngineLab, SuMa-eV, Association for Free Access to Knowledge