Proceedings. The Fourth International Conference on Computer and Information Technology

Abstract

A large amount of the information available on the Web is formatted in HTML tables, which are mainly presentation-oriented and are not suited to database applications. As a result, capturing the information in HTML tables semantically and integrating the relevant information is a challenge. In this paper, we present a new approach that automatically captures the semantic hierarchies of HTML tables and semi-automatically integrates HTML tables. It first automatically captures the attribute-value pairs in HTML tables by normalization, and introduces the notion of an eigen-value in formatting information to recognize the headings of HTML tables. After the global concepts and the global schema are generated manually by defining which data are to be integrated, it learns the lexical semantic set for each global concept, and the contexts, by labelling the attributes of example HTML tables with their corresponding global concepts. Finally, it integrates the data of each source HTML table, using the lexical semantic sets and the contexts to eliminate conflicts and resolve the nondeterministic problems that arise in mapping each source schema to the global schema.
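As a rough illustration of the pipeline outlined in the abstract, the Python sketch below expands spanned cells into a regular grid, reads attribute-value pairs from it, and maps source attributes to global concepts through lexical semantic sets. It is not the paper's algorithm: the function names, the (text, rowspan, colspan) cell encoding, and the sample data are all hypothetical. The heading is simply assumed to be the first row (the eigen-value heading detection is not implemented), and contexts are left out entirely.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical cell encoding: (text, rowspan, colspan)
Cell = Tuple[str, int, int]


def normalize(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand rowspan/colspan so that every grid position holds a cell text."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a span from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each data cell with its heading (assumption: row 0 is the heading)."""
    if not grid:
        return []
    headings = grid[0]
    return [dict(zip(headings, row)) for row in grid[1:]]


def to_global_concept(attribute: str,
                      lexical_sets: Dict[str, Set[str]]) -> Optional[str]:
    """Return the global concept whose lexical set contains the attribute name."""
    key = attribute.strip().lower()
    for concept, words in lexical_sets.items():
        if key in words:
            return concept
    return None


def integrate(records: List[Dict[str, str]],
              lexical_sets: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Rename source attributes to global concepts; drop unmapped attributes."""
    result = []
    for record in records:
        mapped = {}
        for attribute, value in record.items():
            concept = to_global_concept(attribute, lexical_sets)
            if concept is not None:
                mapped[concept] = value
        result.append(mapped)
    return result


# Hypothetical source table: "Acme" spans two rows.
sample_rows: List[List[Cell]] = [
    [("Maker", 1, 1), ("Model", 1, 1), ("Cost", 1, 1)],
    [("Acme", 2, 1), ("X1", 1, 1), ("100", 1, 1)],
    [("X2", 1, 1), ("120", 1, 1)],
]

# Hypothetical lexical semantic sets for three global concepts.
sample_lexical_sets: Dict[str, Set[str]] = {
    "manufacturer": {"maker", "vendor"},
    "model": {"model"},
    "price": {"cost", "price"},
}

integrated = integrate(attribute_value_pairs(normalize(sample_rows)),
                       sample_lexical_sets)
```

In this sketch an attribute that matches no lexical set is silently dropped. The abstract indicates that the actual approach also uses learned contexts to resolve conflicts and nondeterministic mappings, which a plain set lookup like this one cannot do.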