Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003)
Download PDF

Abstract

With the exponential growth of both the amount and diversity of the information that the Web encompasses, automatic classification of topic-specific Web sites is highly desirable. We propose a novel approach for Web site classification based on the content, structure and context information of Web sites. In our approach, the site structure is represented as a two-layered tree in which each page is modeled as a DOM (document object model) tree and a site tree is used to hierarchically link all pages within the site. Two context models are presented to capture the topic dependences in the site. Then the hidden Markov tree (HMT) model is utilized as the statistical model of the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, for reducing the download size of Web sites but still keeping high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On these bases, we employ the two-phase classification system for classifying Web sites through a fine-to-coarse recursion. The experiments show our approach is able to offer high accuracy and efficient process performance.
Like what you’re reading?
Already a member?
Get this article FREE with a new membership!

Related Articles