Pattern Recognition, International Conference on

Abstract

This paper proposes a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale, sparse, high-dimensional data collections. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. An EM method of parameter estimation is provided for all of these models. The models are compared experimentally on two extensive online document collections.
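
To make the EM estimation mentioned in the abstract concrete, the following is a minimal sketch for a flat (non-hierarchical) mixture of multinomials over word-count vectors. It is not the paper's hierarchical symmetric or asymmetric model, whose details the abstract does not give. The function name `em_multinomial_mixture`, its parameters, the random initialisation, and the small smoothing constant `alpha` are all assumptions made for illustration.

```python
import math
import random


def em_multinomial_mixture(counts, k, iters=50, seed=0, alpha=1e-2):
    """Fit a flat k-component multinomial mixture to rows of word counts by EM.

    Illustrative sketch only: `alpha` is an assumed smoothing constant that
    keeps every probability strictly positive so the logs below are defined.
    """
    rng = random.Random(seed)
    n, v = len(counts), len(counts[0])

    # Mixing weights start uniform; word distributions start random, normalised.
    pi = [1.0 / k] * k
    theta = []
    for _ in range(k):
        row = [rng.random() + 0.5 for _ in range(v)]
        total = sum(row)
        theta.append([x / total for x in row])

    resp = [[1.0 / k] * k for _ in range(n)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each document,
        # computed in log space to avoid underflow on long documents.
        for i, doc in enumerate(counts):
            logp = [
                math.log(pi[c])
                + sum(x * math.log(theta[c][w]) for w, x in enumerate(doc) if x)
                for c in range(k)
            ]
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            z = sum(unnorm)
            resp[i] = [u / z for u in unnorm]

        # M-step: re-estimate mixing weights and word distributions from the
        # responsibility-weighted counts (with the assumed smoothing).
        for c in range(k):
            nc = sum(resp[i][c] for i in range(n))
            pi[c] = (nc + alpha) / (n + k * alpha)
            weighted = [
                alpha + sum(resp[i][c] * counts[i][w] for i in range(n))
                for w in range(v)
            ]
            total = sum(weighted)
            theta[c] = [x / total for x in weighted]

    return pi, theta, resp
```

Calling `em_multinomial_mixture(counts, k=2)` on a list of equal-length count rows returns the mixing weights, the per-component word distributions, and the per-document responsibilities from the last E-step. A hierarchical variant would presumably attach such mixtures to the nodes of a tree, but that structure is not specified in the abstract and is not attempted here.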