Abstract
We present an efficient algorithm, the Quadtree Heuristic,
for identifying a list of similar terms for each unique term
in a large document collection. Term similarity is defined
using the Expected Mutual Information Measure (EMIM).
Since our aim in defining the similarity lists is to improve
information retrieval (IR), we present the outcome of an
experiment comparing the performance of an IR engine
designed to use the similarity lists under two methods of
generating them: a brute-force technique and the Quadtree
Heuristic. The performance of the list generated by the
Quadtree Heuristic was commensurate with that of the
brute-force list.
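As a rough illustration of the similarity measure and the brute-force baseline named above, the following Python sketch computes EMIM from document co-occurrence counts and builds a top-k similarity list for every term by scoring all term pairs. It assumes the standard binary-occurrence form of EMIM (the mutual information between the presence/absence indicators of two terms); the abstract does not state which variant the authors use. The function names `emim` and `brute_force_similarity_lists`, the parameter `k`, and the toy input format are hypothetical. The Quadtree Heuristic itself is not sketched, since the abstract does not describe it.

```python
import math
from itertools import combinations


def emim(n11, n1, n2, n):
    """EMIM of two terms from document counts (assumed binary-occurrence form).

    n11: documents containing both terms
    n1, n2: documents containing term 1 / term 2
    n: total number of documents
    """
    # Each cell: (joint count, marginal count for term 1, marginal count for term 2)
    cells = [
        (n11, n1, n2),                          # both present
        (n1 - n11, n1, n - n2),                 # only term 1 present
        (n2 - n11, n - n1, n2),                 # only term 2 present
        (n - n1 - n2 + n11, n - n1, n - n2),    # neither present
    ]
    total = 0.0
    for joint, row, col in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 taken as 0)
            total += (joint / n) * math.log(joint * n / (row * col))
    return total


def brute_force_similarity_lists(docs, k=3):
    """Score every pair of terms and keep the k highest-EMIM terms per term."""
    n = len(docs)
    postings = {}  # term -> set of ids of documents containing it
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)

    scores = {term: [] for term in postings}
    for a, b in combinations(sorted(postings), 2):
        s = emim(len(postings[a] & postings[b]), len(postings[a]), len(postings[b]), n)
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Highest score first; ties broken alphabetically for determinism.
    return {
        term: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for term, pairs in scores.items()
    }


if __name__ == "__main__":
    toy_docs = [["a", "b", "c"], ["a", "b"], ["a", "b"], ["c"], [], []]
    print(brute_force_similarity_lists(toy_docs, k=2))
```

The pairwise loop visits every pair of distinct terms, so its cost grows quadratically with vocabulary size, which is the expense a heuristic would aim to avoid on large collections.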
Additional Information
Citation:
Wolfgang W. Bein, Jeffrey S. Coombs, Kazem Taghva,
"A Method for Calculating Term Similarity on Large Document Collections,"
International Conference on Information Technology: Computers and Communications (ITCC),
p. 199, 2003.