Abstract
Detecting and eliminating fuzzy duplicates is a critical
data cleaning task that is required by many applications.
Fuzzy duplicates are multiple seemingly distinct tuples
which represent the same real-world entity. We propose
two novel criteria that enable characterization of fuzzy
duplicates more accurately than is possible with existing
techniques. Using these criteria, we propose a novel
framework for the fuzzy duplicate elimination problem.
We show that solutions within the new framework result in
better accuracy than earlier approaches. We present an
efficient algorithm for solving instantiations within the
framework. We evaluate the algorithm on real datasets to
demonstrate its accuracy and scalability.
Additional Information
Citation:
Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani,
"Robust Identification of Fuzzy Duplicates,"
21st International Conference on Data Engineering (ICDE'05),
pp. 865-876, 2005.