September/October 2003 (Vol. 18, No. 5)
pp. 16-23
Adaptive Name Matching in Information Integration
Mikhail Bilenko, University of Texas at Austin
Raymond Mooney, University of Texas at Austin
William Cohen, Carnegie Mellon University
Pradeep Ravikumar, Carnegie Mellon University
Stephen Fienberg, Carnegie Mellon University
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2003.1234765
Abstract
Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors review traditional approaches to solving this problem and present their recent experimental results on comparing, combining, and learning textual similarity measures for name matching.
References
[1] R. Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge Univ. Press, 1998.
[2] A. Monge and C. Elkan, "The Field-Matching Problem: Algorithm and Applications," Proc. 2nd ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 267-270.
[3] A. Monge and C. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. SIGMOD Workshop Data Mining and Knowledge Discovery, ACM Press, 1997, pp. 23-29.
[4] M.A. Jaro, "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida," J. Am. Statistical Assoc., vol. 84, no. 406, June 1989, pp. 414-420.
[5] M.A. Jaro, "Probabilistic Linkage of Large Public Health Data Files," Statistics in Medicine, vol. 14, nos. 5-7, Mar./Apr. 1995, pp. 491-498.
[6] W.E. Winkler, "The State of Record Linkage and Current Research Problems," Statistics of Income Division, Internal Revenue Service Publication R99/04, 1999; www.census.gov/srd/wwwbyname.html.
[7] W.W. Cohen, "Data Integration Using Similarity Joins and a Word-Based Information Representation Language," ACM Trans. Information Systems, vol. 18, no. 3, July 2000, pp. 288-321.
[8] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Object Identification Rules for Information Integration," Information Systems, vol. 26, no. 8, Dec. 2001, pp. 607-633.
[9] T. Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer, 2002.
[10] E.S. Ristad and P.N. Yianilos, "Learning String-Edit Distance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 5, May 1998, pp. 522-531.
[11] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, 1989, pp. 257-285.
[12] M. Bilenko and R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proc. 9th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 2003), ACM Press, 2003, pp. 39-48.
Additional References
[1] H.B. Newcombe et al., "Automatic Linkage of Vital Records," Science, vol. 130, no. 3381, Oct. 1959, pp. 954-959.
[2] I.P. Fellegi and A.B. Sunter, "A Theory for Record Linkage," J. Am. Statistical Assoc., vol. 64, no. 328, Dec. 1969, pp. 1183-1210.
[3] M.A. Hernández and S.J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 95), ACM Press, 1995, pp. 127-138.
[4] A.K. McCallum, K. Nigam, and L. Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," Proc. 6th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 2000), ACM Press, 2000, pp. 169-178.
[5] H. Galhardas et al., "AJAX: An Extensible Data-Cleaning Tool," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 00), ACM Press, 2000, p. 590.
[6] H. Galhardas et al., "Declarative Data Cleaning: Language, Model, and Algorithms," Proc. 27th Int'l Conf. Very Large Databases (VLDB 2001), Morgan Kaufmann, 2001, pp. 371-380.
[7] M.-L. Lee, T.W. Ling, and W.L. Low, "Intelliclean: A Knowledge-Based Intelligent Data Cleaner," Proc. 6th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 2000), ACM Press, 2000, pp. 290-294.
[8] W.E. Winkler, "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proc. Section on Survey Research Methods, American Statistical Assoc., 1988, pp. 667-671.
[9] W.E. Winkler, Advanced Methods for Record Linkage, tech. report, Statistical Research Division, US Census Bureau, 1994.
[10] W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 98), ACM Press, 1998, pp. 201-212.
[11] K. Seymore, A.K. McCallum, and R. Rosenfeld, "Learning Hidden Markov Model Structure for Information Extraction," Papers from the 16th Nat'l Conf. Artificial Intelligence (AAAI 99), Workshop Machine Learning for Information Extraction, AAAI Press, 1999, pp. 37-42.
[12] T. Churches et al., "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," Medical Informatics and Decision Making, vol. 2, no. 9, 13 Dec. 2002; www.biomedcentral.com/1472-6947/2/9abstract.
[13] H. Pasula et al., "Identity Uncertainty and Citation Matching," Advances in Neural Information Processing Systems 15, MIT Press, 2003.
[14] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. 6th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 2000), ACM Press, 2000, pp. 255-259.
[15] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification," Proc. 8th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 2002), ACM Press, 2002, pp. 350-359.
Additional Information
Index Terms: database integration, text mining, machine learning, similarity measures
Citation:
Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, Stephen Fienberg, "Adaptive Name Matching in Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, pp. 16-23, Sept./Oct. 2003.