September/October 2003 (Vol. 18, No. 5), pp. 54-59
Profile-Based Object Matching for Information Integration

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2003.1234770

Abstract
Object matching is a fundamental problem that arises in numerous information integration scenarios. Virtually all existing solutions assume that the objects to be matched share the same attribute set and that systems can match them by comparing attribute similarities. Our work addresses the more general problem in which objects also have disjoint attributes, for example, matching tuples from relational tables that have different schemas, such as (age, name) and (name, salary). Profile-Based Object Matching (PROM), our solution to this problem, exploits disjoint attributes to improve matching accuracy. PROM first matches any two tuples based on a shared attribute, such as name. It then applies a set of profilers, each of which contains some knowledge about what constitutes a typical person. The profilers examine the tuple pair to see if it plausibly describes a person. A profiler might state, for example, that if the pair produces a person with an age of 6 and a salary of $100,000, the pair doesn't describe a real person, so the tuples don't match. Profilers can be manually specified by domain experts, trained on training data, transferred from other matching tasks, or built from external data. PROM is thus distinct in that it not only exploits disjoint attributes to improve matching accuracy but also facilitates knowledge reuse from previous object-matching tasks.
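
The two-step idea in the abstract (match on a shared attribute, then let profilers veto implausible pairs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact-equality test on name, the age and salary thresholds, and the rule that a single profiler veto rejects the pair are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
Profiler = Callable[[Record], bool]  # True means "plausible person"


def merge_on_shared(left: Record, right: Record, shared: str = "name") -> Optional[Record]:
    """Step 1 (simplified): combine two tuples only if they agree on the shared attribute.

    Exact equality is an assumption; a real matcher would likely use a
    string-similarity measure instead.
    """
    if left.get(shared) is None or left.get(shared) != right.get(shared):
        return None
    return {**left, **right}


def age_salary_profiler(person: Record) -> bool:
    """Hypothetical hand-written profiler: a child with a large salary is implausible.

    The thresholds (16 years, 50,000) are illustrative assumptions.
    """
    age, salary = person.get("age"), person.get("salary")
    if age is None or salary is None:
        return True  # abstain when a needed attribute is missing
    return not (age < 16 and salary > 50_000)


def prom_style_match(left: Record, right: Record, profilers: List[Profiler],
                     shared: str = "name") -> bool:
    """Step 2: accept the pair only if every profiler finds the merged record plausible.

    Requiring unanimous agreement is an assumption of this sketch.
    """
    candidate = merge_on_shared(left, right, shared)
    if candidate is None:
        return False
    return all(profiler(candidate) for profiler in profilers)


if __name__ == "__main__":
    child = {"age": 6, "name": "Mike Smith"}
    adult = {"age": 40, "name": "Mike Smith"}
    earner = {"name": "Mike Smith", "salary": 100_000}
    print(prom_style_match(child, earner, [age_salary_profiler]))
    print(prom_style_match(adult, earner, [age_salary_profiler]))
```

In this sketch the (age, name) and (name, salary) tuples share only name, so the disjoint attributes age and salary are what allow the profiler to reject the pair with the 6-year-old. A trained or transferred profiler could be plugged in as another callable with the same signature.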
Additional Information
Index Terms: object matching, tuple deduplication, record linkage, data cleaning, data integration

Citation: AnHai Doan, Ying Lu, Yoonkyong Lee, Jiawei Han, "Profile-Based Object Matching for Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, pp. 54-59, Sept/Oct 2003
