SEALM: Semantically Enriched Attributes with Language Models for Linkage Recommendation
Date
2025-02-02
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
Matching attributes from different repositories is an important step in schema integration, which consolidates heterogeneous data silos. To recommend linkages between relevant attributes, a contextually rich representation of each attribute is essential, particularly when more than two database schemas are to be integrated. This paper introduces the SEALM approach, which generates a data catalog of semantically rich attribute descriptions using Generative Language Models, based on a new technique that employs six variations of available metadata information. Instead of using raw attribute metadata, we generate SEALM descriptions, which are used to recommend linkages with an unsupervised matching pipeline that involves a novel multi-source Blocking algorithm. Experiments on multiple schemas yield a 5% to 20% recall improvement in recommending linkages with SEALM-based attribute descriptions generated by the smallest Llama 3.1 model (Llama3.1:8B), compared to existing techniques. With SEALM, we only need to process the small fraction of attributes to be integrated rather than exhaustively inspecting all combinations of potential linkages.
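
As a rough illustration of why blocking reduces the number of comparisons across more than two schemas, the sketch below groups attribute descriptions by shared tokens and scores only the surviving cross-schema pairs. It is not the SEALM Blocking algorithm, whose details the abstract does not give: the token-overlap blocking, the Jaccard score, the threshold, and all schema names, attribute names, and descriptions are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stopword list; a real pipeline would likely use embeddings instead.
STOPWORDS = {"the", "of", "a", "an", "for", "to", "in", "and", "on", "when", "was"}

def tokens(text):
    """Lowercase a description and return its set of non-stopword tokens."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return {w for w in cleaned.split() if w not in STOPWORDS}

def block(descriptions):
    """Return cross-schema candidate pairs that share at least one token.

    descriptions maps (schema, attribute) -> textual description.
    """
    index = defaultdict(set)  # token -> attributes whose description contains it
    for key, text in descriptions.items():
        for tok in tokens(text):
            index[tok].add(key)
    pairs = set()
    for keys in index.values():
        for a, b in combinations(sorted(keys), 2):
            if a[0] != b[0]:  # only link attributes from different schemas
                pairs.add((a, b))
    return pairs

def jaccard(x, y):
    """Token-set Jaccard similarity between two descriptions."""
    tx, ty = tokens(x), tokens(y)
    union = tx | ty
    return len(tx & ty) / len(union) if union else 0.0

def recommend(descriptions, threshold=0.3):
    """Score only the blocked candidate pairs and keep those above the threshold."""
    scored = [
        (jaccard(descriptions[a], descriptions[b]), a, b)
        for a, b in block(descriptions)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

# Hypothetical attribute descriptions from three schemas.
descriptions = {
    ("crm", "cust_dob"): "date of birth of the customer",
    ("billing", "birth_dt"): "customer birth date",
    ("billing", "inv_total"): "total amount of the invoice",
    ("support", "ticket_ts"): "time when the support ticket was opened",
}

recommendations = recommend(descriptions)
```

In this toy input there are five possible cross-schema pairs, but blocking leaves a single candidate to score, which mirrors the abstract's point about avoiding an exhaustive inspection of all combinations.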