Authors:
Lamisse F. Bouabdelli 1,2; Fatma Abdelhedi 2; Slimane Hammoudi 3 and Allel Hadjali 1
Affiliations:
1 LIAS Laboratory, ISAE-ENSMA, Poitiers, France; 2 CBI² Research laboratory, Trimane, Paris, France; 3 ESEO, Angers, France
Keyword(s):
Data Lakes, Data Quality, Entity Resolution, Entity Matching, Machine Learning.
Abstract:
Entity Resolution (ER), the task of identifying different descriptions that refer to the same real-world entity, is a critical challenge for maintaining data quality in data lakes. We address ER in data lakes, whose schema-less architecture and heterogeneous data sources often lead to entity duplication, inconsistency, and ambiguity, causing serious data quality issues. Although ER has been well studied in both academic research and industry, many state-of-the-art ER solutions have significant drawbacks. They typically compare two entities based on attribute similarity, without taking into account that some attributes contribute more than others to distinguishing entities. In addition, traditional validation methods that rely on human experts are often error-prone, time-consuming, and costly. We propose an efficient ER approach that leverages deep learning, knowledge graphs (KGs), and large language models (LLMs) to automate and enhance entity disambiguation. The matching task incorporates attribute weights, thereby improving accuracy. By integrating an LLM for automated validation, the approach significantly reduces the reliance on manual expert verification while maintaining high accuracy.
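The idea of attribute-weighted matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which relies on deep learning, knowledge graphs, and an LLM: here the per-attribute similarity is a plain string ratio from Python's standard `difflib`, and the function names, example records, and weight values are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values.

    Assumption: a simple difflib ratio stands in for whatever
    similarity measure a real ER system would use.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_match_score(e1: dict, e2: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities.

    Only attributes that appear in `weights` and in both entities are
    compared; the weights of those attributes are renormalized.
    """
    shared = [k for k in weights if k in e1 and k in e2]
    total = sum(weights[k] for k in shared)
    if total == 0:
        return 0.0
    weighted_sum = sum(
        weights[k] * attribute_similarity(str(e1[k]), str(e2[k])) for k in shared
    )
    return weighted_sum / total


# Hypothetical records: same name, different city.
e1 = {"name": "Acme Corp", "city": "Paris"}
e2 = {"name": "Acme Corp", "city": "Lyon"}

# Hypothetical weights: the same pair scores differently depending on
# which attribute is considered more discriminative.
name_heavy = weighted_match_score(e1, e2, {"name": 0.9, "city": 0.1})
city_heavy = weighted_match_score(e1, e2, {"name": 0.1, "city": 0.9})
```

With unweighted comparison both attributes would count equally; with weights, a pair that agrees on a highly discriminative attribute scores higher than one that agrees only on a weak attribute.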