
By automating validation, we eliminate the re-
liance on domain experts for this task, consequently
reducing human effort and operational costs. The ap-
proach is highly scalable, capable of efficiently vali-
dating millions of records, which makes it well suited
for large-scale entity resolution. Using the contextual
reasoning capabilities of LLMs, our method aims
to ensure high accuracy by minimizing false positives
and false negatives. Furthermore, our validation
mechanism is expected to execute faster than manual
methods, enabling real-time or near-real-time verification.
In summary, by integrating deep learning,
knowledge graphs, and LLMs, our entity resolution approach
aims to ensure a more efficient, scalable, and
reliable validation process.
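As a minimal sketch of how such automated validation could be wired together, assuming a generic text-in, text-out LLM client: the paper does not specify a prompt, a model, or an API, so `build_prompt`, `validate_pairs`, and the `ask_llm` callable below are hypothetical names, and the canned answers merely stand in for a real model call.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def build_prompt(a: Record, b: Record) -> str:
    """Render a candidate pair as a MATCH / NO_MATCH question (hypothetical wording)."""
    return (
        "Do these two records describe the same real-world entity? "
        "Answer MATCH or NO_MATCH.\n"
        f"Record A: {sorted(a.items())}\n"
        f"Record B: {sorted(b.items())}"
    )


def validate_pairs(pairs: Iterable[Pair], ask_llm: Callable[[str], str]) -> List[Pair]:
    """Keep only the candidate pairs that the validator confirms as duplicates."""
    confirmed: List[Pair] = []
    for a, b in pairs:
        # Normalize the reply so "match\n" and "MATCH" are treated alike.
        answer = ask_llm(build_prompt(a, b)).strip().upper()
        if answer == "MATCH":
            confirmed.append((a, b))
    return confirmed


# Illustrative records (invented for this example, not taken from the paper).
r1 = {"name": "ACME Corp.", "city": "Paris"}
r2 = {"name": "Acme Corporation", "city": "Paris"}
r3 = {"name": "Apex Ltd", "city": "Lyon"}

# Stand-in for a real LLM: canned replies keyed by prompt, NO_MATCH otherwise.
canned = {build_prompt(r1, r2): "match\n"}


def fake_llm(prompt: str) -> str:
    return canned.get(prompt, "NO_MATCH")


confirmed = validate_pairs([(r1, r2), (r1, r3)], fake_llm)
```

Passing the validator as a callable keeps the loop independent of any particular model provider, so the same code could be pointed at a real LLM client once the approach is implemented.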
5 CONCLUSION AND FUTURE WORK
Data quality is a critical challenge in data lakes.
Entity resolution is therefore crucial to enhancing data
quality, which is essential for making optimal decisions.
In this paper, we propose a novel entity resolution
approach designed to improve data quality, scalability,
and automation in data lakes. Our solution uses
deep learning to improve entity matching, knowledge
graphs to capture relationships between entities, and
LLMs to reduce human intervention in the validation
phase.
Our approach offers a potentially effective improvement
over existing entity resolution solutions, but
its true performance and efficiency can only be validated
through real-world implementation and experimentation.
Since our work is currently a theoretical proposal,
our next step is to implement this approach
and conduct a comprehensive evaluation against existing
solutions. We aim to demonstrate its effectiveness
in real-world settings and ultimately contribute to the
advancement of entity resolution.
While our approach focuses on the identification
of duplicate entities, we acknowledge that the subsequent
step, data fusion (merging duplicate records into
unified representations), is not addressed in this paper.
Data fusion is a critical and non-trivial component of
the ER pipeline, and we plan to investigate scalable
and context-aware fusion strategies as part of future
work.
However, we note that data fusion has already
been studied in previous research efforts (Abdelhedi
et al., 2021; Abdelhedi et al., 2022a; Abdelhedi et al.,
2022b), where our team explored merging duplicate
records in data lakes using ontology-driven integration.
Building upon such foundations, our future
efforts will aim to incorporate a robust, semantically
informed fusion module to complete the ER pipeline.
ACKNOWLEDGEMENTS
The authors thank Professor Gilles Zurfluh for
his invaluable advice, insightful ideas, and the time he
devoted to this work. His expertise has
been critical in shaping its direction.
REFERENCES
Abdelhedi, F., Jemmali, R., and Zurfluh, G. (2021). Ingestion
of a data lake into a NoSQL data warehouse: The
case of relational databases. In KMIS, pages 64–72.
Abdelhedi, F., Jemmali, R., and Zurfluh, G. (2022a). Data
Ingestion from a Data Lake: The Case of Document-oriented
NoSQL Databases. In Filipe, J., Smialek, M.,
Brodsky, A., and Hammoudi, S., editors, Proceedings
of the 24th International Conference on Enterprise
Information Systems - ICEIS 2022; ISBN 978-989-758-569-2;
ISSN 2184-4992, volume 1: ICEIS, pages
226–233, Online Streaming, France. SCITEPRESS:
Science and Technology Publications.
Abdelhedi, F., Jemmali, R., and Zurfluh, G. (2022b).
DLToDW: Transferring Relational and NoSQL
Databases from a Data Lake. SN Computer Science,
3(5):article 381.
AWS (2017). AWS Glue. https://aws.amazon.com/fr/glue/.
Barlaug, N. and Gulla, J. A. (2021). Neural networks for
entity matching: A survey. ACM Transactions on
Knowledge Discovery from Data (TKDD), 15(3):1–
37.
Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H.,
Larson, T. E., Menestrina, D., and Thavisomboon,
S. (2007). D-Swoosh: A family of algorithms for
generic, distributed entity resolution. In 27th Interna-
tional Conference on Distributed Computing Systems
(ICDCS’07), pages 37–37. IEEE.
Bilenko, M. and Mooney, R. (2003). Adaptive duplicate de-
tection using learnable string similarity measures. In
Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 39–48.
Borzsony, S., Kossmann, D., and Stocker, K. (2001). The
skyline operator. In Proceedings 17th international
conference on data engineering, pages 421–430.
IEEE.
Breiman, L. (2001). Random forests. Machine learning,
45:5–32.
Christen, P. (2012). Data Matching. Springer: Data-centric
systems and applications.
An Advanced Entity Resolution in Data Lakes: First Steps