Comparative Analysis of Entity Matching Approaches for Product Taxonomy Integration

Michel Hagenah and Michaela Kümpel
Institute for Artificial Intelligence, University of Bremen, Am Fallturm 1, 28359 Bremen, Germany
Keywords:
Entity Matching, Knowledge Engineering, Comparative Analysis, Word Embeddings, Large Language
Models, WordNet, Lemmatization.
Abstract:
This work examines different approaches to solving the entity matching problem for product categories by
converting the Global Product Classification (GPC) published by GS1 into an ontology and linking it to
the Product Knowledge Graph (ProductKG). For the implementation, methods were developed in Python for
word embeddings, WordNet, lemmatization, and large language models (LLMs), which then link classes of
the GPC ontology with the classes of the ProductKG. All approaches were carried out on the same source data
and each provided an independent version of the linked GPC ontology. As part of the evaluation, the quantities
of linked class pairs were analyzed and precision, recall, and F1 score for the Food/Breakfast segment of the
GS1 GPC taxonomy were calculated. The results show that no single approach is universally superior. LLMs
achieved the highest F1-score due to their deep semantic understanding but suffered from lower precision,
making them suitable for applications requiring broad coverage. Lemmatization achieved perfect precision,
making it ideal for use cases where false matches must be avoided, though at the cost of significantly lower
recall. WordNet offered a balanced trade-off between precision and recall, making it a reasonable default
choice. Word embeddings, however, performed poorly in both metrics and did not outperform the other
methods.
1 INTRODUCTION
Organizing products into standardized taxonomies is
essential for e-commerce, supply chain management,
and data integration (Aanen et al., 2015). However,
different classification systems, such as the Global Product Classification (GPC) by GS1 (https://www.gs1.org/) and the Product Knowledge Graph (ProductKG) (Kümpel and Beetz, 2023), use distinct structures, naming conventions, and levels of granularity. These discrepancies
create challenges in aligning product categories, mak-
ing interoperability between datasets difficult. In the
context of the Semantic Web, where data from diverse
sources should be meaningfully connected, aligning
product taxonomies is crucial for enabling seamless
data exchange and integration (Aanen et al., 2015; Christen, 2012).
The challenge of aligning product taxonomies is
a particular instance of the broader problem of entity
matching, the task of identifying and linking records
that refer to the same real-world entity (Barlaug and Gulla, 2021; Christen, 2012; Elmagarmid et al., 2007; Köpcke and Rahm, 2010), which has existed for as
long as databases have been in use. As soon as new
datasets, tables, or ontologies are created, organiza-
tions face the recurring need to integrate them with
existing ones. This challenge has been recognized for
decades (Elmagarmid et al., 2007; Christen, 2012),
and despite extensive research, no universal solution
has emerged. Entity matching remains a highly rel-
evant problem because every new data source poten-
tially introduces terminological differences, schema
variations, or domain-specific nuances that require
resolution before meaningful integration is possible.
In product classification, entity matching involves
matching categories across different taxonomies,
even when they use varying terminologies or hier-
archical structures. Ontologies, which define shared
conceptualizations of a domain, can provide struc-
tured knowledge to support this task by making re-
lationships between product categories more explicit.
However, despite significant research in ontology-
based and data-driven entity matching, existing ap-
proaches still face notable limitations:
Figure 1: Overview of the evaluation workflow: (1) Preparation of source data, (2) Execution of method implementations on
the same source data.
- Rule-based methods, while interpretable, struggle with complex variations in naming conventions and require extensive manual effort (Cohen et al., 2003).
- WordNet-based approaches leverage predefined lexical relationships but may lack coverage for domain-specific terminology (Agirre et al., 2009).
- Word embeddings and Large Language Models (LLMs) offer more flexibility by capturing semantic similarities, yet they remain prone to errors in cases where product categories have subtle distinctions (Narayan et al., 2022).
These shortcomings highlight the need for a system-
atic evaluation of different techniques to determine
the most effective approach for aligning product tax-
onomies in a structured and scalable manner. While
other works have shown how these approaches can be
used for entity matching (e.g. Narayan et al. (2022);
Zhang et al. (2024); Peeters et al. (2023); Aanen et al.
(2015); Zhu and Iglesias (2018); Jatnika et al. (2019)),
a lack of a broad comparative analysis is still appar-
ent.
This work presents a comparative analysis of
four entity-matching techniques: lemmatization,
WordNet-based similarity, word embeddings, and
LLMs. Each approach is applied to aligning prod-
uct categories between GPC and ProductKG, assess-
ing their accuracy and effectiveness in resolving nam-
ing and structural inconsistencies. Some of the meth-
ods analyzed, such as lemmatization and WordNet,
are based on older or more foundational techniques.
However, they still offer valuable insights into the
core challenges of entity matching. Although more
modern and sophisticated approaches exist, evaluat-
ing them in depth would exceed the scope of this
work. The goal is to establish a solid baseline by com-
paring a range of fundamental methods, including a
more recent LLM-based approach, and to highlight
their strengths and weaknesses as a basis for future
research.
Based on the current trajectory of research in nat-
ural language processing, we hypothesise that
Hypothesis 1.1. LLMs will perform best in the exper-
iment.
The broad coverage of LLMs and their ability to
capture nuanced semantic relations are likely to give
them an advantage, although tendencies to overgener-
alize and to generate hallucinations will certainly be
limiting factors.
Hypothesis 1.2. Word embeddings are expected to
follow closely.
Since word embeddings are also trained on large
text corpora and can model semantic similarity effec-
tively, they are expected to perform almost as well as
LLMs.
Hypothesis 1.3. WordNet is anticipated to provide
decent performance.
As WordNet encodes structured lexical relation-
ships, it is expected to also perform well. However,
its coverage of domain-specific terminology is lim-
ited, which leads us to assume that it performs worse
than LLMs and word embeddings.
Hypothesis 1.4. Lemmatization is expected to yield
the weakest results.
Lemmatization merely normalizes words and checks for string matches without deeper semantic reasoning, leading us to expect the weakest performance in the experiment.
The contributions of this work are:
1. A systematic evaluation of entity-matching tech-
niques in the context of product classification and
the Semantic Web.
2. An analysis of the strengths and weaknesses of
the four different approaches in handling naming
variations, hierarchical differences, and ontology-
based relationships.
3. An experimental validation using real-world prod-
uct taxonomies to assess matching accuracy and
practical applicability.
2 RELATED WORK
The use of various technologies to solve the entity matching problem has been discussed in several previous works. While prior work has applied these methods in isolation, no systematic head-
to-head comparison has been made in the context of
product taxonomies. This motivates our comparative
study.
2.1 Obtaining Semantic Similarity from
WordNet
WordNet (Fellbaum, 2010) has been used to obtain
semantic similarity of different words in multiple
works. Gurevych and Strube (2004) propose a spo-
ken dialogue summarization method using seman-
tic similarity metrics from WordNet. Their system
extracts key utterances by computing the similarity
between an utterance and the entire dialogue. Al-
though various works have analyzed the effectiveness
of WordNet-based approaches to obtain semantic sim-
ilarity (Gurevych and Strube, 2004; Meng et al., 2013;
Farouk, 2018; Agirre et al., 2009) they do not com-
pare its effectiveness to approaches based on other
methods such as word embeddings or large language
models. Aanen et al. (2015) present an algorithm that
uses WordNet as one of its core components to map
different product taxonomies to each other to aggre-
gate product information from different sources. Al-
though their algorithm is specifically made for prod-
uct taxonomies, they also do not provide any broader
comparison of other methods.
2.2 Calculating Semantic Similarity
from Word Embeddings
Semantic vector representations of words, referred to
as word embeddings, can also be used to obtain se-
mantic similarity of words. By calculating the dis-
tance between two vectors using the cosine function,
such a value can be retrieved (Farouk, 2018; Zhu and
Iglesias, 2018; Jatnika et al., 2019; Kenter and De Ri-
jke, 2015).
Kenter and De Rijke (2015) proposed an alterna-
tive way to calculate the similarity of text using word
embeddings rather than employing approaches such
as lexical matching, syntactical analysis, or hand-
made patterns.
While Farouk (2018) compares the sentence-level semantic similarity capabilities of WordNet and word embeddings on standardised datasets, it does not directly compare their ability to resolve two different identifiers for the same real-world entity, which represents
the core issue of the entity matching problem.
2.3 Entity Matching with Transformer
Based Models
The field of research that analyses the usage of data
integration capabilities of language models, whether
large or small, has gained a lot of traction in recent
years through the rise in popularity of generative AI.
Narayan et al. (2022) present how general-purpose transformer models (Vaswani et al., 2017), in this case GPT-3, can be used for data integration tasks, including entity matching. Although such models have no task-specific fine-tuning beforehand, the language model is found to outperform the state-of-the-art solution for each analysed data integration task.
Peeters et al. (2023) specifically present the capabilities of large language models for solving the entity matching problem. They compare different large language models against pre-trained language models (PLMs). Using a range of prompts in zero-shot and few-shot scenarios, the work reveals
that there is no single best prompt for a given model
or dataset, but rather for a specific model and dataset
combination. Furthermore, the quality of the results
is very sensitive to prompt variation. They concluded
that LLMs outperform PLMs in entity matching tasks
in certain scenarios, despite the fact that the PLMs
were trained on task-specific data.
Zhang et al. (2024) show how even small language models can be used to solve the entity matching problem. To demonstrate that entity matching can be performed with significantly fewer resources, they created a GPT-2 based model called AnyMatch, carefully selecting the data used to train the model for the entity matching task at hand. They then evaluated AnyMatch in a zero-shot setting and found that its F1-score was only 4.4% worse than the GPT-4 based MatchGPT, despite requiring significantly fewer resources.
These works presented how transformer-based
language models can be used to perform entity match-
ing. While they do, among other things, present comparisons to other models, they do not compare their effectiveness with entity matching approaches based on WordNet or word embeddings.
Table 1: Overview of selected related work on entity matching and semantic similarity.

Domain / Focus | Techniques Used
Product taxonomy mapping (Aanen et al., 2015) | WordNet + Rule-based
Dialogue summarization (Gurevych and Strube, 2004) | WordNet similarity
Semantic similarity measures (Meng et al., 2013) | WordNet similarity
Sentence similarity (Farouk, 2018) | WordNet vs. Word embeddings
Entity disambiguation in KGs (Zhu and Iglesias, 2018) | Word embeddings
Word similarity (Jatnika et al., 2019) | Word embeddings (Word2Vec)
Short text similarity (Kenter and De Rijke, 2015) | Word embeddings
Data integration tasks (Narayan et al., 2022) | Transformer-based LLM (GPT-3)
Entity matching (Peeters et al., 2023) | LLMs (GPT-family) vs. PLMs
Resource-efficient entity matching (Zhang et al., 2024) | Small LLM (AnyMatch, GPT-2 based)
3 PRODUCT TAXONOMIES
In this section we introduce the product taxonomies
employed in our analysis, detailing their structures,
purposes, and relevance to our research objectives.
Specifically, we focus on the Global Product Classification (GPC) for its use as a global standard and its broad coverage of product categories, and on the Product Knowledge Graph (ProductKG) for its practical applications. These taxonomies serve as the source
and target of the different entity matching methods
explored in this work.
3.1 The Global Product Classification
The GPC (GS1, 2024a), developed by GS1, is an in-
ternationally recognized standard for the systematic
categorization of products. Its primary purpose is to
provide a common language that enables companies,
marketplaces, governmental bodies and other stake-
holders to classify products in a consistent and unam-
biguous manner, thereby facilitating interoperability
across supply chains and ensuring that product infor-
mation can be exchanged without semantic conflicts.
By establishing such a standardized framework, GPC
reduces inefficiencies in trade processes, data align-
ment, and regulatory reporting, all of which would
otherwise be prone to inconsistencies if organizations
relied solely on internal product taxonomies (GS1,
2025b).
GPC is designed as a hierarchical taxonomy con-
sisting of four distinct levels. At the most general
level, the Segment represents a broad industry sec-
tor, such as “Food/Beverage/Tobacco” or “Electron-
ics”. Each Segment is subdivided into Families,
which group products of similar nature within that
sector. Families are then divided into Classes, pro-
viding further specificity, and the most detailed level
is the Brick, which clusters closely related products
and serves as the operational unit for product identi-
fication. Bricks are associated with so-called Brick
Attributes, which describe key characteristics of the
products within them, such as screen size for laptops
or roast type for coffee. For example, a bag of roasted
coffee beans would be categorized in the Segment
“Food/Beverage/Tobacco”, within the Family “Beverage”, in the Class “Coffee”, and finally in the Brick “Roasted Coffee”, with attributes specifying features
such as caffeine content, roast level, or packaging
type. This hierarchical and attribute-based structure
ensures that trading partners in different countries can
describe and exchange information about the same
product with precision and without ambiguity (GS1,
2015).
Unlike static taxonomies, GPC is continuously
updated through a governance process coordinated by
GS1. Industry stakeholders, including manufactur-
ers, retailers, distributors, and regulators, can submit
change requests when gaps, ambiguities, or emerging
product categories are identified. These proposals are
reviewed and discussed in the Global Standards Man-
agement Process (GSMP), where consensus-based
decisions ensure that modifications are both relevant
and implementable (GS1, 2024b). Each release con-
tains a complete XML schema, representing the cur-
rent state of the taxonomy, as well as an XML delta
file that highlights the changes relative to the previ-
ous version (GS1, 2015). These updates are then dis-
tributed across the GS1 system, including the Global
Data Synchronisation Network (GDSN), to guarantee
alignment between stakeholders (GS1, 2015).
The integration of GPC into the GDSN highlights
its significance in global data exchange. The GDSN
is a network of interoperable GS1-certified data pools
that allows companies worldwide to exchange stan-
dardized and trusted product information in real time
(GS1, 2025a). It operates on a publish–subscribe
model: a brand owner enters product data once into
a data pool, and all subscribed trading partners au-
tomatically receive the same data, eliminating dupli-
cation and inconsistencies (Wikipedia, 2024). GPC
plays a crucial role within this framework by acting as
the categorical backbone against which product data
is structured and validated (GS1, 2025c). In practi-
cal terms, this ensures that a newly introduced product will be categorized consistently across the net-
work, rather than being subject to differing local tax-
onomies.
In sum, the GS1 Global Product Classification
provides a universal, hierarchical, and attribute-
enriched framework for product categorization that
evolves in line with market innovations. Through
its integration into the GDSN, it enables seamless,
real-time, and semantically coherent product data ex-
change, underpinning the reliability and efficiency of
global commerce.
For the purpose of this work, the GPC dataset was
transformed from its XML format into an OWL on-
tology to support ontological linking and to facilitate
its use in the different matching tasks. Each prod-
uct category from the original dataset was mapped to
an owl:Class. To preserve the hierarchical structure
of the taxonomy, the rdfs:subClassOf property was
used. Starting from the segment level, every family
was linked to its corresponding segment via this prop-
erty, and the same approach was applied recursively
down to classes and bricks.
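To illustrate this conversion, the following minimal sketch builds the OWL class hierarchy with rdflib. The XML element names, attribute names, and the ontology namespace are assumptions for illustration; the exact GS1 schema layout is not reproduced here.

```python
# Sketch of the GPC XML -> OWL conversion; assumes nested
# <segment>/<family>/<class>/<brick> elements carrying "code" and "text"
# attributes, which is an approximation of the GS1 GPC XML layout.
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

GPC = Namespace("http://example.org/gpc#")  # hypothetical ontology namespace

def add_class(graph, node, parent_uri=None):
    uri = GPC[node.get("code")]
    graph.add((uri, RDF.type, OWL.Class))
    graph.add((uri, RDFS.label, Literal(node.get("text"))))
    if parent_uri is not None:
        # Preserve the hierarchy: family -> segment, class -> family, ...
        graph.add((uri, RDFS.subClassOf, parent_uri))
    for child in node:  # recurse down to classes and bricks
        add_class(graph, child, uri)

graph = Graph()
root = ET.parse("gpc.xml").getroot()
for segment in root.findall("segment"):
    add_class(graph, segment)
graph.serialize("gpc.owl", format="xml")  # RDF/XML serialization
```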
3.2 The Product Knowledge Graph
(ProductKG)
ProductKG (Kümpel and Beetz, 2023) is an open-source product knowledge graph that integrates product information obtained from the Web with environment data provided by a semantic digital twin (semDT) (Kümpel et al., 2021). The purpose of this
system is to combine abstract product data, such as
taxonomies, ingredients, nutritional values, and la-
bels, with information about the physical world in
which products exist in specific locations and quan-
tities. In this way, ProductKG enables intelligent ap-
plications to reason about products in ways that are
both semantically rich and contextually grounded.
The semDT component of ProductKG represents
environment information in the form of a spatial and
relational model of a physical setting, for example
a retail store. Its standardised representation enables omni-channel applications (Kümpel and Dech, 2025). It represents shelf layouts, product place-
ments, stock levels, and prices, which are often de-
rived from robotic perception systems as described
in Beetz et al. (2022). As a result, the semDT provides
knowledge about where products are located and how
many are available, linking physical product instances
to their digital representations.
On the other hand, ProductKG integrates diverse
sources of product information. It includes product
taxonomies, ingredient classifications that are con-
nected to allergens, product labels such as brands and
packaging details, nutritional data and physical di-
mensions. Structurally, ProductKG is modular and
consists of interconnected ontologies that focus on
different aspects of product knowledge. Products
are interlinked across these ontologies by standard-
ized identifiers, most prominently the Global Trade
Item Number (GTIN), which ensures reliable
alignment between web-based product data and envi-
ronment observations.
The integration of these two knowledge domains
makes ProductKG suitable for applications that re-
quire both semantic understanding and contextual
awareness. Examples include shopping assistants that
highlight suitable products on store shelves accord-
ing to user preferences (Kümpel et al., 2023), dietary
recommenders that suggest products or recipes based
on nutritional information and personal profiles, and
cooking assistants that relate available products in a
household to recipe steps. ProductKG is exposed
through a SPARQL endpoint, which allows external
applications to query and combine product character-
istics, nutritional attributes, or product hazard infor-
mation.
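A hedged sketch of such a query is shown below; the endpoint URL is a placeholder, and the query simply lists labelled classes rather than reflecting the actual ProductKG schema.

```python
# Sketch of querying a ProductKG-style SPARQL endpoint with SPARQLWrapper;
# the endpoint URL is an illustrative placeholder, not the real address.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/productkg/sparql")
endpoint.setQuery("""
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cls ?label WHERE {
        ?cls a owl:Class ;
             rdfs:label ?label .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["cls"]["value"], row["label"]["value"])
```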
By combining web-based product information
with a semDT, ProductKG advances the concept of
context-aware product reasoning. It allows applica-
tions not only to identify that a product is vegan or
gluten-free, but also to verify whether such a prod-
uct is present in the immediate environment, whether
suitable alternatives are available, and how it can be
incorporated into a dietary plan or recipe. This ability
makes ProductKG a valuable resource for intelligent
assistants in domains such as retail and household en-
vironments.
One of the central components of ProductKG is
the product taxonomy, which defines a variety of
product categories. These categories were extracted
from online sitemaps of supermarket websites and are
further enriched through integration with existing on-
tologies such as FoodOn (Dooley et al., 2018). Due
to its broad coverage of everyday consumer products
and its native availability in OWL format, the prod-
uct taxonomy could be directly reused in this work
without the need for further transformation or prepro-
cessing.
4 ENTITY MATCHING
To evaluate the different semantic similarity tech-
niques, each matching approach was implemented as
an independent module. All methods operate on the
same input data: the set of ontology classes of the
converted GPC dataset and the ontology classes ob-
tained from the product taxonomy. The modules pro-
duce their output independently, enabling an objec-
tive comparison of their performance. Despite their
independence, all methods adhere to some predefined
rules:
- At Most One Matched Class: A GPC class should be linked to at most one taxonomy class. If multiple valid matches are found, the best one is selected.
- Subclass Inheritance: If a GPC class is matched to a taxonomy class, all its subclasses are assumed to belong to the same category.
- Conflict Resolution with Superclasses: If a superclass has already been matched to a taxonomy class, its subclasses are not allowed to match to that same class. In such cases, the system selects the next-best available and valid match.
Once a match is confirmed, the ontology class is
linked to the matched taxonomy class using the
oboInOwl:hasDbXref annotation property. This en-
sures that external taxonomy references are properly
encoded in the resulting ontology structure.
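In rdflib, recording such a link might look roughly as follows; the oboInOwl namespace URI is the standard one, while the two class URIs are placeholders for actual GPC and ProductKG classes.

```python
# Sketch of annotating a confirmed match with oboInOwl:hasDbXref;
# both class URIs below are illustrative placeholders.
from rdflib import Graph, Namespace, URIRef

OBOINOWL = Namespace("http://www.geneontology.org/formats/oboInOwl#")

graph = Graph()
gpc_class = URIRef("http://example.org/gpc#10000045")             # hypothetical GPC brick
pkg_class = URIRef("http://example.org/productkg#RoastedCoffee")  # hypothetical taxonomy class
graph.add((gpc_class, OBOINOWL.hasDbXref, pkg_class))  # encode the match
graph.serialize("gpc_linked.owl", format="xml")
```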
4.1 Word Embeddings
Word embeddings are vector representations of words
in a high-dimensional space that capture semantic re-
lationships based on their usage in large text cor-
pora (Kusner et al., 2015; Almeida and Xexéo, 2023).
Training algorithms such as Word2Vec (Mikolov
et al., 2013) or GloVe (Pennington et al., 2014) lever-
age the distributional hypothesis (Sahlgren, 2008) to
learn these representations so that semantically sim-
ilar words are located near each other in the vector
space. This allows for the computation of semantic
similarity using cosine similarity (Farouk, 2018; Jatnika et al., 2019).
In this implementation, the pre-trained glove-twitter-200 model, available through the Python package gensim, was used (Řehůřek and Sojka, 2010). For product class names consisting
of multiple words, embeddings were computed
for each individual word and averaged to obtain a
single vector representation per label. The similarity
between ontology classes and taxonomy entries was
then calculated using cosine similarity between these
average vectors.
Since cosine similarity ranges from −1 to 1, a
threshold must be defined to determine when two
product class names are considered a semantic match.
To analyze the effect of this parameter, two runs were
performed using thresholds of 0.75 and 0.80 respec-
tively. If multiple valid matches are found, then the
one with the highest cosine similarity will be selected.
The algorithm is detailed in the following:
Algorithm 1: Entity Matching with Word Embeddings.
Require: GPC classes C_GPC, ProductKG classes C_PKG, pre-trained GloVe model, threshold t
1: for each c_g ∈ C_GPC do
2:    Compute embedding v_g as the mean of the word vectors of c_g
3:    for each c_p ∈ C_PKG do
4:       Compute embedding v_p as the mean of the word vectors of c_p
5:       Compute cosine similarity s = cos(v_g, v_p)
6:       if s ≥ t then
7:          Add (c_g, c_p, s) to candidate matches
8:       end if
9:    end for
10:   Select the candidate with maximum s for c_g (if any)
11:   Apply subclass inheritance and conflict resolution rules
12:   Link c_g → c_p with oboInOwl:hasDbXref
13: end for
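A condensed Python sketch of Algorithm 1 is given below; the gensim model name matches the one stated above, while the whitespace tokenization, skipping of out-of-vocabulary words, and omission of the subclass and conflict-resolution rules are simplifying assumptions.

```python
# Sketch of Algorithm 1 using gensim's pre-trained glove-twitter-200 vectors;
# subclass inheritance and conflict resolution are omitted for brevity.
import numpy as np
import gensim.downloader as api

model = api.load("glove-twitter-200")  # downloads the pre-trained GloVe model

def label_vector(label):
    """Average the vectors of all in-vocabulary tokens of a class label."""
    vectors = [model[w] for w in label.lower().split() if w in model]
    return np.mean(vectors, axis=0) if vectors else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(gpc_label, pkg_labels, threshold=0.75):
    """Return the highest-scoring ProductKG label above the threshold, if any."""
    v_g = label_vector(gpc_label)
    if v_g is None:
        return None
    candidates = []
    for pkg_label in pkg_labels:
        v_p = label_vector(pkg_label)
        if v_p is not None:
            score = cosine(v_g, v_p)
            if score >= threshold:
                candidates.append((score, pkg_label))
    return max(candidates, default=None)

print(best_match("roasted coffee", ["coffee", "tea", "bread"]))
```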
4.2 WordNet
WordNet is a lexical database that organizes words
into sets of cognitive synonyms called synsets, which
represent distinct concepts. Each synset is linked to
other synsets through various semantic relations, in-
cluding antonymy, hyponymy (subclass), hypernymy
(superclass), meronymy (part-whole), and holonymy
(whole-part), forming a graph-like structure of se-
mantic relationships (Fellbaum, 2010).
This structure allows the computation of semantic
similarity between words based on the shortest path or
other distance metrics between synsets (Agirre et al.,
2009; Meng et al., 2013). In this implementation,
similarity scores were derived from such path-based
measures.
As with word embeddings, a similarity threshold
must be defined to determine whether two terms are
considered a match. To analyze the effect of this
threshold, two runs were conducted with values of
0.33 and 0.40. Similarly, if multiple valid matches
are detected, the one with the highest value will be
selected. WordNet is accessible as a database and
through various programming libraries such as NLTK
(Bird et al., 2009).
Similar to Algorithm 1, we compute the WordNet
similarity as described in the following:
Algorithm 2: Entity Matching with WordNet Similarity.
Require: GPC classes C_GPC, ProductKG classes C_PKG, threshold t
1: for each c_g ∈ C_GPC do
2:    for each c_p ∈ C_PKG do
3:       Compute WordNet path-based similarity s(c_g, c_p)
4:       if s ≥ t then
5:          Add (c_g, c_p, s) to candidate matches
6:       end if
7:    end for
8:    Select the candidate with maximum s for c_g (if any)
9:    Apply subclass inheritance and conflict resolution rules
10:   Link c_g → c_p with oboInOwl:hasDbXref
11: end for
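A corresponding Python sketch using NLTK is shown below; collapsing multi-word labels into a single WordNet lookup and taking the maximum over noun synsets are simplifications of the actual implementation.

```python
# Sketch of the WordNet path-based similarity used in Algorithm 2;
# requires the WordNet corpus: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def wordnet_similarity(label_a, label_b):
    """Maximum path similarity over the noun synsets of two class labels."""
    synsets_a = wn.synsets(label_a.replace(" ", "_"), pos=wn.NOUN)
    synsets_b = wn.synsets(label_b.replace(" ", "_"), pos=wn.NOUN)
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    return max((s for s in scores if s is not None), default=0.0)

# A pair counts as a candidate match if the score reaches the threshold t.
print(wordnet_similarity("coffee", "beverage"))
```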
4.3 Lemmatization
The lemmatization-based approach performs lexical
matching by first normalizing the names of the tax-
onomy classes. Lemmatization reduces inflected or
derived words to their base or dictionary form (e.g.,
“running” → “run”), which is particularly useful for
improving consistency in string comparison (Khyani
et al., 2021).
After lemmatizing all terms, the resulting lemmas
are compared to each other. If two terms yield exactly
the same lemma, they are considered a match. This
method extends simple string matching by making it
robust against grammatical variations such as plural
forms or verb conjugations (Khyani et al., 2021).
While it does not capture deeper semantic similar-
ity, lemmatization provides an efficient and linguis-
tically grounded baseline for identifying equivalent
concepts based on surface forms.
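A minimal sketch of this comparison with NLTK's WordNet lemmatizer is given below; reducing matching to token-wise lemma equality stands in for the full matching and conflict-resolution rules.

```python
# Sketch of the lemmatization-based matching from Section 4.3;
# requires the WordNet corpus: nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma_key(label):
    """Normalize a class label to a tuple of (noun) lemmas."""
    return tuple(lemmatizer.lemmatize(w) for w in label.lower().split())

def lemma_match(label_a, label_b):
    """Two labels match only if their lemmatized forms are identical."""
    return lemma_key(label_a) == lemma_key(label_b)

print(lemma_match("Coffees", "coffee"))  # True: plural reduced to base form
```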
4.4 LLMs
The LLM-based approach leverages the emergent abilities of large-scale language models, which appear as models are scaled up with more parameters and data (Zhao et al., 2023; Wei et al., 2022). These models demonstrate a deeper semantic understanding of concepts, enabling them to match class labels based on meaning rather than surface similarity (Zhang et al., 2024; Peeters et al., 2023).
Algorithm 3: Entity Matching with Lemmatization.
Require: GPC classes C_GPC, ProductKG classes C_PKG
1: Lemmatize all terms in C_GPC and C_PKG
2: for each c_g ∈ C_GPC do
3:    for each c_p ∈ C_PKG do
4:       if lemma(c_g) = lemma(c_p) then
5:          Match c_g → c_p
6:          Apply subclass inheritance and conflict resolution rules
7:          Link c_g → c_p with oboInOwl:hasDbXref
8:       end if
9:    end for
10: end for
Due to the large number of classes in both the
GPC dataset and the product taxonomy, computa-
tional and memory limitations had to be considered.
Instead of prompting the model with every possible
class pair, the entity matching task was executed as a
bulk operation. We selected the remotely available
GPT-4o (OpenAI, 2024) for its support of file up-
loads, allowing a more flexible input format.
To simplify processing and reduce syntactic com-
plexity, both ontologies were converted to plain XML.
The resulting files were split into smaller chunks to fit
within the model’s processing limits. For each chunk,
the following prompt was used:
- Match GPC classes to ProductKG taxonomy classes based on semantic similarity, not string similarity
- Avoid code generation or syntactic reformulation
- Follow the same matching rules used across all implementations (e.g., single best match, subclass inheritance, conflict resolution)
- Ensure all matched classes exist in the input to avoid hallucinations
This process was repeated for all chunks. The
results were then merged, and class names were re-
solved back to their original ontology terms. Valid
matches were linked using the oboInOwl:hasDbXref
annotation property.
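The sketch below approximates this workflow with the OpenAI Python client; the chunk size, file names, and condensed prompt are illustrative assumptions, and where the experiments used GPT-4o's file uploads, the sketch simply inlines each chunk into the prompt.

```python
# Sketch of the chunked bulk-matching workflow; chunk size, file names,
# and the condensed prompt are assumptions, not the exact experimental setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Match GPC classes to ProductKG taxonomy classes based on semantic "
          "similarity, not string similarity. Return one best match per GPC "
          "class and only use classes that occur in the input.")

def chunks(lines, size=200):
    """Split the flattened ontology into pieces that fit the context window."""
    for i in range(0, len(lines), size):
        yield "\n".join(lines[i:i + size])

gpc_lines = open("gpc_plain.xml").read().splitlines()
pkg_xml = open("productkg_plain.xml").read()

raw_matches = []
for part in chunks(gpc_lines):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{PROMPT}\n\nGPC chunk:\n{part}\n\nProductKG:\n{pkg_xml}"}],
    )
    raw_matches.append(response.choices[0].message.content)  # merged afterwards
```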
5 EVALUATION
This section presents and compares the results of the
individual matching approaches.
Figure 2: UpSet plot (Lex et al., 2014) illustrating the dis-
tribution and intersections of matched class pairs across the
different entity matching approaches. Each bar represents
the number of matches found either uniquely by a single
method or jointly by multiple methods, indicated by the
dots. This visualization highlights both agreement and di-
vergence among the implemented techniques. Sets below
the size of 8 are not shown here for the sake of readability.
5.1 Quantitative Evaluation
We analyze the matched class pairs from each method
and evaluate their overlap using an UpSet plot (Lex
et al., 2014). To assess performance more precisely,
we calculate precision, recall, and F1 score on a fo-
cused subset of the dataset.
Figure 2 visualizes the class pairs matched by
each method and their intersections. The LLM-based
approach identifies many unique matches, suggest-
ing a broader semantic range. The word embed-
dings method with a 0.75 threshold also finds distinct
matches not shared with the stricter 0.80 setting or
other methods.
The two WordNet-based variants yield nearly
identical results, indicating low sensitivity to thresh-
old changes (0.33 vs. 0.40). Interestingly, the
0.80 word embeddings configuration produces some
unique matches not present at 0.75, likely due to
higher similarity scores for more specific terms.
Lemmatization, by contrast, contributes few unique
matches and mostly overlaps with other methods, re-
flecting its reliance on surface-level similarity.
While Figure 2 highlights overlaps and differ-
ences, it does not indicate which method performs
best. For that, we evaluate the approaches on
a smaller, manageable subset of the dataset: the
Food/Beverages segment of GPC. This domain al-
lowed us to manually create a gold-standard mapping
for reference.
Based on this reference, we compute standard
evaluation metrics: precision, recall, and F1 score.
These are derived from the true positives, false pos-
itives, and false negatives for each method, as shown
in Table 2.
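These metrics follow the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall), where TP, FP, and FN denote the counts of true positives, false positives, and false negatives.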
The results reveal clear performance differences.
Lemmatization achieves the highest precision (1.0)
but has low recall, resulting in a modest F1 score.
It identifies only highly accurate matches but misses
many valid ones.
The LLM-based method reaches the highest F1
score due to its much higher recall, though it also has
the lowest precision (0.653), reflecting a larger num-
ber of false positives.
Both word embedding configurations achieve rel-
atively high precision but very low recall, leading to
the lowest F1 scores overall.
This reflects a conservative matching strategy that
captures only a small subset of correct pairs.
WordNet-based methods offer a more balanced
trade-off. With slightly lower precision but higher
recall than lemmatization and embeddings, they out-
perform both in terms of F1 score. Nonetheless, the
LLM-based method leads in overall coverage due to
its superior recall.
5.2 Qualitative Evaluation
In addition to the quantitative results, we also provide a set of representative examples in Table 3 to qualitatively illustrate the error patterns of the different approaches. These examples were not directly extracted from the evaluation dataset, which reported aggregated counts only, but were instead constructed to reflect the typical strengths and weaknesses observed in the quantitative analysis.
Table 2: Results for all approaches.

Method | Precision | Recall | F1-Score
Lemmatization | 1.000 | 0.263 | 0.417
LLM | 0.653 | 0.410 | 0.504
W. Emb. (x=0.80) | 0.882 | 0.161 | 0.273
W. Emb. (x=0.75) | 0.895 | 0.183 | 0.304
WordNet (x=0.33) | 0.931 | 0.290 | 0.443
WordNet (x=0.40) | 0.935 | 0.312 | 0.468
Table 3: Representative true positives, false positives, and false negatives for each entity matching approach.

Method | True Positive (Correct Match) | False Positive (Wrong Match) | False Negative (Missed Match)
Lemmatization | “Coffees” → “Coffee” | — | “Roasted Coffee” vs. “Coffee Beans”
WordNet | “Bread” → “Loaf” | “Oil” → “Petroleum” | “Tofu” vs. “Soybean Product”
Word Embeddings | — | “Cake” → “Biscuit” | “Skimmed Milk” vs. “Low-Fat Milk”
LLM | “Almond Milk” → “Plant-based Drinks” | “Energy Bar” → “Chocolate” | “Granola” vs. “Breakfast Cereals”
For instance, lemmatization produces exact lexical matches such as Coffees → Coffee but fails in cases of synonymy, whereas LLMs demonstrate broader coverage (e.g., Almond Milk → Plant-based Drinks) at the cost of more frequent false positives.
The observed error tendencies also have impor-
tant implications for practical applications of taxon-
omy alignment. False positives produced by LLMs,
such as mapping Energy Bar to Chocolate, may be
acceptable in consumer-facing scenarios like recom-
mender systems, where broad semantic coverage is
beneficial and occasional overgeneralization does not
critically harm usability. However, in compliance-
critical contexts such as allergen tracking or regula-
tory reporting, such overextensions could lead to seri-
ous misclassifications and must therefore be avoided.
Conversely, the false negatives typical of lemmatiza-
tion (e.g., Roasted Coffee vs. Coffee Beans) indicate
that while this method ensures perfect precision, it
risks omitting many valid mappings, limiting its suit-
ability for applications where comprehensive cover-
age is essential. WordNet and word embeddings fall
between these extremes, offering moderate trade-offs
but still showing domain-specific weaknesses. Taken
together, these qualitative patterns underscore that the
choice of method should be guided not only by aggre-
gate scores but also by the specific error tolerance of
the intended use case.
5.3 Discussion
The comparative evaluation of the entity matching approaches confirms several aspects of the initial hypotheses while also revealing some unexpected outcomes.
Result 5.1. The LLM-based method indeed outper-
formed the others in terms of coverage, identifying the
highest number of matches, including unique ones not
detected by alternative techniques.
This demonstrates that large language models can
capture subtle semantic relationships beyond lexical
or structural similarity, as hypothesized. However,
this strength comes at the cost of precision.
Result 5.2. The results clearly show a tendency of
LLMs to overgeneralize, leading to false positives,
which aligns with the predicted challenge of halluci-
nations and overextension.
Such behavior may still be advantageous in appli-
cation contexts where broad semantic coverage is de-
sired, for example in shopping or recommendation
systems, but it introduces risks in scenarios requiring
high reliability.
Result 5.3. Lemmatization, as expected, performed
the weakest overall in terms of coverage.
Its perfect precision highlights that it is highly
conservative and produces no false positives, but this
comes at the expense of very limited recall.
Result 5.4. This outcome supports the hypothesis that
lemmatization is too restrictive to capture semanti-
cally related but lexically different terms.
This makes it suitable only for use cases where
absolute accuracy is more important than flexibility,
such as medical or allergen-sensitive applications.
Result 5.5. WordNet delivered results that aligned
well with the hypothesis, providing decent perfor-
mance and a balanced trade-off between recall and
precision.
WordNet consistently outperformed lemmatiza-
tion by identifying semantically related terms while
remaining robust against false positives. The mini-
mal impact of changing the similarity threshold fur-
ther indicates that WordNet-based similarity offers
predictable and reliable behavior, though its limited
coverage reflects its restricted lexical scope.
WordNet’s predictable balance between recall and
precision makes it attractive for lightweight applica-
tions where stable performance is more valuable than
full coverage, for instance in smaller-scale taxonomy
integration tasks or as an interpretable baseline in
educational and research settings.
Result 5.6. Contrary to the hypothesis, word embed-
dings did not closely follow LLMs in performance.
Despite their potential to capture nuanced simi-
larity through training on large text corpora, the re-
sults were significantly weaker than anticipated. Both
tested thresholds resulted in low recall, and while
the stricter threshold occasionally identified matches
missed by the more lenient one, overall effectiveness
remained limited.
This underperformance may be explained by do-
main mismatch, as pre-trained embeddings were not
optimized for product taxonomies.
Result 5.7. This suggests that in the specific domain
of product taxonomies, pretrained word embeddings
may not capture the necessary semantic granularity
or domain-specific knowledge.
Although pre-trained embeddings underper-
formed in this study, they may still prove useful
in scenarios where domain-specific retraining is
feasible, or as a candidate generation step in hybrid
pipelines that rely on more expressive models for
final matching.
Overall, the results partially validate our hypothe-
ses. LLMs demonstrated the broadest coverage and
strongest ability to capture semantic relations, though
at the expected cost of precision. WordNet and
lemmatization behaved largely as anticipated, with
WordNet offering moderate effectiveness and lemma-
tization remaining overly restrictive. The unexpected
underperformance of word embeddings indicates that
their usefulness in this task may be constrained with-
out domain-specific adaptation.
Result 5.8. Ultimately, no single method emerges as
universally optimal, and the choice of approach de-
pends strongly on application requirements, particu-
larly whether broader coverage or higher precision is
prioritized.
6 CONCLUSIONS AND FUTURE
WORK
This work presented a comparative evaluation of four
entity matching approaches for linking product cat-
egories between the GPC ontology and the Produc-
tKG. Lemmatization, WordNet, word embeddings,
and LLMs were implemented independently and as-
sessed based on their ability to detect semantically
equivalent classes.
The results show that each method has specific
strengths and weaknesses. LLMs achieved the highest
F1-score due to their ability to capture deep semantic
relationships, but their lower precision indicates a ten-
dency to overgeneralize. Lemmatization yielded per-
fect precision and is suitable for applications where
accuracy is critical, though it struggled with seman-
tically related but lexically different terms. WordNet
offered a balanced trade-off, while word embeddings
performed poorly in both recall and precision.
No single method proved best in all cases, sug-
gesting that the optimal choice depends on the spe-
cific goals of the application.
Future work will include a more detailed analy-
sis of threshold effects for WordNet and word embed-
dings, as well as an investigation into common pat-
terns among false positives and false negatives. More
advanced matching systems will also be explored. In
addition, hybrid methods that combine the strengths
of different approaches, such as pairing LLMs with
lemmatization or WordNet filtering, may improve re-
sults. Evaluating the impact of such combinations on
F1-score could lead to more effective and practical so-
lutions. Furthermore, analysing the effectiveness of the techniques on other ontology-based datasets could also give more insight into their real-world applicability.
ACKNOWLEDGEMENTS
This work was partially funded by the central re-
search development fund of the University of Bremen
as well as the German Research Foundation DFG, as
part of CRC (SFB) 1320 “EASE - Everyday Activ-
ity Science and Engineering”, University of Bremen
(http://www.ease-crc.org/). The research was con-
ducted in subproject P1 “Embodied semantics for ev-
eryday activities”.
REFERENCES
Aanen, S. S., Vandic, D., and Frasincar, F. (2015). Auto-
mated product taxonomy mapping in an e-commerce
environment. Expert Systems with Applications,
42(3):1298–1313.
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pas¸ca, M.,
and Soroa, A. (2009). A study on similarity and re-
latedness using distributional and WordNet-based ap-
proaches. In Proceedings of Human Language Tech-
nologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computa-
tional Linguistics on - NAACL ’09, page 19, Boulder,
Colorado. Association for Computational Linguistics.
Almeida, F. and Xexéo, G. (2023). Word embeddings: A survey. arXiv:1901.09069 [cs.CL].
Barlaug, N. and Gulla, J. A. (2021). Neural networks for en-
tity matching: A survey. ACM Trans. Knowl. Discov.
Data, 15(3).
Beetz, M., Stelter, S., Beßler, D., Dhanabalachandran,
K., Neumann, M., Mania, P., and Haidu, A. (2022).
Robots Collecting Data: Modelling Stores, pages 41–
64. Springer International Publishing, Cham.
Bird, S., Loper, E., and Klein, E. (2009). Natural Language
Processing with Python. O’Reilly Media Inc.
Christen, P. (2012). Data Matching: Concepts and Tech-
niques for Record Linkage, Entity Resolution, and Du-
plicate Detection. Springer Publishing Company, In-
corporated, pages 12–34.
Cohen, W., Ravikumar, P., and Fienberg, S. (2003). A
comparison of string metrics for matching names and
records. In Kdd workshop on data cleaning and object
consolidation, volume 3, pages 73–78.
Dooley, D. M., Griffiths, E. J., Gosal, G. S., Buttigieg,
P. L., Hoehndorf, R., Lange, M. C., Schriml, L. M.,
Brinkman, F. S., and Hsiao, W. W. (2018). Foodon:
a harmonized food ontology to increase global food
traceability, quality control and data integration. npj
Science of Food, 2(1):23.
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S.
(2007). Duplicate record detection: A survey. IEEE
Transactions on Knowledge and Data Engineering,
19(1):1–16.
Farouk, M. (2018). Sentence Semantic Similarity based on
Word Embedding and WordNet. In 2018 13th Interna-
tional Conference on Computer Engineering and Sys-
tems (ICCES), pages 33–37.
Fellbaum, C. (2010). WordNet. In Poli, R., Healy, M.,
and Kameas, A., editors, Theory and Applications of
Ontology: Computer Applications, pages 231–243.
Springer Netherlands, Dordrecht.
GS1 (2015). Global Product Classification (GPC) Develop-
ment & Implementation Guide. GS1. Issue 8, Final,
December 2022.
GS1 (2024a). Global Product Classification (GPC). GS1.
https://www.gs1.org/standards/gpc.
GS1 (2024b). How is gpc developed and maintained?
https://support.gs1.org/support/solutions/articles/
43000734258-how-is-gpc-developed-and-maintained-.
Accessed: 2025-09-15.
GS1 (2025a). Gs1 gdsn. https://www.gs1.org/services/
gdsn. Accessed: 2025-09-15.
GS1 (2025b). How gpc works. https://www.gs1.org/
standards/gpc/how-gpc-works. Accessed: 2025-09-
15.
GS1 (2025c). How gs1 gdsn works. https://www.gs1.org/
services/gdsn/how-gdsn-works. Accessed: 2025-09-
15.
Gurevych, I. and Strube, M. (2004). Semantic Similar-
ity Applied to Spoken Dialogue Summarization. In
COLING 2004: Proceedings of the 20th International
Conference on Computational Linguistics, pages 764–
770, Geneva, Switzerland. COLING.
Jatnika, D., Bijaksana, M. A., and Suryani, A. A. (2019).
Word2Vec Model Analysis for Semantic Similari-
ties in English Words. Procedia Computer Science,
157:160–167.
Kenter, T. and De Rijke, M. (2015). Short Text Similar-
ity with Word Embeddings. In Proceedings of the
24th ACM International on Conference on Informa-
tion and Knowledge Management, pages 1411–1420,
Melbourne Australia. ACM.
Khyani, D., Siddhartha, B., Niveditha, N., and Divya,
B. (2021). An interpretation of lemmatization and
stemming in natural language processing. Journal of
University of Shanghai for Science and Technology,
22(10):350–357.
Köpcke, H. and Rahm, E. (2010). Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197–210.
Kümpel, M. and Beetz, M. (2023). ProductKG: A product knowledge graph for user assistance in daily activities. In FOIS'23: Ontology Showcase and Demonstrations Track, 9th Joint Ontology Workshops (JOWO 2023), co-located with FOIS 2023, 19-20 July, 2023, Sherbrooke, Québec, Canada, volume 3637.
Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015).
From word embeddings to document distances. In
Bach, F. and Blei, D., editors, Proceedings of the
32nd International Conference on Machine Learning,
volume 37 of Proceedings of Machine Learning Re-
search, pages 957–966, Lille, France. PMLR.
Kümpel, M. and Dech, J. (2025). Semantic digital twins for
omni-channel localisation. In Proceedings of the 11th
IFAC MIM Conference on Manufacturing Modelling,
Management and Control.
Kümpel, M., Dech, J., Hawkin, A., and Beetz, M. (2023).
Robotic shopping assistance for everyone: Dynamic
query generation on a semantic digital twin as a basis
for autonomous shopping assistance. In Proceedings
of the 22nd International Conference on Autonomous
Agents and Multiagent Systems (AAMAS 2023), pages
2523–2525, London, United Kingdom.
Kümpel, M., Mueller, C. A., and Beetz, M. (2021). Se-
mantic digital twins for retail logistics. In Freitag, M.,
Kotzab, H., and Megow, N., editors, Dynamics in Lo-
gistics: Twenty-Five Years of Interdisciplinary Logis-
tics Research in Bremen, Germany, pages 129–153.
Springer International Publishing, Cham.
Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., and
Pfister, H. (2014). Upset: Visualization of intersecting
sets. IEEE Transactions on Visualization and Com-
puter Graphics, 20(12):1983–1992.
Meng, L., Huang, R., and Gu, J. (2013). A review of se-
mantic similarity measures in wordnet. International
Journal of Hybrid Information Technology, 6(1):1–12.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Narayan, A., Chami, I., Orr, L., Arora, S., and Ré, C.
(2022). Can foundation models wrangle your data?
arXiv preprint arXiv:2205.09911.
OpenAI (2024). Chatgpt (gpt-4o, may 2024 version). https:
//chat.openai.com. Large language model.
Peeters, R., Steiner, A., and Bizer, C. (2023). Entity
matching using large language models. arXiv preprint
arXiv:2310.11244.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.
Řehůřek, R. and Sojka, P. (2010). Software Framework
for Topic Modelling with Large Corpora. In Proceed-
ings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks, pages 45–50, Valletta, Malta.
ELRA.
Sahlgren, M. (2008). The distributional hypothesis. Italian
Journal of linguistics, 20:33–53.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In Proceedings of
the 31st International Conference on Neural Informa-
tion Processing Systems, NIPS’17, pages 6000–6010,
Red Hook, NY, USA. Curran Associates Inc.
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B.,
Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D.,
Metzler, D., et al. (2022). Emergent abilities of large
language models. arXiv preprint arXiv:2206.07682.
Zhang, Z., Groth, P., Calixto, I., and Schelter, S.
(2024). AnyMatch – efficient zero-shot entity match-
ing with a small language model. arXiv preprint
arXiv:2409.04073.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y.,
Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023).
A survey of large language models. arXiv preprint
arXiv:2303.18223, 1(2).
Zhu, G. and Iglesias, C. A. (2018). Exploiting semantic
similarity for named entity disambiguation in knowl-
edge graphs. Expert Systems with Applications,
101:8–24.