BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution
Nicholas Caporusso (https://orcid.org/0000-0002-8661-4868), Alina Campan (https://orcid.org/0000-0002-9296-3036), Ayush Bhandari, Stephen Kroeger and Sarita Gautam
Northern Kentucky University, Highland Heights, Kentucky, U.S.A.
Keywords: Named Entity Resolution, Brand Canonicalization, String Pattern Matching, Text Mining.
Abstract: Named entity resolution (NER) comprises several steps to address multifaceted challenges, including canonicalization, aggregation, and validation. Nonetheless, NER research is hindered by the scarcity of realistic, labeled corpora that capture the spelling noise and brand proliferation found in data from multiple sources, from e-commerce to social media. In this paper, we introduce the Brand Name Entity Resolution Dataset (BrandNERD), an extensive dataset of real-world brand names extracted from an existing high-traffic retail marketplace. BrandNERD consists of multiple datasets along the entity resolution pipeline: raw surface forms, unique canonical entities, similarity clusters, validated brands, and a lookup table reconciling multiple canonical forms with a list of validated preferred brand labels. In addition to the BrandNERD dataset, our contribution includes an analysis of the adequacy of various text similarity measures for the brand NER task at hand, the processing algorithms used in each step of the resolution process, and user interfaces and data visualization tools for manual reviews, resulting in a modular, fully reproducible, and extensible pipeline that reflects the complete NER workflow. BrandNERD, which is released as a public repository, contains the dataset and processing pipeline for over 390,000 raw brand names. The repository is continuously updated with new data and improved NER algorithms, making it a living resource for research in marketing and machine learning, and for enabling more complex downstream tasks such as entity disambiguation and brand sentiment analysis.
1 INTRODUCTION
Named entity resolution (NER) constitutes a criti-
cal information extraction task that seeks to locate
and classify named entities mentioned in unstructured
text into predefined categories. The field has tradi-
tionally focused on detecting and classifying entities
such as persons, organizations, and locations within
well-structured text corpora. However, the scope of
NER extends beyond simple classification to encom-
pass the more complex challenge of entity resolu-
tion, which involves identifying when different tex-
tual mentions refer to the same real-world entity.
NER still represents a fundamental challenge
in modern natural language processing, particularly
when dealing with brand names, which exhibit signif-
icant variation in their surface forms across different
platforms and data sources. The challenge is com-
pounded by the inherent noise present in real-world
data, including spelling variations, abbreviations, and
inconsistent formatting practices that are prevalent
in e-commerce and social media environments. As
brand surface form variations can be extensive and
context-dependent, disambiguation becomes essential
for accurate entity linking.
Brand identifiers often mutate in spelling, length,
and even language as they move across markets and
media. A single company may appear as an acronym
(“IBM”), its full legal name (“International Business
Machines”), or a colloquial nickname (“Big Blue”).
Typos (“Addidas”) and stylistic tinkering with punc-
tuation or capitalization (“McDonald’s” vs. “Mcdon-
alds”) cause the same entity to manifest in dozens of
orthographic guises. Furthermore, individual digital
venues might introduce naming quirks. For instance,
e-commerce catalogs aggregate or shorten brand and
model names into SKUs (e.g., “Nike AM2090”) and
use localized names (“Unilever Indonesia”). Also,
brand landscapes are anything but static, with thousands of new labels debuting each year, while legacy giants periodically rebrand (e.g., “Dunkin’” dropping “Donuts”, or Facebook morphing into Meta) or consolidate as a result of mergers and acquisitions (e.g., “WhatsApp by Facebook”), leaving labels that
no longer match current registries.
While there are existing methods for handling
name disambiguation in other contexts, such as clean-
ing up corporate registries or standardizing author
names in bibliographic databases, these methods of-
ten rely on external validation sources or verified
dictionaries. NER corpora, while valuable for gen-
eral entity recognition tasks, lack the comprehensive
coverage needed for brand name resolution and fail
to handle diverse naming conventions, abbreviations,
and the continuous evolution of brand nomenclature
in digital spaces. Contemporary brand NER research
faces significant limitations due to the scarcity of real-
istic, labeled corpora that adequately capture the com-
plexity of brand names. This is particularly pressing
in cases where a brand has a narrower scope than a
trademark, which makes conventional resources like
the USPTO database (Office of Public Affairs (OPA), 2025) unsuitable, as the database may not filter out service-
oriented or expired trademarks and can thus present
too many potential matches that introduce further
confusion. As a result, researchers must devise proce-
dures to systematically recognize, reconcile, and clus-
ter variations.
In this paper, we present BrandNERD, an ex-
tensive open-source named entity resolution dataset
and pipeline specifically focusing on brands. Built
from an online retail marketplace featuring brands
also available on popular stores and e-commerce plat-
forms, BrandNERD contains: (1) over 390,000 raw
brand names harvested from product listings; (2) a
curated set of canonical entities produced using ex-
tensible canonicalization rules; (3) similarity cluster-
ing tools to expose near-duplicate names; (4) a val-
idation corpus derived from name search and man-
ual review; (5) a lookup table that maps competing
canonical forms to a preferred, manually vetted brand
label. BrandNERD also includes the tools accompa-
nying the entire NER pipeline, organized in a modular
and reproducible set of steps and including convenient
user interfaces for data exploration and visualization.
2 RELATED WORK
NER has been extensively studied across multiple re-
search communities, and the foundational tasks of en-
tity resolution have been well-established in the litera-
ture. The authors of (Brizan and Tansel, 2006; Saeedi
et al., 2021) provide a systematic overview of the
processing steps and execution strategies, identifying
three primary components: deduplication (i.e., elim-
inating exact duplicate copies), record linkage (i.e.,
identifying records referencing the same entity across
different sources), and canonicalization (i.e., convert-
ing data into standardized forms). These core tasks
form the backbone of most ER pipelines and directly
relate to the challenges addressed in brand name res-
olution, where surface form variations must be stan-
dardized and linked to canonical representations. Dif-
ferent approaches have been proposed to tackle each
task in the NER pipeline. For instance, a recent litera-
ture review (Barlaug and Gulla, 2021) surveyed deep
learning techniques for traditional ER challenges.
Product entity resolution has emerged as a spe-
cialized domain within the broader ER landscape,
with unique challenges arising from the heteroge-
neous nature of product descriptions across different
platforms. In (Vermaas et al., 2014), the authors pro-
posed an ontology-based approach for product entity
resolution on the web, using the descriptive power
of product ontologies to improve matching accuracy.
Their method employed domain-specific similarity
measures for different product feature types and uti-
lized genetic algorithms for feature weight optimiza-
tion, achieving F1-measures of 59% and 72% across
different product categories. This work demonstrates
the importance of domain-specific approaches to en-
tity resolution, particularly relevant for brand name
matching, where product context significantly influ-
ences resolution accuracy. Unfortunately, in many
scenarios, context information is unavailable or un-
reliable. A study from Microsoft Research (Liu et al.,
2007) addressed the challenge of resolving product
feature references within customer reviews. Their ap-
proach combined edit distance measures with con-
text similarity to group references related to the same
product feature, highlighting the importance of con-
textual information in product-related entity resolu-
tion tasks. (Jin et al., 2020) used a combination of
neural network models and human annotators to de-
tect and label with brand information a set of over
1.4 million images. Although the work falls into the
broader scope of NER, it tackles a different set of
challenges. The study focuses on extracting brand
names from images instead of processing text. Fur-
thermore, it involves brand recognition rather than
resolution tasks. Wang et al. (Wang et al., 2012)
presented a hybrid human-machine approach to en-
tity resolution. The proposed workflow uses machine-
based techniques to find pairs or clusters of records
likely to refer to the same entity. Then, only these
most likely matches are crowd-sourced to humans
for review. This approach saves resources while still
leading to accurate results. While our approach to ER
is also hybrid, there are some differences. The data
in our ER problem consists of texts that are usually
shorter than product descriptions or restaurant infor-
mation used to evaluate the methodology proposed in
(Wang et al., 2012).
The availability of high-quality benchmark
datasets has been crucial for advancing entity reso-
lution research. Lovett et al. (Lovett et al., 2014) made available a set of 700 of the top U.S. national brands from 16 categories and a large number of descriptive characteristics (such as brand personality, satisfaction, age, complexity, and brand equity). This dataset is appropriate for marketing research, but an ER approach is not proposed in that work. The work in (Jin et al., 2020) introduced a dataset of 1,437,812 images that
contain brands and 50,000 images without brands.
The images containing brands are annotated with
brand name and logo information. The authors
of (Lamm and Keuper, 2023) released the first
publicly available large-scale dataset for visual entity
matching. They provide 786,000 manually annotated
product images containing around 18,000 different
retail products, which are grouped into about 3,000
entities. The annotation of these products is based
on a price comparison task, where each entity forms
an equivalence class of comparable products. The
Database Group at Leipzig University (Christophides
et al., 2020) has contributed several widely-used
benchmark datasets for binary entity resolution eval-
uation. These include the Amazon-Google Products
dataset (Christophides et al., 2020; Saeedi et al.,
2021), containing 1,363 Amazon entities and 3,226
Google products with 1,300 known matches, and the
Abt-Buy dataset comprising 1,081 and 1,092 entities
from the respective e-commerce platforms with
1,097 matching pairs. These datasets have become
standard evaluation benchmarks, focusing primarily
on general product matching rather than brand-
specific resolution challenges. Additional specialized
datasets have emerged for different aspects of entity
resolution. The Web Data Commons project has
created training and test sets for large-scale product
matching using schema.org marked-up data from
e-commerce websites, covering product categories
including computers, cameras, watches, and shoes.
However, despite this variety of available datasets,
none specifically address the unique challenges of
brand name resolution with the scale and real-world
complexity required for comprehensive evaluation.
3 BrandNERD
BrandNERD addresses the critical challenge of NER
for brand names by providing an extensive brand
dataset of over 394,000 unique raw brand names ex-
tracted from a high-traffic retail marketplace, mak-
ing it significantly larger than existing brand-focused
datasets, together with a lookup table that pairs sur-
face names with their resolved names. By doing this,
BrandNERD provides researchers with a large-scale
dataset that can be utilized for developing and bench-
marking machine learning approaches for text sim-
ilarity, clustering, and resolution tasks, as a trusted
source of resolved brand names for sentiment analy-
sis, or as a disambiguation tool for obtaining unique
product information in the context of auction and e-
commerce websites. In addition, the BrandNERD
pipeline implements a comprehensive, modular work-
flow consisting of six main steps. Although the pipeline itself does not introduce any novel algorithmic contribution, it makes it convenient for researchers to intervene at any step of the process and replace or extend the algorithms.
BrandNERD, including its datasets and algo-
rithms, lives in a public GitHub repository available
at https://bit.ly/3VCc2Sn, and is constantly curated
and expanded by the research team. In addition, the
repository contains detailed technical documentation
about the algorithms in the pipeline, the format of the
datasets, and other information that we could not in-
clude in this paper. The dataset is released under the
Creative Commons BY 4.0 license, so researchers are
free to download, fork, integrate, and redistribute the
corpus and code, provided they give appropriate at-
tribution. The dataset is continuously updated as new
data are processed along the pipeline. Also, the repos-
itory accepts pull requests to encourage community-
driven enhancements and continuous expansion.
3.1 NER Processing, Pipeline, and Tools
3.1.1 Data Acquisition
The list of raw brand names was acquired from the
publicly available product pages of an online mar-
ketplace primarily featuring consumer products also
available on various popular e-commerce websites,
including Amazon, Walmart, Poshmark, Home De-
pot, and Target. The name of the data source is kept
undisclosed to protect the business’s anonymity, and
any identifiers linking brands with the original mar-
ketplaces have been removed from the dataset. Also,
although the list was pre-processed to remove other ir-
relevant information and standardize the data, Brand-
NERD does not include data acquisition tools, as the
pipeline assumes that the surface names have already
been acquired. By working with brand strings in iso-
lation, that is, without assuming access to product de-
scriptions, model numbers, category tags, or any other
curated metadata that might simplify matching, our methodology deliberately tackles the worst-case scenario in brand NER.

Figure 1: BrandNERD’s processing pipeline.
3.1.2 Canonicalization
After acquisition and pre-processing, the first step
in our NER pipeline is canonicalization. This pro-
cess reconciles multiple surface names with a unique
canonical name, and it involves the following steps:
1. Cleaning, where names are sanitized to remove characters that are not usually part of a brand, such as single and double quotes, encoding-specific non-alphanumerical symbols (e.g., Unicode characters), or company designators (e.g., LLC, INC, and CO) that do not commonly appear in the brand name.
2. Normalization, which involves processing the
sanitized name to remove all non-alphanumerical
characters, including whitespace.
To this end, we initially analyzed the dataset of ac-
quired brands to define a set of rules that address most
cleaning and normalization requirements. Our tool
implements canonicalization rules using regular ex-
pressions, which enable parsing strings efficiently and
maintain a high level of interpretability. Given the it-
erative nature of NER, downstream processing steps
(i.e., similarity clustering, name search, validation,
and review) might inform new rules that can be con-
veniently implemented in the script to further clean or
normalize surface names into more effective canoni-
cal forms. Subsequently, the list can be processed to
aggregate multiple surface forms with their canonical
form.
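For illustration, the following minimal Python sketch shows how such rules could be implemented with regular expressions; the patterns are simplified examples rather than the repository's full rule set, and the uppercasing of canonical forms is an assumption based on the examples in this paper.

    import re

    # Simplified example rules; the repository's actual rule set is more extensive.
    DESIGNATORS = re.compile(r"\b(LLC|INC|CO)\b\.?", re.IGNORECASE)

    def clean(name: str) -> str:
        """Cleaning: strip quotes, encoding-specific symbols, and company designators."""
        name = re.sub(r"['\"]", "", name)               # single and double quotes
        name = name.encode("ascii", "ignore").decode()  # non-ASCII (e.g., Unicode) symbols
        return DESIGNATORS.sub("", name).strip()

    def normalize(name: str) -> str:
        """Normalization: remove all non-alphanumerical characters, including whitespace."""
        return re.sub(r"[^0-9a-zA-Z]", "", name).upper()

    def canonicalize(name: str) -> str:
        return normalize(clean(name))

    # e.g., canonicalize("McDonald's, Inc.") == "MCDONALDS"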
3.1.3 Similarity Clustering
The second step in our processing pipeline consists of clustering surface or canonical brand names by their similarity. In addition to surface forms, multiple canonical names can also relate to the same entity. Thus, as a first step toward resolution, we identify similar canonical names that refer to the same entity.
To this end, we considered several established
text similarity algorithms, including Levenshtein and
Damerau-Levenshtein (i.e., based on edit distance),
Jaro-Winkler, the Jaccard and Sørensen–Dice coeffi-
cients on bi- and tri-grams, cosine similarity on TF-
IDF character n-grams, the phonetic pair Soundex
and Double-Metaphone, the probabilistic Monge-
Elkan composite scorer, and the affine-gap Needle-
man–Wunsch sequence aligner. Subsequently, we im-
plemented and compared four algorithms particularly
suitable for our scenario, that is, Levenshtein distance,
Jaro-Winkler, phonetic (metaphone) similarity, and
cosine similarity with sentence transformer embed-
dings. Also, we evaluated two combinations of these
measures: a hybrid combination of phonetic similar-
ity and Levenshtein distance, and a hybrid combina-
tion of sentence transformer embeddings and Leven-
shtein distance.
To evaluate the performance of these metrics and select one, we manually validated 786 brand names that served as ground truth for comparing how accurately each text metric retrieves the target brand name. For
each surface name, we found the three most similar
matches according to each text measure from a set of
80,902 validated (through automated searches on re-
tail websites) brand names. We operated with the top
three matches rather than the top match only, because
the best match according to a text comparison metric
is not guaranteed to actually be the name to which the
candidate brand needs to be resolved. The chances
of finding the correct resolution name increase when
considering more top matches, as shown below in Ta-
ble 1. Also, in practice, the text comparison metrics make enough errors that the brand resolution process cannot be fully automated; instead, resolution is performed manually, with each brand name selected from a set of the highest-ranked verified target names, rather than from the single highest match, according to a text metric.
The Jaro-Winkler similarity was the most accurate
Table 1: Percentage of target matches found at each rank
(Top 1–3) or missed entirely using various similarity meth-
ods: JW = Jaro-Winkler, Lev = Levenshtein, Phon = Pho-
netic, Cosine = Cosine with sentence embeddings, P+E =
Hybrid (Phonetic + Edit), E+E = Hybrid (Embedding +
Edit).
Method   #1     #2     #3     Not in Top 3
JW       74.81  14.25  5.09   5.85
Lev      61.58  13.49  4.96   19.97
Phon     13.99  1.15   0.00   84.86
Cosine   52.04  14.12  7.63   26.21
P+E      58.65  7.00   3.18   31.17
E+E      56.23  14.38  4.71   24.68
in matching the surface forms in our test dataset to
their validated name, as shown in Table 1. This table
reports the percentage of candidate brand names in
our test dataset, for which the target resolution name
was found as the first, second, and third match, re-
spectively, or when the target brand was not found
among the closest three matches, when using the re-
spective text distance or similarity measure.
Based on this experimental evaluation, we
decided to use the Jaro-Winkler similarity in our
validation tool for calculating pairwise name similar-
ity. One of the advantages of this text metric is that
strings having similar initial segments are assigned a
higher similarity score, which is especially suitable
for brand names.
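For instance, retrieving the top three candidates for a surface name can be sketched as follows, assuming the rapidfuzz library (any Jaro-Winkler implementation could be substituted).

    # A sketch of top-3 candidate retrieval with Jaro-Winkler similarity,
    # assuming the rapidfuzz library; validated_names stands for the list
    # of 80,902 validated brand names mentioned above.
    from rapidfuzz import process
    from rapidfuzz.distance import JaroWinkler

    def top_matches(surface_name, validated_names, k=3):
        # Returns up to k (candidate, score, index) tuples, best first.
        return process.extract(
            surface_name,
            validated_names,
            scorer=JaroWinkler.normalized_similarity,
            limit=k,
        )

    # e.g., top_matches("Addidas", ["ADIDAS", "AUDI", "ADDI"])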
To better understand the relationship between
brand name resolution and the pairwise text similarity
of brand names, we constructed and analyzed a graph
representing these connections. Specifically, we con-
structed a multigraph G = (B, (S, R)), where B is the set of all brand names; S is a set of undirected edges connecting brand name pairs with Jaro-Winkler similarity ≥ 0.9, with each edge labeled with the respective similarity value; and R is a set of directed edges, where each edge connects a source node, which is one of the 786 brand names that were resolved manually, to a target node, which is the verified brand name that
that not all nodes that had an edge in R also had a cor-
responding edge in S. For example, the surface name
“PHILIPS SONICARE” was resolved to the verified
brand name “PHILIPS”, but the two brand names did
not have similarity ≥ 0.9, and therefore they were not
connected by an edge in S. There were 38 surface
names in the set of validated brands in the test dataset
that were resolved to verified brand names with which
they had a Jaro-Winkler similarity of < 0.9. For the
remaining 748 resolved brand names, we compared
their neighborhoods in S and R, and we found, as ex-
pected, that the resolution for surface names, among the list of verified brand names, is not always the match with the highest Jaro-Winkler similarity. Figure 2 illustrates one of these situations, where the brand name “Colgate Wisp” was resolved to the verified name “Colgate”, with which it has a similarity of 0.93, although there is another brand, “Colgates”, with which it has a higher similarity of 0.95. In this case, the more similar brand was not a verified name, but that situation could occur in a large dataset with many similar verified brand names.

Figure 2: Renamed Nodes’ Neighborhood.
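The situation in Figure 2 can be reproduced with a small sketch of this graph, assuming the networkx library; the 0.97 score between “Colgate” and “Colgates” is an illustrative value, while the other two scores are those reported above.

    import networkx as nx

    # Toy reconstruction of G = (B, (S, R)); the 0.97 score is illustrative.
    brands = ["Colgate", "Colgates", "Colgate Wisp"]
    sim_pairs = [("Colgate", "Colgates", 0.97),
                 ("Colgate Wisp", "Colgates", 0.95),
                 ("Colgate Wisp", "Colgate", 0.93)]
    resolutions = [("Colgate Wisp", "Colgate")]   # manual resolution edge

    S = nx.Graph()    # undirected similarity edges (Jaro-Winkler >= 0.9)
    R = nx.DiGraph()  # directed edges: surface name -> verified brand name
    S.add_nodes_from(brands)
    R.add_nodes_from(brands)
    for a, b, score in sim_pairs:
        S.add_edge(a, b, similarity=score)
    R.add_edges_from(resolutions)

    def top_s_neighbor(name):
        """Most similar neighbor of a node in S."""
        nbrs = S[name]
        return max(nbrs, key=lambda n: nbrs[n]["similarity"])

    # The verified R-target ("Colgate") is not the most similar S-neighbor:
    assert top_s_neighbor("Colgate Wisp") == "Colgates"
    assert next(iter(R.successors("Colgate Wisp"))) == "Colgate"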
Another confirmation that high text similarity is
not foolproof and cannot be used alone to determine
clusters of similar names was obtained by looking at
connected components in the graph (B, S) above. One
such connected component, which consists of nodes
that have pairwise similarity over 0.9, is shown in Fig-
ure 3, where it is visible how some well-formed clus-
ters of similar names are combined together by weak
edges forming incorrect bridges between these dense
clusters.
An essential aspect of similarity clustering is the
choice of a similarity threshold, which is particularly
relevant for the subsequent NER steps. We empiri-
cally found 0.9 as a similarity threshold that captures
significant but reasonably small clusters. Also, such
a high threshold helps identify edge cases that re-
veal the need for additional rules. For instance, clus-
tering identified the presence of a cluster with more
than 3,000 surface names in the form of “VISIT THE
[BRANDNAME] STORE”, a typical pattern that we
addressed with a canonicalization rule.
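Oversized components of this kind can be flagged programmatically; a minimal sketch, reusing the S graph from the previous snippet (the size cutoff is an illustrative value, not the one used in our pipeline):

    # Flag suspiciously large connected components in the similarity graph S.
    large = [c for c in nx.connected_components(S) if len(c) > 100]
    for component in sorted(large, key=len, reverse=True):
        print(len(component), sorted(component)[:5])  # size and a few member names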
Our scripts support the integration of other text
metrics, instead of Jaro-Winkler. For example, (1) the
weighted-trigram Jaccard could be utilized to shrink
the impact of boilerplate tokens (e.g., “Manufactur-
ing”, “Products”) and let the distinctive parts of a
brand name dominate the similarity score, and (2) the
Inverse Document Frequency (IDF) can be used to in-
crease similarity in case of typos. However, we de-
cided not to implement these at this stage because validating them would have required a larger corpus. Also, our
normalization process, particularly removing white
space, might have an impact on the IDF calculation.
3.1.4 Name Search
The third step of our NER pipeline consists of using
an external source of information to check whether
the brand exists or not. As most brands involve com-
mercial products sold on popular e-commerce plat-
forms, we decided to use a web search engine to query
the brand names in their surface forms and analyze
the results. To this end, we developed a JavaScript
application for NodeJS environments that leverages
the Selenium package to control an instance of the
Chrome browser. Specifically, we utilized browser
automation to query the Brave search engine for sur-
face names. After storing the first 10 results, the ap-
plication verifies whether the brand has a presence on
popular online e-commerce websites (e.g., Amazon,
Walmart, or Home Depot) to validate its canonical
name automatically. Although APIs could be utilized for this purpose, the large number of brands in the dataset and the cost of API calls make them impractical at this scale. Also, we used Brave
as a search engine instead of more popular ones (e.g.,
Google search) because it provides easier search au-
tomation over a large number of queries. In our pro-
cessing pipeline, name search and similarity cluster-
ing can be executed simultaneously because they con-
sume separate input and produce independent output
data.
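As noted above, the released tool is a JavaScript application for NodeJS; for consistency with the other sketches in this paper, the following Python/Selenium fragment illustrates the same flow. The query URL and the link selector are assumptions that may need adjusting to the search engine's current markup.

    from urllib.parse import quote_plus
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Retailer domains used for the quick storefront check (illustrative subset).
    RETAILERS = ("amazon.", "walmart.", "homedepot.")

    def search_brand(driver, surface_name, max_results=10):
        driver.get("https://search.brave.com/search?q=" + quote_plus(surface_name))
        # Collect candidate result links; a real tool would use a more precise selector.
        links = driver.find_elements(By.CSS_SELECTOR, "a[href]")[:max_results]
        urls = [a.get_attribute("href") or "" for a in links]
        has_store = any(r in u for u in urls for r in RETAILERS)
        return urls, has_store

    driver = webdriver.Chrome()  # requires a local Chrome installation
    urls, has_store = search_brand(driver, "Addidas")
    driver.quit()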
3.1.5 Brand Validation
The goal of this step is to determine whether any of
the surface forms of a canonical name represent a
valid brand based on the results obtained from the
search engine (i.e., if the brand actually exists). In
addition to the checks realized during name search,
which allow for a quick evaluation of whether the
brand has a store on popular e-commerce websites,
this phase performs a more in-depth analysis of the
search results. To this end, the validation process
utilizes automated string processing and manual re-
view. First, a series of algorithms tokenize the text
and URLs to identify the most common strings and
compare them with a brand’s surface and canonical
names. Then, similarity algorithms are utilized to
identify clusters around brand names, automatically
processing the ones with sufficiently high similar-
ity. Subsequently, the results are reviewed manually through a user interface where the reviewer can mark
the brand as valid, invalid, or unsure. In addition to
confirming the validity of brands, this step also helps
identify brands initially flagged as invalid due to in-
correct spelling, use of uncommon surface forms, or
not having a storefront on e-commerce websites (e.g.,
brands sold outside digital marketplaces). This pro-
cess also helps discover new rules or correct brand
names that were not initially part of the dataset. Con-
sequently, new canonical names can be added, and the
upstream pipeline and datasets are updated accord-
ingly.
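A simplified sketch of the automated portion of this analysis is shown below, assuming the rapidfuzz library; the tokenization rule, the token budget, and the cutoffs are illustrative choices rather than the pipeline's actual parameters.

    import re
    from collections import Counter
    from rapidfuzz.distance import JaroWinkler

    def auto_validate(canonical, results):
        """results: (title, url) pairs collected during the name-search step."""
        tokens = Counter()
        for title, url in results:
            tokens.update(re.findall(r"[a-z0-9]+", (title + " " + url).lower()))
        # Compare the most common tokens against the canonical name.
        best = max(
            (JaroWinkler.normalized_similarity(canonical.lower(), tok)
             for tok, _ in tokens.most_common(20)),
            default=0.0,
        )
        if best >= 0.9:
            return 1    # valid
        if best < 0.5:
            return -1   # invalid
        return 0        # unsure: routed to manual review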
3.1.6 Resolution
In this step, similar canonical forms representing the
same brand name are aggregated to validated brand
names. This task is similar to what is realized in
step one after canonicalization, where multiple sur-
face forms are automatically associated with the same
canonical form. However, resolution involves an-
alyzing the semantics of different canonical names
(and their associated raw labels) to address typos
and misspellings and reconcile brands using multiple
names (e.g., “Samsung” and “Samsung electronics”,
or “Starbucks” and “Starbucks coffee”). To this end,
for each similarity cluster found in the second step
of our pipeline, our algorithm compares the similarity
score between the central node and its direct connec-
tions (one-hop neighbors) with the similarity between
each of the adjacent nodes and their two-hop neigh-
bors (excluding the central node), to obtain a rank-
ing of the best matches. Moreover, the algorithm as-
signs a different weight to edges connecting invalid
or valid brands. A user interface presents this infor-
mation to a reviewer who can manually confirm the
results. The output of this process is a lookup table
with pairs where canonical names representing an in-
valid brand or being one of the less common canoni-
cal representations of a valid brand name (i.e., lookup
value) are reconciled with their corresponding valid
and most representative canonical name (i.e., target
value).
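A sketch of this ranking logic is shown below, reusing the similarity graph S from the earlier snippets and assuming a validity dictionary produced by the validation step; the weights and the penalty factor are illustrative values, not the pipeline's actual parameters.

    # Illustrative weights for edges to valid (1), unsure (0), invalid (-1) brands.
    VALIDITY_WEIGHT = {1: 1.0, 0: 0.8, -1: 0.5}

    def rank_targets(center, S, validity):
        """Rank a cluster center's one-hop neighbors as resolution targets."""
        scores = {}
        for nbr in S[center]:
            score = S[center][nbr]["similarity"] * VALIDITY_WEIGHT[validity.get(nbr, 0)]
            # Penalize neighbors more strongly tied to their own two-hop
            # neighbors (excluding the center) than to the center itself.
            two_hop = [S[nbr][n]["similarity"] for n in S[nbr] if n != center]
            if two_hop and max(two_hop) > S[center][nbr]["similarity"]:
                score *= 0.9
            scores[nbr] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)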
3.1.7 Manual Review
In the final step, results are reviewed manually to
check for potential errors. This is done via a user in-
terface where an agent can observe a sample of the
dataset and confirm or reject the results of the valida-
tion and resolution. The output of the user interface is a list of canonical names where each brand resolution is flagged as correct, incorrect, or needing review.
3.2 Dataset
As the dataset is designed to continuously grow, the
numerical details about the dataset mentioned later in
this section reflect the situation at the time of writing.
3.2.1 Raw Brand Names
The dataset contains a total of 394,542 unique raw
brand names. This number reflects publicly available
surface names collected from the websites mentioned
earlier. After obtaining the list using web scraping
techniques, we removed syntactically invalid names,
including 389 brand names that consisted of numbers
only. Raw brand names appear sorted by item count
on the platform in descending order (the brands with
the most item occurrences appear first).
3.2.2 Canonical Names
The canonicalized dataset comprises 376,613 unique cleaned surface names and 368,703 unique canonical names (93.45% of the raw names), as 25,839 brand names from the original dataset were merged with their canonical forms after applying canonicalization rules. Canonical names are sorted following the same criterion as their raw surface forms, even though canonicalization could inherently disrupt this sorting, for instance, when a canonical name aggregates raw brand names from both the top and the bottom of the list.
As shown in Table 2, 360,919 canonical names
(approximately 97.8%) are associated with only one
surface name, 7,668 (2.1%) with two, 108 (0.03%)
with three, 7 (0.002%) with four, and just one
canonical name (0.0003%) is linked to six different
surface names, with an average surface name count
per canonical name of 1.02±0.15. In addition to
individual canonical names, the dataset also retains
the relationship with the original surface names
to maintain consistency and reference across the
datasets.
Table 2: Canonical name counts by surface name count (descending).

surfaceNameCount   canonicalCount
6                  1
4                  7
3                  108
2                  7668
1                  360919
3.2.3 Similarity Clusters
This dataset contains 782,299 pairs featuring 273,845 unique canonical names (74.27% of the total) showing a Jaro-Winkler similarity score higher than 0.9 with other brands. The dataset consists of two components: one with one-way similarity matches with unique ⟨brand_i, brand_j⟩ pairs, and one with two-way matches, that is, with occurrences featuring both ⟨brand_i, brand_j⟩ and ⟨brand_j, brand_i⟩ pairs, together with their similarity scores.
This dataset can be used to identify misspellings, re-
solve brand names having different canonical forms,
or identify new canonicalization rules. Additionally,
this dataset can serve as a benchmark for other sim-
ilarity clustering algorithms to enhance the accuracy
of identifying false positives and false negatives.
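For example, the two-way component can be collapsed into the one-way component by ordering each pair canonically; a minimal sketch, assuming rows of (brand_i, brand_j, similarity) tuples:

    def one_way(pairs):
        """Collapse two-way matches into unique one-way pairs."""
        seen = {}
        for a, b, sim in pairs:
            key = tuple(sorted((a, b)))   # order-independent pair identity
            seen.setdefault(key, sim)
        return [(a, b, sim) for (a, b), sim in seen.items()]

    rows = [("ADIDAS", "ADDIDAS", 0.95), ("ADDIDAS", "ADIDAS", 0.95)]
    assert len(one_way(rows)) == 1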
3.2.4 Brand Search Results
This dataset consists of the following components:
- The list of 368,703 canonical names (100%) checked with the search engine.
- The list of 72,303 brand names (19.61% of the total) found through web search that are potentially valid candidates, comprising the canonical name, the raw brand name used as the search term (i.e., query), and the first URL resulting from a match.
- For each canonical name checked with the search engine, the list of the top 10 results obtained from the web search, including the title, source URL, and short description. The results obtained for each canonical name are sorted using the same criterion used by the search engine, that is, by relevance.
3.2.5 Validated Brands
This dataset contains the list of 32,114 brand names (8.7% of the total) that were processed in the validation step, where each canonical name is associated with a value of 1 if the brand is validated, -1 if
the brand was not validated, and zero if the canonical name could not be determined to be valid or not. 31,622 brand names (i.e., 98.47% of those processed) were found to be valid, while the remaining were invalid.
3.2.6 Entity Resolution Lookup Tables
This dataset contains:
- The list of 768 canonical names (2.40% of the total) processed.
- The list of 824 pairs of canonical brand names where a brand name was reconciled to another canonical form.
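Applying the lookup table downstream reduces to a dictionary lookup with an identity fallback; a minimal sketch with hypothetical entries in the spirit of the examples in Section 3.1.6:

    # Hypothetical lookup entries; real keys are canonical forms from the dataset.
    lookup = {
        "SAMSUNGELECTRONICS": "SAMSUNG",
        "STARBUCKSCOFFEE": "STARBUCKS",
    }

    def resolve(canonical_name):
        # Names absent from the table are already their preferred form.
        return lookup.get(canonical_name, canonical_name)

    assert resolve("STARBUCKSCOFFEE") == "STARBUCKS"
    assert resolve("NIKE") == "NIKE"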
4 CONCLUSION
Our work aims to address a long-standing gap in
brand-oriented NER research by offering a large-
scale, openly licensed dataset and pipeline that mir-
rors the complexity of commercial data. By anchor-
ing BrandNERD in nearly 400,000 raw surface forms
scraped from a high-traffic marketplace, we provide
a benchmark whose size, noise profile, and contin-
ual growth far exceed those of prior corpora, offering
scholars and practitioners an authentic testbed for re-
search and experimentation in various tasks of interest
to several scientific communities.
BrandNERD is a large-scale, open-source dataset
for brand NER, supported by an end-to-end, modu-
lar workflow for disambiguating, deduplicating, and
validating brand names. The framework combines in-
terpretable rule-based canonicalization, modular sim-
ilarity metrics, browser-automated web search, and
user interfaces and data visualization tools for hu-
man curation. The dataset includes multiple intercon-
nected components: 376,613 cleaned surface names
mapped to 368,703 canonical names, 782,299 simi-
larity pairs covering 273,845 unique canonical names,
search results for all canonical names with 72,303 po-
tentially valid candidates identified, 32,114 validated
brands (with 98.47% confirmed as valid), and 824 en-
tity resolution pairs in lookup tables (with new data
being added regularly), resulting in the most extensive
datasets currently available for research. The GitHub
repository of the project is continuously updated and
shared under Creative Commons licensing, providing
researchers with an authentic, large-scale testbed that
reflects the complexity and noise profile of real com-
mercial brand data.
In our future work, we will (1) finalize valida-
tion of all current canonical entries; (2) manually re-
view brands to expand the lookup table into a gold standard; (3) refine canonicalization rules and trial
misspelling-aware similarity measures; (4) enrich res-
olution by incorporating item-level descriptions from
auction listings; and (5) explore density-based clus-
tering and centrality metrics on the similarity graph
to surface latent brand groups and identify authorita-
tive canonical labels.
REFERENCES
Barlaug, N. and Gulla, J. A. (2021). Neural networks for
entity matching: A survey. ACM Transactions on
Knowledge Discovery from Data (TKDD), 15(3):1–
37.
Brizan, D. G. and Tansel, A. U. (2006). A survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5.
Christophides, V., Efthymiou, V., Palpanas, T., Papadakis,
G., and Stefanidis, K. (2020). An overview of end-
to-end entity resolution for big data. ACM Computing
Surveys (CSUR), 53(6):1–42.
Jin, X., Su, W., Zhang, R., He, Y., and Xue, H. (2020).
The open brands dataset: Unified brand detection and
recognition at scale. In ICASSP 2020-2020 IEEE In-
ternational Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 4387–4391. IEEE.
Lamm, B. and Keuper, J. (2023). Retail-786k: a large-
scale dataset for visual entity matching. arXiv preprint
arXiv:2309.17164.
Liu, J.-J., Cao, Y.-B., and Huang, Y.-L. (2007). Effective
entity resolution in product review domain. In 2007
International Conference on Machine Learning and
Cybernetics, volume 1, pages 106–111. IEEE.
Lovett, M., Peres, R., and Shachar, R. (2014). A data set of
brands and their characteristics. Marketing Science,
33(4):609–617.
Office of Public Affairs (OPA) (2025). United States Patent and Trademark Office.
Saeedi, A., David, L., and Rahm, E. (2021). Matching en-
tities from multiple sources with hierarchical agglom-
erative clustering. In KEOD, pages 40–50.
Vermaas, R., Vandic, D., and Frasincar, F. (2014). An
ontology-based approach for product entity resolu-
tion on the web. In International Conference on
Web Information Systems Engineering, pages 534–
543. Springer.
Wang, J., Kraska, T., Franklin, M. J., and Feng, J. (2012). CrowdER: Crowdsourcing entity resolution. arXiv preprint arXiv:1208.1927.