
mation used to evaluate the methodology proposed in
(Wang et al., 2012).
The availability of high-quality benchmark
datasets has been crucial for advancing entity resolution
research. Lovett et al. (2014) made available
a set of 700 of the top
U.S. national brands from 16 categories, together with a large
number of descriptive characteristics (such as brand
personality, satisfaction, age, complexity, and brand
equity). This dataset is appropriate for marketing
research, but an ER approach is not proposed in
this work. Jin et al. (2020)
introduced a dataset of 1,437,812 images that
contain brands and 50,000 images without brands.
The images containing brands are annotated with
brand name and logo information.
Lamm and Keuper (2023) released the first
publicly available large-scale dataset for visual entity
matching. They provide 786,000 manually annotated
product images containing around 18,000 different
retail products, which are grouped into about 3,000
entities. The annotation of these products is based
on a price comparison task, where each entity forms
an equivalence class of comparable products. The
Database Group at Leipzig University (Christophides
et al., 2020) has contributed several widely used
benchmark datasets for binary entity resolution evaluation.
These include the Amazon-Google Products
dataset (Christophides et al., 2020; Saeedi et al.,
2021), containing 1,363 Amazon entities and 3,226
Google products with 1,300 known matches, and the
Abt-Buy dataset, comprising 1,081 Abt and 1,092 Buy entities
with
1,097 matching pairs. These datasets have become
standard evaluation benchmarks, focusing primarily
on general product matching rather than brand-
specific resolution challenges. Additional specialized
datasets have emerged for different aspects of entity
resolution. The Web Data Commons project has
created training and test sets for large-scale product
matching using schema.org marked-up data from
e-commerce websites, covering product categories
including computers, cameras, watches, and shoes.
However, despite this variety of available datasets,
none specifically address the unique challenges of
brand name resolution with the scale and real-world
complexity required for comprehensive evaluation.
3 BrandNERD
BrandNERD addresses the critical challenge of named entity
resolution (NER) for brand names. It provides
over 394,000 unique raw brand names extracted
from a high-traffic retail marketplace, which makes
it significantly larger than existing brand-focused
datasets, together with a lookup table that pairs surface
names with their resolved names. As a result,
BrandNERD provides researchers with a large-scale
dataset that can be utilized for developing and bench-
marking machine learning approaches for text sim-
ilarity, clustering, and resolution tasks, as a trusted
source of resolved brand names for sentiment analy-
sis, or as a disambiguation tool for obtaining unique
product information in the context of auction and e-commerce
websites. In addition, the BrandNERD
pipeline implements a comprehensive, modular workflow
consisting of six main steps. Although the
pipeline itself does not introduce any novel contribution,
it lets researchers intervene at any step
of the process to replace
or extend the algorithms.
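To make the two ideas above concrete (a lookup table from surface names to resolved names, and a workflow whose steps can be swapped), the following is a minimal sketch and not the BrandNERD implementation. The step functions, the `Step` type, and the sample strings are all hypothetical, and simple whitespace and case normalization stands in for the actual resolution algorithms, which are documented in the repository.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical step signature (an assumption for illustration): a step takes
# a list of names and returns a list of the same length, transformed.
Step = Callable[[List[str]], List[str]]


def collapse_whitespace(names: List[str]) -> List[str]:
    """Trim each name and collapse internal runs of whitespace."""
    return [" ".join(name.split()) for name in names]


def casefold_names(names: List[str]) -> List[str]:
    """Case-insensitive normalization (a stand-in, not real resolution)."""
    return [name.casefold() for name in names]


def run_pipeline(names: Iterable[str], steps: List[Step]) -> List[str]:
    """Apply each step in order; any step can be replaced or extended."""
    current = list(names)
    for step in steps:
        current = step(current)
    return current


def build_lookup(surface_names: List[str], steps: List[Step]) -> Dict[str, str]:
    """Pair every surface name with the name produced by the pipeline."""
    resolved = run_pipeline(surface_names, steps)
    return dict(zip(surface_names, resolved))


# Made-up surface names, used only to exercise the sketch.
surface = ["  Acme ", "ACME", "acme  tools"]
lookup = build_lookup(surface, [collapse_whitespace, casefold_names])
print(lookup["ACME"])
```

Because each step is just a function over a list of names, replacing `casefold_names` with a different normalizer, or appending a further step, requires no change to `run_pipeline` or `build_lookup`.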
BrandNERD, including its datasets and algorithms,
is hosted in a public GitHub repository available
at https://bit.ly/3VCc2Sn, and is continuously curated
and expanded by the research team. In addition, the
repository contains detailed technical documentation
about the algorithms in the pipeline, the format of the
datasets, and other information that we could not in-
clude in this paper. The dataset is released under the
Creative Commons BY 4.0 license, so researchers are
free to download, fork, integrate, and redistribute the
corpus and code, provided they give appropriate at-
tribution. The dataset is continuously updated as new
data are processed along the pipeline. Also, the repos-
itory accepts pull requests to encourage community-
driven enhancements and continuous expansion.
3.1 NER Processing, Pipeline, and Tools
3.1.1 Data Acquisition
The list of raw brand names was acquired from the
publicly available product pages of an online mar-
ketplace primarily featuring consumer products also
available on various popular e-commerce websites,
including Amazon, Walmart, Poshmark, Home De-
pot, and Target. The name of the data source is kept
undisclosed to protect the business’s anonymity, and
any identifiers linking brands with the original marketplaces
have been removed from the dataset. The list was
also pre-processed to remove other irrelevant
information and standardize the data. However, BrandNERD
does not include data acquisition tools, as the
pipeline assumes that the surface names have already
been acquired. By working with brand strings in isolation,
that is, without assuming access to product descriptions,
model numbers, category tags, or any other
BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution