BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution

Nicholas Caporusso, Alina Campan, Ayush Bhandari, Stephen Kroeger, Sarita Gautam

2025

Abstract

Named entity resolution (NER) comprises several steps, including canonicalization, aggregation, and validation, each of which addresses a distinct challenge. However, NER research is hindered by the scarcity of realistic, labeled corpora that capture the spelling noise and brand proliferation found in data from multiple sources, from e-commerce to social media. In this paper, we introduce the Brand Name Entity Resolution Dataset (BrandNERD), an extensive dataset of real-world brand names extracted from a high-traffic retail marketplace. BrandNERD consists of multiple datasets along the entity resolution pipeline: raw surface forms, unique canonical entities, similarity clusters, validated brands, and a lookup table that reconciles multiple canonical forms with a list of validated preferred brand labels. In addition to the dataset, our contribution includes an analysis of how well various text similarity measures suit the brand NER task, the processing algorithms used in each step of the resolution process, and user interfaces and data visualization tools for manual review. The result is a modular, fully reproducible, and extensible pipeline that reflects the complete NER workflow. BrandNERD is released as a public repository that contains the dataset and processing pipeline for over 390,000 raw brand names. The repository is continuously updated with new data and improved NER algorithms, making it a living resource for research in marketing and machine learning and an enabler of more complex downstream tasks such as entity disambiguation and brand sentiment analysis.
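
To make the pipeline stages named in the abstract concrete (raw surface forms, canonical entities, similarity clusters, validated brands, lookup table), the following is a minimal Python sketch. It is not the authors' implementation: the function names, the sample brand strings, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration, and the abstract does not state which measures or algorithms BrandNERD actually uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def canonicalize(surface: str) -> str:
    """Hypothetical canonicalization: strip accents, lowercase,
    replace punctuation/symbol runs with one space, trim the ends."""
    text = unicodedata.normalize("NFKD", surface)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def cluster(canonicals, threshold=0.85):
    """Greedy clustering (an assumption, not the paper's method):
    a name joins the first cluster whose first member is similar
    enough, otherwise it starts a new cluster."""
    clusters = []
    for name in sorted(set(canonicals)):
        for group in clusters:
            if SequenceMatcher(None, name, group[0]).ratio() >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def build_lookup(clusters, validated):
    """Map each canonical form to the validated label found in its
    cluster, or to None when the cluster has no validated member."""
    lookup = {}
    for group in clusters:
        label = next((n for n in group if n in validated), None)
        for name in group:
            lookup[name] = label
    return lookup


# Illustrative raw surface forms (made up for this sketch).
raw = ["Nike", "NIKE ", "Nikee", "nike.", "Adidas", "adidas®", "Addidas"]
canonicals = [canonicalize(s) for s in raw]
clusters = cluster(canonicals)
# The validated set stands in for the manual review step.
lookup = build_lookup(clusters, validated={"nike", "adidas"})
```

In this sketch the validated set plays the role of the manual review stage: a cluster with no validated member maps to `None`, which marks it as still needing human review rather than guessing a preferred label.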

Paper Citation


in Harvard Style

Caporusso N., Campan A., Bhandari A., Kroeger S. and Gautam S. (2025). BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution. In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; SciTePress, pages 481-488. DOI: 10.5220/0013830600004000


in Bibtex Style

@conference{kdir25,
author={Nicholas Caporusso and Alina Campan and Ayush Bhandari and Stephen Kroeger and Sarita Gautam},
title={BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution},
booktitle={Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2025},
pages={481-488},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013830600004000},
isbn={},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - BrandNERD: An Extensive Brand Dataset and Analysis Pipeline for Named Entity Resolution
SN -
AU - Caporusso N.
AU - Campan A.
AU - Bhandari A.
AU - Kroeger S.
AU - Gautam S.
PY - 2025
SP - 481
EP - 488
DO - 10.5220/0013830600004000
PB - SciTePress