
standard partition similarity metrics for evaluation.
Singleton entities—those absent from the pairwise
ground truth—are included as degenerate equivalence
classes to complete the partition. Following the alge-
braic model of ER, the equivalence classes induced
by ER correspond to sets of references identifying
the same real-world entity. The pairwise-to-clustered
ground truth transformation is performed using the
union-find data structure (Tarjan, 1979). This method
is efficient: optimizations such as union by rank and
path compression guarantee near-constant amortized
time per operation.
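The pairwise-to-clustered transformation can be sketched as follows. This is a minimal illustration, not our released code: the names `UnionFind` and `pairs_to_partition` are hypothetical, and we assume the ground truth is given as a list of matching reference pairs plus the full list of references, so that unmatched references surface as singleton classes.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with union by rank and path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def add(self, x):
        # Register x as its own singleton class if it has not been seen yet.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find(self, x):
        self.add(x)
        # First pass: locate the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited node at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


def pairs_to_partition(references, matching_pairs):
    """Turn pairwise ground truth into a set of equivalence classes."""
    uf = UnionFind()
    for ref in references:
        uf.add(ref)  # references without any match stay singletons
    for a, b in matching_pairs:
        uf.union(a, b)
    classes = defaultdict(set)
    for ref in list(uf.parent):
        classes[uf.find(ref)].add(ref)
    return {frozenset(members) for members in classes.values()}
```

For example, `pairs_to_partition(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])` groups `a`, `b`, and `c` into one class by transitivity and leaves `d` as a degenerate class of its own.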
Weakly Connected Components. The connected
components (CC) algorithm (Tarjan, 1971) is a stan-
dard clustering method for undirected graphs and is
a widely adopted baseline in ER pipelines that rely
on undirected reference graphs (Saeedi et al., 2017;
Papadakis et al., 2023). Weakly connected compo-
nents (WCC) is an almost identical algorithm target-
ing directed graphs (Graham et al., 1972). Both al-
gorithms form clusters by grouping nodes connected
by a path—CC in undirected graphs, and WCC in di-
rected graphs by ignoring edge direction. They use
graph traversal to assign all reachable nodes from
each unvisited node to the same cluster. These al-
gorithms are deterministic, non-parametric, and scale
linearly with the number of nodes and edges, mak-
ing them suitable for large-scale ER applications. Be-
cause they treat all edges as symmetric (CC explicitly,
and WCC by disregarding direction), they are unable
to exploit directional signals that may emerge from
asymmetric matchers. Both algorithms are readily
available in the networkx (Hagberg et al., 2008) li-
brary, which we use in our implementation.
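To make the traversal concrete, the sketch below computes weakly connected components with a breadth-first search over a direction-free adjacency map. It is a plain-Python illustration only; the function name and the `(source, target)` edge format are assumptions made for this example, and it is not the networkx routine used in our implementation.

```python
from collections import deque


def weakly_connected_components(edges, nodes=()):
    """Cluster nodes of a directed graph while ignoring edge direction.

    edges: iterable of (source, target) pairs.
    nodes: optional extra nodes that may have no edges at all.
    """
    # Build an undirected adjacency map: direction is deliberately dropped.
    adjacency = {}
    for node in nodes:
        adjacency.setdefault(node, set())
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    visited = set()
    clusters = []
    for start in adjacency:
        if start in visited:
            continue
        # Everything reachable from an unvisited node forms one cluster.
        visited.add(start)
        cluster = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    cluster.add(neighbour)
                    queue.append(neighbour)
        clusters.append(cluster)
    return clusters
```

Each node and each edge is touched a constant number of times, which is where the linear scaling comes from. Reversing every edge leaves the output unchanged, which illustrates why this family of algorithms cannot exploit directional signals.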
CENTER Clustering. The CENTER algo-
rithm (Haveliwala et al., 2000), based on C-
LINK (Hochbaum and Shmoys, 1985), is designed
for large-scale ER on web data modeled as directed
graphs, where directional similarity scores are
preserved and used in clustering. The algorithm
treats each similar pair as a directed edge and forms
clusters by designating the first node to appear in
a scan over the edge list as a cluster center. All
other nodes appearing in outgoing edges from this
center are assigned to its cluster, ensuring all cluster
nodes are directly reachable in the graph from the
center via a similarity edge. CENTER is efficient
and deterministic, requiring only a single pass over
the edge list and producing compact, star-shaped
clusters. Its reliance on edge direction makes it espe-
cially suitable in cases where asymmetric similarity
measures are meaningful. Our implementation of
this algorithm, called Parent CENTER (PC), uses a
path-compression technique to assign nodes to their
most similar reachable predecessor in an iterative
manner. The algorithm is implemented using the
networkx library (Hagberg et al., 2008) and is
available on GitHub (see footnote 1).
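The basic CENTER scan described above can be sketched as follows. This is an illustration of the baseline idea only, not our Parent CENTER variant. Several details are assumptions made for this example: edges are given as `(source, target, similarity)` triples, the scan visits them in descending order of similarity, and any node that never receives an assignment ends up as a singleton cluster.

```python
def center_clustering(weighted_edges):
    """Single-pass, star-shaped clustering over directed similarity edges.

    weighted_edges: iterable of (source, target, similarity) triples.
    Returns a dict mapping each cluster center to the set of its members.
    """
    # Assumption: the scan order is descending similarity (stable for ties).
    ordered = sorted(weighted_edges, key=lambda e: e[2], reverse=True)

    center_of = {}   # node -> center of the cluster it belongs to
    centers = set()  # nodes that were designated as cluster centers
    seen = []        # every node encountered, in scan order

    for src, dst, _ in ordered:
        seen.extend((src, dst))
        if src not in center_of:
            # First appearance of src as an unassigned source: new center.
            centers.add(src)
            center_of[src] = src
        if src in centers and dst not in center_of:
            # Outgoing edge from a center to an unassigned node.
            center_of[dst] = src
        # All other edges are skipped, which keeps clusters star-shaped.

    # Assumption: nodes left unassigned by the scan become singletons.
    for node in seen:
        center_of.setdefault(node, node)

    clusters = {}
    for node, center in center_of.items():
        clusters.setdefault(center, set()).add(node)
    return clusters
```

Because only outgoing edges from a center can attach a node, reversing the direction of the input edges generally produces a different partition, in contrast to CC and WCC.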
Markov Clustering. The Markov clustering (MCL)
algorithm (Van Dongen, 2008) is a graph cluster-
ing technique that simulates random walks to iden-
tify regions of dense connectivity. It has been ap-
plied across various domains to cluster similar en-
tity references based on probabilistic flow patterns.
MCL operates on the graph’s transition matrix, apply-
ing two key operations iteratively: expansion, which
models the spread of flow over multiple steps, and
inflation, which sharpens the clustering signal by
strengthening high-probability transitions and sup-
pressing weaker ones. This process continues until
convergence, yielding a block structure in the matrix
that corresponds to the final clusters. The algorithm is
deterministic, highly scalable, and flexible through its
inflation parameter, which controls cluster granular-
ity. Although commonly used with undirected graphs,
MCL naturally supports directed graphs, since flow
inherently respects edge direction. In our evaluation,
we apply MCL to directed reference graphs using the
implementation provided by the markov_clustering
library (Allard, 2025). We use the default parameter
settings of the library for our experiments.
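The expansion and inflation steps can be illustrated with a small dense-matrix sketch. It is not the library implementation we use in our experiments, and the function names and parameter values are chosen for this example only. We assume a column-stochastic convention, where `weights[i][j]` is the weight of flow from node j to node i, and we follow the common practice of adding self-loops and squaring the matrix during expansion.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. flow spread over two steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns.

    Strong transitions gain weight relative to weak ones.
    """
    return normalize_columns([[v ** r for v in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Run a basic MCL loop on a square, non-negative weight matrix."""
    n = len(weights)
    # Assumption: add self-loops before normalizing, as is commonly done.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:  # matrix no longer changes: converged
            break
    # Rows with a non-zero diagonal entry are attractors; the non-zero
    # columns of an attractor row are the members of its cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return clusters
```

Larger values of `inflation` sharpen the flow more aggressively and tend to yield finer-grained clusters, which is the granularity control mentioned above. For directed input the weight matrix is simply not symmetric; the loop itself is unchanged.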
4 EXPERIMENTS
The primary goal of our experiment is to empirically
evaluate matcher asymmetry, specifically asymmetry
related to the order in which inputs are presented at
runtime. While asymmetry can arise from various
factors (e.g., input data type or record attribute
processing order), we focus on runtime input order
versus training input order because the data is
textual: the matcher accepts free-text inputs, although
it was trained as described in Section 3.1. As
described there, the comparison space consists of se-
quential pairs of entity references fed to the matcher
individually. This space results from preprocessing
steps such as entity extraction, blocking, and filtering,
as established in prior work (Papadakis et al., 2020).
The benchmark datasets we use define a fixed com-
parison space, simplifying reproducibility. We refer
to the original order of entity pairs in this space as the
normal pair order. By reversing the order within each
pair, we construct a reversed pair order. To evaluate
input order asymmetry, we apply the matcher and the
earlier-described clustering algorithms to both orders
1 https://github.com/matchescu/clustering/
On the Asymmetrical Nature of Entity Matching Using Pre-Trained Transformers