
intensive NLP (Lewis et al., 2020), achieving high re-
call on MS MARCO (Nguyen et al., 2016).
ColPali (Faysse et al., 2024) extends the foundational
principles of ColBERT to the multimodal domain,
specifically targeting document retrieval that inte-
grates visual information. ColPali processes visual
documents, such as PDFs, by decomposing them
into multiple image patches and generating high-
dimensional embeddings for each patch. This is anal-
ogous to how ColBERT handles text tokens, allowing
for fine-grained matching between visual queries and
document patches. ColPali has demonstrated supe-
rior performance in tasks requiring multimodal under-
standing, such as visual question answering and doc-
ument understanding, by effectively leveraging both
textual and visual cues. Nevertheless, by inherit-
ing the multi-vector nature of ColBERT, ColPali also
faces substantial challenges related to massive stor-
age requirements (due to the large number of high-
dimensional patch embeddings) and increased com-
putational overhead during retrieval, especially when
deployed at web scale. These inherent limitations,
particularly the storage footprint and retrieval latency,
are the primary motivations behind our development
of HPC-ColPali, which aims to mitigate these effi-
ciency concerns while preserving the high retrieval
quality characteristic of ColPali.
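The fine-grained late-interaction matching that ColPali inherits from ColBERT can be sketched in a few lines of NumPy (a minimal illustration; `maxsim_score` is our name, not an API of either system):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, patch_embs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query vector, take its
    maximum cosine similarity over all document patch vectors, then sum."""
    # Normalize rows so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    sim = q @ d.T                         # (n_query, n_patches) similarities
    return float(sim.max(axis=1).sum())   # best patch per query vector, summed
```

Because every query vector is compared against every patch vector, both the number of stored patch embeddings and the number of comparisons grow with document size, which is exactly the storage/latency pressure discussed above.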
2.2 Embedding Quantization Techniques
Embedding quantization is a critical technique for re-
ducing the memory footprint and accelerating sim-
ilarity search in high-dimensional vector spaces, a
necessity for large-scale information retrieval sys-
tems. Product Quantization (PQ) (J
´
egou et al.,
2011) is one of the most widely adopted methods in
this domain. PQ works by partitioning the original
high-dimensional vector space into several indepen-
dent sub-spaces. Each sub-vector within these sub-
spaces is then quantized independently by mapping it
to a centroid in its respective sub-space. The orig-
inal high-dimensional vector is thus represented as
a compact concatenation of these centroid indices.
This method allows for remarkable compression ra-
tios, often achieving 90–97% storage savings with
only minor accuracy degradation. Libraries such
as FAISS (Facebook AI Similarity Search) provide
highly optimized implementations of PQ and its vari-
ants, including hybrid indexes like IVF-ADC, which
are extensively used for large-scale approximate near-
est neighbor (ANN) search. Variants like Optimized
PQ (OPQ) (Ge et al., 2013) further reduce distortion
in ColBERT-like systems, with less than 1% MAP
loss (Chen et al., 2021).
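The PQ pipeline just described (sub-space split, per-sub-space k-means, centroid-index codes) can be sketched with plain NumPy; the function names and the toy Lloyd's-iteration trainer below are ours for illustration, not FAISS's implementation:

```python
import numpy as np

def train_pq(X, m=4, k=256, iters=10, seed=0):
    """Learn one k-centroid codebook per sub-space via plain k-means."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    ds = D // m                                   # dimensions per sub-space
    codebooks = []
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, size=k, replace=False)].copy()  # init from data
        for _ in range(iters):
            # Assign each sub-vector to its nearest centroid, then recenter.
            d2 = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        codebooks.append(cent)
    return codebooks

def pq_encode(X, codebooks):
    """Represent each vector as m one-byte centroid indices."""
    ds = codebooks[0].shape[1]
    codes = np.empty((X.shape[0], len(codebooks)), dtype=np.uint8)
    for j, cent in enumerate(codebooks):
        sub = X[:, j * ds:(j + 1) * ds]
        d2 = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(1)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct approximate vectors by concatenating looked-up centroids."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(len(codebooks))])
```

With m = 4 sub-quantizers and k ≤ 256 centroids each, a 128-dimensional float32 vector (512 bytes) is encoded in 4 bytes plus the shared codebooks, consistent with the compression range cited above.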
Our work in HPC-ColPali leverages K-Means
clustering as a fundamental component for vector
quantization. By clustering the dense patch embed-
dings into K centroids, we effectively replace the orig-
inal high-dimensional float vectors with compact 1-
byte code indices. This process directly contributes to
the substantial compression ratios observed in HPC-
ColPali. While advanced PQ techniques often in-
volve multiple sub-quantizers and more complex en-
coding schemes, our approach focuses on a single-
stage K-Means quantization for its simplicity, inter-
pretability, and direct control over the compression
factor. This design choice allows for a clear analysis
of the trade-offs between compression and accuracy,
and can serve as a foundation for future extensions to
more intricate hierarchical PQ schemes.
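The single-stage scheme described above can be sketched as follows (illustrative names, not the actual HPC-ColPali code). With K ≤ 256 centroids, each patch collapses to one uint8 code, so a 128-dimensional float32 patch embedding shrinks from 512 bytes to 1 byte plus the shared codebook; scoring can then be done asymmetrically, against centroids rather than reconstructed patch vectors:

```python
import numpy as np

def quantize_patches(patch_embs, centroids):
    """Replace each float patch embedding with the index (uint8) of its
    nearest K-Means centroid; requires K <= 256 for one-byte codes."""
    d2 = ((patch_embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1).astype(np.uint8)

def approx_maxsim(query_embs, codes, centroids):
    """Asymmetric MaxSim: query-to-centroid similarities are computed once
    per centroid, then gathered by code -- no float patch vectors needed
    at query time."""
    sim_to_centroids = query_embs @ centroids.T   # (n_query, K)
    sim = sim_to_centroids[:, codes]              # (n_query, n_patches)
    return float(sim.max(axis=1).sum())
```

Because the gather step only indexes a (n_query, K) table, query cost depends on K and the number of retained patches rather than on the original embedding precision.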
2.3 Attention-Based Token/Patch Pruning
The advent of Transformer architectures has brought
unprecedented performance in various AI tasks, but
often at the cost of significant computational re-
sources. To address this, dynamic token or patch
pruning has emerged as an effective strategy, partic-
ularly relevant for Vision Transformers (ViTs) and
other attention-heavy models. Models like DynamicViT (Rao et al., 2021) have demonstrated that not
all input tokens or patches contribute equally to the fi-
nal model prediction. By analyzing the internal atten-
tion mechanisms, which inherently capture the impor-
tance or salience of different parts of the input, these
methods can dynamically identify and discard less in-
formative tokens or patches during inference. This
selective processing leads to substantial reductions in
computational cost, with reported gains of 60% com-
pute reduction and minimal impact on accuracy (e.g.,
less than 1% accuracy drop) (Rao et al., 2021).
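Attention-guided top-p% selection of this kind reduces to a sort-and-slice over per-patch salience scores. A minimal sketch (names and the `keep_pct` parameter are illustrative, not taken from any of the cited systems):

```python
import numpy as np

def prune_patches(patch_embs, attn_weights, keep_pct=30.0):
    """Keep only the top keep_pct% of patches by attention weight.

    patch_embs:   (n_patches, dim) patch embeddings from the encoder
    attn_weights: (n_patches,) attention score per patch
    Returns the retained embeddings and their original indices.
    """
    n_keep = max(1, int(len(attn_weights) * keep_pct / 100))
    top = np.argsort(attn_weights)[::-1][:n_keep]  # indices, most salient first
    return patch_embs[top], top
```

Every discarded patch removes one column from the late-interaction similarity matrix, which is where the reported compute savings come from.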
HPC-ColPali adopts a similar philosophy by em-
ploying an attention-guided dynamic pruning mecha-
nism specifically tailored for image patches in multi-
modal documents. During query processing, the Vi-
sion Language Model (VLM) encoder not only gener-
ates patch embeddings but also provides correspond-
ing attention weights for each patch. Our pruning
strategy leverages these weights by sorting patches
based on their attention scores in descending order
and retaining only the most salient top p% of patches.
This intelligent selection directly reduces the number
of patch-wise comparisons required during the late
interaction phase, thereby decreasing the computa-
tional burden and accelerating query latency without
significantly compromising retrieval quality. The pa-
KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval