Hierarchical Patch Compression for ColPali: Efficient Multi-Vector
Document Retrieval with Dynamic Pruning and Quantization
Bach Duong¹,²,ᵃ and Pham Nhat Minh³,ᵇ
¹ FPT University, Hanoi, Vietnam
² Sun Asterisk Inc., Hanoi, Vietnam
³ VNU University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
ᵃ https://orcid.org/0009-0005-7400-3970
ᵇ https://orcid.org/0009-0005-6504-3065
Keywords:
Multi-Vector Retrieval, Document Compression, Vector Quantization, Dynamic Pruning,
Retrieval-Augmented Generation.
Abstract:
Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries
but incur significant storage and computational costs due to their reliance on high-dimensional patch embed-
dings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch
Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our
approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embed-
dings into 1-byte centroid indices, achieving 32× storage reduction; (2) attention-guided dynamic pruning,
utilizing Vision-Language Model attention weights to retain only the top-p% most salient patches, reducing
late-interaction computation by 60% with less than 2% nDCG@10 loss; and (3) optional binary encoding of
centroid indices into b-bit strings (b = log₂ K), enabling rapid Hamming distance-based similarity search
for resource-constrained environments. In domains like legal and financial analysis, where documents contain visual elements (e.g., charts in SEC filings), multi-vector models like ColPali enable precise retrieval but scale poorly; our hierarchical compression is novel in combining VLM attention pruning with quantization. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30–50% lower query latency under HNSW
indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation
pipeline for legal summarization, it reduces hallucination rates by 30% and halves end-to-end latency. These
advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval
across diverse applications. Code is available at https://github.com/DngBack/HPC-ColPali.
1 INTRODUCTION
Late-interaction architectures, such as ColBERT
(Khattab et al., 2021) and its visual counterpart Col-
Pali (Zilliz, 2023), have revolutionized information
retrieval by decomposing queries and documents into
multiple embeddings. This fine-grained matching at
token or patch granularity significantly boosts recall
and domain robustness, making them highly effective
for complex retrieval tasks. However, this expres-
siveness comes at a substantial cost: these models
inflate storage requirements, often demanding thou-
sands of float32 vectors per document, and conse-
quently slow down retrieval, especially when deployed at web scale. The sheer volume of data and
the computational overhead associated with process-
ing these multi-vector representations pose significant
challenges for practical, large-scale applications.
In domains like legal and financial analysis, where
documents contain visual elements (e.g., charts in
SEC filings), multi-vector models like ColPali enable
precise retrieval but scale poorly. This work intro-
duces hierarchical compression, novel in combining
VLM attention pruning with quantization, reducing
costs by 30-50% while preserving accuracy, as vali-
dated on ViDoRe (Faysse et al., 2024).
Prior research has explored various avenues to
mitigate these issues. Embedding compression tech-
niques, such as Product Quantization (PQ) in FAISS
(Jégou et al., 2011), have demonstrated the abil-
ity to compress vectors by 90–97% with only minor accuracy loss. Separately, dynamic token prun-
ing in Vision Transformers (e.g., DynamicViT (Tang
et al., 2023)) has shown that many patches contribute
marginally to final predictions and can be adaptively
dropped based on attention scores, leading to sub-
stantial computational savings. Furthermore, binary
vector representations and Hamming-distance search
have been proposed as efficient alternatives for CPU-
bound retrieval scenarios, particularly for edge de-
vices (Gong and Shi, 2020).
In this paper, we unify these three critical lines
of research into HPC-ColPali, a novel Hierarchical
Patch Compression framework for ColPali. HPC-
ColPali offers a modular and tunable pipeline that in-
telligently trades off storage, computational cost, and
retrieval accuracy to meet diverse deployment con-
straints. Our framework addresses the inherent limita-
tions of multi-vector retrieval by introducing a multi-
stage compression and pruning strategy that main-
tains high retrieval fidelity while drastically reducing
resource consumption.
Our main contributions are summarized as fol-
lows:
Quantization: We apply K-Means cluster-
ing (with K ∈ {128, 256, 512}) to patch em-
beddings, effectively replacing high-dimensional
float vectors with compact 1-byte code indices.
This achieves 32× compression with a minimal
nDCG@10 drop of less than 2%.
Attention-Guided Dynamic Pruning: At query
time, we leverage Vision Language Model
(VLM)-derived attention weights to dynami-
cally rank and retain only the top p% most
salient patches—achieving 60% reduction in late-
interaction compute with negligible retrieval loss.
Optional Binary Encoding: For scenarios de-
manding extreme efficiency, such as on-device
or CPU-only retrieval, we introduce an optional
step that encodes centroid indices into b-bit bi-
naries (b = log₂ K). This enables ultra-fast
Hamming-based similarity search, offering sub-
linear speedups.
RAG Integration: We demonstrate the practi-
cal utility of HPC-ColPali by integrating it into a
Retrieval-Augmented Generation (RAG) pipeline.
Our experiments show a significant reduction in
hallucination rate (by 30%) and a halving of end-
to-end latency on legal summarization tasks, high-
lighting its potential to enhance the efficiency and
factual consistency of LLM-based applications.
The remainder of this paper is organized as fol-
lows: Section 2 reviews related work in multi-vector
retrieval, embedding quantization, dynamic prun-
ing, binary embeddings, and RAG. Section 3 details
the proposed HPC-ColPali framework, including its
quantization, pruning, and binary encoding compo-
nents. Section 4 describes our experimental setup,
including datasets, metrics, and baselines. Section 5
presents and discusses the experimental results. Fi-
nally, Section 6 concludes the paper and outlines di-
rections for future work.
2 RELATED WORK
Our work builds upon several foundational areas in in-
formation retrieval and machine learning, particularly
focusing on efficient multi-vector representations and
their applications. This section reviews the most rele-
vant prior art.
2.1 Multi-Vector Late Interaction
Models
Traditional dense retrieval models typically repre-
sent queries and documents as single, fixed-size vec-
tors, computing similarity using dot products or co-
sine similarity. While computationally efficient, these
models often struggle to capture the nuanced, fine-
grained interactions between query terms and docu-
ment content, leading to suboptimal retrieval perfor-
mance for complex queries. To overcome this lim-
itation, the concept of late interaction has emerged,
allowing for richer comparisons between query and
document representations.
ColBERT (Contextualized Late Interaction over
BERT) (Khattab et al., 2021) pioneered this paradigm
by generating multiple contextualized embeddings for
each token in a query and document. Unlike tradi-
tional methods that produce a single vector per entity,
ColBERT represents a document as a bag of token
embeddings. During retrieval, instead of a single dot
product, the similarity between a query and a docu-
ment is computed by summing the maximum simi-
larity scores between each query embedding and all
document embeddings. This innovative approach sig-
nificantly enhances the expressiveness and accuracy
of retrieval by enabling fine-grained matching at the
token level, leading to state-of-the-art performance
on various text retrieval benchmarks. However, this
expressiveness comes at the cost of significantly in-
creased storage requirements and computational over-
head during the late interaction phase, as thousands
of float32 vectors need to be stored and compared
per document. ColBERT’s late interaction has been
applied in practical applications like real-time web
search (Santhanam et al., 2021) and knowledge-
intensive NLP (Lewis et al., 2020), achieving high re-
call on MS MARCO (Nguyen et al., 2016).
ColPali (Zilliz, 2023) extends the foundational
principles of ColBERT to the multimodal domain,
specifically targeting document retrieval that inte-
grates visual information. ColPali processes visual
documents, such as PDFs, by decomposing them
into multiple image patches and generating high-
dimensional embeddings for each patch. This is anal-
ogous to how ColBERT handles text tokens, allowing
for fine-grained matching between visual queries and
document patches. ColPali has demonstrated supe-
rior performance in tasks requiring multimodal under-
standing, such as visual question answering and doc-
ument understanding, by effectively leveraging both
textual and visual cues. Nevertheless, by inherit-
ing the multi-vector nature of ColBERT, ColPali also
faces substantial challenges related to massive stor-
age requirements (due to the large number of high-
dimensional patch embeddings) and increased com-
putational overhead during retrieval, especially when
deployed at web scale. These inherent limitations,
particularly the storage footprint and retrieval latency,
are the primary motivations behind our development
of HPC-ColPali, which aims to mitigate these effi-
ciency concerns while preserving the high retrieval
quality characteristic of ColPali.
2.2 Embedding Quantization
Techniques
Embedding quantization is a critical technique for re-
ducing the memory footprint and accelerating sim-
ilarity search in high-dimensional vector spaces, a
necessity for large-scale information retrieval sys-
tems. Product Quantization (PQ) (Jégou et al., 2011) is one of the most widely adopted methods in
this domain. PQ works by partitioning the original
high-dimensional vector space into several indepen-
dent sub-spaces. Each sub-vector within these sub-
spaces is then quantized independently by mapping it
to a centroid in its respective sub-space. The orig-
inal high-dimensional vector is thus represented as
a compact concatenation of these centroid indices.
This method allows for remarkable compression ra-
tios, often achieving 90–97% storage savings with
only minor accuracy degradation. Libraries such
as FAISS (Facebook AI Similarity Search) provide
highly optimized implementations of PQ and its vari-
ants, including hybrid indexes like IVF-ADC, which
are extensively used for large-scale approximate near-
est neighbor (ANN) search. Variants like Optimized
PQ (OPQ) (Ge et al., 2013) further reduce distortion
in ColBERT-like systems, with less than 1% MAP
loss (Chen et al., 2021).
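To make the mechanism concrete, the following sketch builds a PQ index with FAISS; the configuration (16 sub-quantizers of 8 bits each over 128-dim random vectors) is illustrative and not drawn from this paper.

```python
# Illustrative Product Quantization with FAISS: 128-dim float32 vectors
# (512 bytes each) are stored as 16-byte PQ codes, a 32x reduction.
import numpy as np
import faiss

d = 128                                  # original dimensionality
m, nbits = 16, 8                         # 16 sub-spaces, 2^8 centroids each
xb = np.random.rand(20000, d).astype("float32")

index = faiss.IndexPQ(d, m, nbits)
index.train(xb)                          # learn one codebook per sub-space
index.add(xb)                            # vectors stored as concatenated centroid indices

D, I = index.search(xb[:5], k=10)        # approximate nearest-neighbor search over codes
```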
Our work in HPC-ColPali leverages K-Means
clustering as a fundamental component for vector
quantization. By clustering the dense patch embed-
dings into K centroids, we effectively replace the orig-
inal high-dimensional float vectors with compact 1-
byte code indices. This process directly contributes to
the substantial compression ratios observed in HPC-
ColPali. While advanced PQ techniques often in-
volve multiple sub-quantizers and more complex en-
coding schemes, our approach focuses on a single-
stage K-Means quantization for its simplicity, inter-
pretability, and direct control over the compression
factor. This design choice allows for a clear analysis
of the trade-offs between compression and accuracy,
and can serve as a foundation for future extensions to
more intricate hierarchical PQ schemes.
2.3 Attention-Based Token/Patch
Pruning
The advent of Transformer architectures has brought
unprecedented performance in various AI tasks, but
often at the cost of significant computational re-
sources. To address this, dynamic token or patch
pruning has emerged as an effective strategy, partic-
ularly relevant for Vision Transformers (ViTs) and
other attention-heavy models. Models like Dynam-
icViT (Tang et al., 2023) have demonstrated that not
all input tokens or patches contribute equally to the fi-
nal model prediction. By analyzing the internal atten-
tion mechanisms, which inherently capture the impor-
tance or salience of different parts of the input, these
methods can dynamically identify and discard less in-
formative tokens or patches during inference. This
selective processing leads to substantial reductions in
computational cost, with reported gains of 60% com-
pute reduction and minimal impact on accuracy (e.g.,
less than 1% accuracy drop) (Rao et al., 2021).
HPC-ColPali adopts a similar philosophy by em-
ploying an attention-guided dynamic pruning mecha-
nism specifically tailored for image patches in multi-
modal documents. During query processing, the Vi-
sion Language Model (VLM) encoder not only gener-
ates patch embeddings but also provides correspond-
ing attention weights for each patch. Our pruning
strategy leverages these weights by sorting patches
based on their attention scores in descending order
and retaining only the most salient top p% of patches.
This intelligent selection directly reduces the number
of patch-wise comparisons required during the late
interaction phase, thereby decreasing the computa-
tional burden and accelerating query latency without
significantly compromising retrieval quality. The pa-
rameter p offers a flexible knob to fine-tune the bal-
ance between computational savings and retrieval ac-
curacy, allowing adaptation to diverse application re-
quirements.
2.4 Binary Embeddings and Hamming
Retrieval
For scenarios demanding extreme computational ef-
ficiency, particularly on resource-constrained devices
or for CPU-only retrieval environments, binary em-
beddings offer a compelling solution. These methods
transform high-dimensional float vectors into highly
compact binary codes, typically representing each di-
mension with a single bit. The primary advantage
of binary embeddings lies in their ability to enable
ultra-fast similarity search using Hamming distance,
which simply counts the number of differing bits be-
tween two binary vectors. Modern CPUs are highly
optimized for bitwise operations, allowing for sub-
linear speedups in Hamming distance calculations,
making this approach exceptionally efficient (Gong
and Shi, 2020; Norouzi et al., 2014).
While many binary hashing methods involve
learning complex, data-dependent hash functions,
our approach in HPC-ColPali provides an optional,
straightforward binary encoding step. After K-Means
quantization, each centroid index is directly converted
into a b-bit binary string. This simple yet effec-
tive conversion allows us to leverage the inherent ef-
ficiency of Hamming distance for similarity search.
This tunable trade-off between compression, speed,
and retrieval accuracy makes the binary mode partic-
ularly beneficial for edge deployments where compu-
tational resources are severely limited.
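As a concrete illustration of why this is fast, the sketch below scores a corpus of packed binary codes against a query with XOR and bit counting; the 64-bit codes and corpus size are illustrative, not the paper's b-bit encodings.

```python
# Hamming distance via XOR + popcount: a sketch on illustrative 64-bit codes.
import numpy as np

rng = np.random.default_rng(0)
codes_db = rng.integers(0, 2**63, size=100_000, dtype=np.uint64)  # corpus codes
query = rng.integers(0, 2**63, dtype=np.uint64)                   # query code

diff = codes_db ^ query                        # XOR marks the differing bits
# Reinterpret each 64-bit word as 8 bytes so np.unpackbits can count set bits.
bits = np.unpackbits(diff.view(np.uint8).reshape(-1, 8), axis=1)
dists = bits.sum(axis=1)                       # Hamming distance per corpus code
top10 = np.argsort(dists)[:10]                 # nearest codes
```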
2.5 Retrieval-Augmented Generation
(RAG)
Retrieval-Augmented Generation (RAG) mod-
els (Lewis et al., 2020) represent a powerful and in-
creasingly popular paradigm that synergistically com-
bines the generative capabilities of large language
models (LLMs) with the factual grounding provided
by external knowledge bases. In a typical RAG setup,
a retriever component first fetches relevant documents
or passages from a vast corpus based on a user query.
These retrieved passages then serve as contextual in-
formation, which is fed to a generative LLM. The
LLM then synthesizes an answer, conditioned on both
the original query and the provided context. This
hybrid approach effectively mitigates common chal-
lenges associated with standalone LLMs, such as hal-
lucination (generating factually incorrect or unsup-
ported information) and the inability to access up-to-
date knowledge, leading to more accurate, consistent,
and attributable responses (Gao et al., 2023; Muen-
nighoff et al., 2024).
HPC-ColPali’s integration into a RAG pipeline
demonstrates its practical utility beyond being a stan-
dalone retrieval system. By providing an efficient and
accurate retrieval mechanism, HPC-ColPali can sig-
nificantly enhance the overall performance of RAG
systems. Our experimental results, particularly in le-
gal summarization tasks, show that employing HPC-
ColPali as the retriever can lead to a substantial re-
duction in hallucination rates and a notable improve-
ment in end-to-end latency. This underscores HPC-
ColPali’s potential to make RAG systems more ro-
bust, responsive, and factually consistent for real-
world applications, especially in knowledge-intensive
domains.
3 METHODOLOGY:
HIERARCHICAL PATCH
COMPRESSION FOR ColPali
(HPC-ColPali)
HPC-ColPali is designed to overcome the inherent
storage and computational bottlenecks prevalent in
multi-vector document retrieval frameworks like Col-
Pali, all while preserving their high retrieval accu-
racy. Our approach seamlessly integrates three pivotal
components: K-Means Quantization for compact rep-
resentation of patch embeddings, Attention-Guided
Dynamic Pruning for efficient query-time processing,
and an Optional Binary Encoding for ultra-fast, CPU-
friendly similarity search. This section provides an
in-depth exposition of the architectural design and the
mechanisms underpinning HPC-ColPali.
3.1 Overview of HPC-ColPali
Architecture
HPC-ColPali functions as an extension to the existing
ColPali framework, intervening at the patch embed-
ding level to introduce efficiency without compromis-
ing performance. During the offline indexing phase,
instead of storing the raw, high-dimensional float32
patch embeddings, a K-Means quantization process is
applied to compress them into compact code indices.
These quantized representations are then indexed us-
ing efficient data structures, specifically either Hier-
archical Navigable Small World (HNSW) graphs for
float-based retrieval or specialized bit-packed struc-
tures when operating in binary mode.
At query time, the incoming user query undergoes
processing by a Vision Language Model (VLM) en-
coder. This step yields not only the query’s patch
embeddings but also their corresponding attention
weights. These attention weights are then utilized
to dynamically prune less important or redundant
patches, thereby reducing the computational load re-
quired for the subsequent late interaction. Finally,
the pruned and quantized query embeddings are em-
ployed to execute a rapid similarity search against
the compressed document index. This is optionally
followed by a re-ranking step to refine the retrieved
results and ensure optimal relevance. Preprocessing
involves rendering PDFs as images at 224x224 res-
olution per patch, following ColPali’s input format
(Faysse et al., 2024).
3.2 K-Means Quantization
The primary compression strategy revolves around re-
placing high-dimensional float vectors with compact,
fixed-size code indices. This is achieved through a K-
Means clustering process. Initially, a comprehensive
set of all patch embeddings X ∈ R^(N×D) is collected from a large training corpus of documents, where N represents the total number of patches across the corpus and D denotes the dimensionality of each individual patch embedding (e.g., D = 128 for a 512-byte float vector). K-Means clustering is then performed on this aggregated set of embeddings to learn K representative centroids, denoted as {c_k}_{k=0}^{K−1}. Each original patch embedding x_i is subsequently quantized by assigning it to its nearest centroid, resulting in a compact 1-byte code index q_i ∈ {0, . . . , K − 1}. This quan-
tization process delivers substantial storage savings.
For instance, with float32 precision and D = 128, each patch embedding originally occupies 512 bytes; replacing these embeddings with compact code indices yields the 32× reduction in embedding storage reported in Section 5. The selection of K (e.g., 128, 256, 512) directly
influences both the achievable compression ratio and
the potential trade-off in retrieval accuracy. A larger
K allows for a more granular representation of the em-
bedding space, which generally leads to higher accu-
racy but yields a lower compression ratio. Conversely,
a smaller K provides greater compression at the po-
tential expense of some accuracy. Empirical analysis,
detailed in Section 5, demonstrates that a judicious
choice of K can achieve significant storage reductions
with minimal impact on retrieval quality.
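A minimal sketch of this offline stage, assuming D = 128, K = 256, and FAISS's built-in K-Means (random vectors stand in for the corpus patch embeddings):

```python
# Offline quantization sketch: learn K centroids, then store each patch
# embedding as a 1-byte centroid index plus one shared (K, D) codebook.
import numpy as np
import faiss

D, K = 128, 256
patches = np.random.rand(500_000, D).astype("float32")  # stand-in for corpus patches

kmeans = faiss.Kmeans(D, K, niter=20, seed=42)
kmeans.train(patches)                       # learn the K representative centroids

_, codes = kmeans.index.search(patches, 1)  # nearest-centroid assignment
codes = codes.ravel().astype(np.uint8)      # fits in 1 byte because K <= 256

codebook = kmeans.centroids                 # (K, D) float32, kept for decoding
reconstructed = codebook[codes]             # lossy decode back to float vectors
```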
3.3 Attention-Guided Dynamic Pruning
Multi-vector late interaction models, while highly ex-
pressive, often incur significant computational costs
due to the need to compute similarities across all
patch embeddings. To mitigate this, an attention-
guided dynamic pruning mechanism operates at query
time. When a query is processed by the VLM en-
coder, it not only generates the query’s patch em-
beddings but also provides a set of corresponding
attention weights α_i for each patch. These attention weights are crucial as they reflect the importance or salience of each patch in the context of the given query. Attention weights α_i are derived from
the VLM’s self-attention layers (e.g., PaliGemma’s
multi-head attention [Beyer et al., 2024]), averaging
across heads. This is VLM-specific but adaptable to
any Transformer-based model with attention outputs.
The dynamic pruning strategy leverages these atten-
tion weights. The document patches are sorted based
on their attention scores in descending order of impor-
tance. Subsequently, only the top p% of these patches
are retained (where p is a tunable parameter, typically
p ∈ {40, 60, 80}). If M represents the original number of patches for a given document, this pruning step ensures that approximately ⌈M · p⌉ patches are scored during the late interaction phase. This selective processing reduces the computational cost, which is typically O(M²) for late interaction, by 60% as demon-
strated in experiments. The parameter p provides a
flexible control knob, allowing system designers to
fine-tune the balance between desired computational
savings and acceptable retrieval accuracy, adapting to
various application requirements and resource con-
straints.
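A minimal sketch of the pruning step, assuming the VLM encoder has already produced per-patch embeddings and scalar attention weights (names and shapes here are illustrative):

```python
# Attention-guided top-p% pruning: keep only the most salient patches.
import numpy as np

def prune_patches(embeddings: np.ndarray, attn: np.ndarray, p: float = 0.6):
    """Retain the ceil(M * p) patches with the highest attention weights."""
    m = embeddings.shape[0]
    keep = max(1, int(np.ceil(m * p)))
    order = np.argsort(attn)[::-1][:keep]    # indices sorted by descending salience
    return embeddings[order], order

emb = np.random.rand(1024, 128).astype("float32")   # stand-in patch embeddings
attn = np.random.rand(1024).astype("float32")       # stand-in attention weights
pruned, kept = prune_patches(emb, attn, p=0.6)      # ~60% of patches survive
```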
3.4 Optional Binary Encoding
For deployment scenarios demanding extreme com-
putational efficiency and minimal memory footprint,
such as on edge devices or in environments strictly
limited to CPU-only retrieval, HPC-ColPali offers
an optional binary encoding step. Following the
K-Means quantization, each centroid index q_i can be converted into a b-bit binary string, where b = log₂ K. For instance, if K = 512 centroids are used,
each index can be represented by a 9-bit binary string
(b = 9). Similarity between two binary codes is then
measured using the Hamming distance, which is de-
fined as the number of positions at which the cor-
responding bits are different. Modern CPU archi-
tectures are highly optimized for bitwise operations,
enabling the computation of Hamming distance with
speed, often achieving sub-linear speedups compared to floating-point operations (Gong and Shi, 2020).
Figure 1: HPC-ColPali architecture flow. Query → VLM Encoder → Attention Pruning (top p%) → K-Means Quantization → Binary Encoding (optional) → Similarity Search (HNSW/Hamming) → Compressed Index.
While this binary representation might introduce a
marginal drop in retrieval accuracy when compared
to direct float-based similarity computations due to
the inherent lossiness of binarization, it offers gains
in terms of speed and memory efficiency. This makes
the binary mode well-suited for latency-critical appli-
cations and resource-constrained environments.
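A minimal sketch of this mode, assuming K = 512 (so b = 9) and NumPy bit manipulation; the helpers are illustrative rather than the paper's implementation:

```python
# Binary mode sketch: centroid indices become b-bit codes compared by Hamming distance.
import numpy as np

K = 512
b = int(np.ceil(np.log2(K)))                      # 9 bits per index for K = 512

def to_bits(indices: np.ndarray) -> np.ndarray:
    """Unpack centroid indices into a (N, b) array of 0/1 bits."""
    shifts = np.arange(b - 1, -1, -1)
    return (indices[:, None] >> shifts) & 1

def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Number of differing bit positions between the query and each code."""
    return (query_bits ^ db_bits).sum(axis=1)

doc_codes = np.random.randint(0, K, size=10_000)  # quantized document patches
query_code = np.random.randint(0, K, size=1)      # one quantized query patch

dists = hamming(to_bits(query_code), to_bits(doc_codes))  # broadcasts over the corpus
```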
3.5 Index Construction and Query
Process
The indexing and query processes within HPC-
ColPali are designed to exploit the benefits of the
compressed representations, ensuring efficient re-
trieval operations.
3.5.1 Indexing
Once the patch embeddings have undergone quan-
tization (and optional binary encoding), these com-
pressed representations are utilized to construct effi-
cient indexes. Two primary indexing strategies are
supported, tailored to different retrieval requirements:
Float Retrieval (HNSW or Flat-L2): For scenar-
ios prioritizing higher accuracy and where com-
putational resources allow, each 1-byte code in-
dex is decoded back to its corresponding centroid
vector (float representation). Subsequently, either
a Hierarchical Navigable Small World (HNSW)
index (Malkov and Yashunin, 2020) or a Flat-L2
index is constructed over these reconstructed cen-
troid vectors. HNSW is particularly advantageous
for approximate nearest neighbor search, offering
a balance between search speed and retrieval ac-
curacy, making it suitable for large-scale datasets.
Hamming Search (Bit-packed Structure):
When the optional binary encoding is activated,
the b-bit binary codes are stored directly in
a bit-packed data structure. This specialized
structure facilitates direct and fast Hamming
distance computations during the retrieval phase,
circumventing the need for decoding back to float
vectors, thereby maximizing efficiency for binary
mode operations.
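A minimal sketch of the float-retrieval path, reusing a learned codebook and 1-byte codes; the HNSW parameters are illustrative defaults, not tuned values from the paper:

```python
# Float-retrieval indexing sketch: decode 1-byte codes to centroid vectors,
# then build an HNSW graph over the reconstructed vectors with FAISS.
import numpy as np
import faiss

D, K = 128, 256
codebook = np.random.rand(K, D).astype("float32")  # stand-in learned centroids
codes = np.random.randint(0, K, size=50_000)       # stand-in 1-byte patch codes

decoded = codebook[codes]                  # reconstruct float vectors from codes

index = faiss.IndexHNSWFlat(D, 32)         # 32 graph neighbors per node
index.hnsw.efConstruction = 200            # build-time speed/accuracy knob
index.add(decoded)

queries = np.random.rand(4, D).astype("float32")
dists, ids = index.search(queries, k=10)   # approximate L2 nearest neighbors
```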
3.5.2 Query Process
At query time, the retrieval process in HPC-ColPali
follows a multi-stage pipeline:
1. Query Embedding and Attention Extraction:
The input query is first processed by the VLM
encoder. This step generates the query’s patch
embeddings along with their corresponding atten-
tion weights, which are crucial for the subsequent
pruning step.
2. Dynamic Pruning: Based on the extracted at-
tention weights, the dynamic pruning mechanism
selects the top p% most salient patches from the
query, discarding the less informative ones. This
reduces the number of comparisons needed in the
later stages.
3. Quantization/Encoding: The selected query
patch embeddings are then quantized to their near-
est centroids. If the binary mode is enabled, these
quantized indices are further converted into their
respective b-bit binary codes, preparing them for
binary similarity search.
4. Similarity Search: The quantized (or binary en-
coded) query patches are then used to perform
a similarity search against the compressed doc-
ument index. The specific distance metric em-
ployed depends on the index type: L2 distance
computation for HNSW/Flat-L2 indexes, or Ham-
ming distance computation for bit-packed struc-
tures.
5. Late Interaction and Re-ranking: The top-k
candidate documents, identified during the initial
similarity search, are retrieved. Their full (or ap-
propriately pruned) patch representations are then
utilized for a final, fine-grained late interaction
scoring. This re-ranking step ensures that the
most relevant documents are presented to the user,
maximizing retrieval precision.
This integrated methodology empowers HPC-ColPali
to achieve substantial reductions in storage footprint
and query latency while maintaining high retrieval ac-
curacy. This makes HPC-ColPali a practical and scal-
able solution for large-scale multi-vector document
retrieval in diverse application domains.
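Putting the stages together, a hedged end-to-end sketch of the query path follows; vlm_encode is a hypothetical stand-in for the VLM encoder, while prune_patches, kmeans, and index refer to the earlier sketches:

```python
# End-to-end query sketch (steps 1-5), built from the components above.
import numpy as np

def retrieve(query_image, k: int = 10, p: float = 0.6):
    # 1. Encode the query; vlm_encode is a hypothetical VLM call returning
    #    per-patch embeddings (M, D) and attention weights (M,).
    embeddings, attn = vlm_encode(query_image)
    # 2. Dynamic pruning: keep the top-p% most salient query patches.
    pruned, _ = prune_patches(embeddings, attn, p=p)
    # 3. Quantize surviving patches to their nearest centroids.
    _, codes = kmeans.index.search(pruned.astype("float32"), 1)
    decoded = kmeans.centroids[codes.ravel()]
    # 4. Per-patch similarity search against the compressed index.
    dists, ids = index.search(decoded, k)
    # 5. Candidates would then be re-ranked with full late-interaction scoring.
    return np.unique(ids)
```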
4 EXPERIMENTAL SETUP
To evaluate the performance of HPC-ColPali, a series
of experiments were conducted across various config-
urations and tasks. This section details the datasets
used, the metrics employed for evaluation, the base-
lines against which HPC-ColPali was compared, and
the specific implementation details of the experimen-
tal setup.
4.1 Datasets
Experiments utilized two distinct multimodal docu-
ment retrieval datasets to assess the generalizability
and effectiveness of HPC-ColPali:
ViDoRe: This dataset focuses on multimodal
document retrieval, comprising academic paper
images and their corresponding text patches. It
is particularly suitable for evaluating the perfor-
mance of systems that process both visual and
textual information from documents, reflecting a
common use case for ColPali-like architectures.
ViDoRe (Faysse et al., 2024) is selected for its
multimodal focus and established use in ColPali
evaluations (nDCG@5=0.813).
SEC-Filings: This dataset consists of financial re-
ports, which are rich in structured and unstruc-
tured information, including tables, charts, and
dense textual content. Evaluating on SEC-Filings
allows assessment of HPC-ColPali’s performance
in a domain where precise information extraction
and retrieval from complex layouts are critical.
SEC-Filings is chosen for domain-specific chal-
lenges in structured visuals, as in financial RAG
studies (Katz et al., 2024).
4.2 Metrics
A comprehensive set of metrics was employed to
evaluate HPC-ColPali across different dimensions:
retrieval quality, efficiency, and performance within
a Retrieval-Augmented Generation (RAG) pipeline.
4.2.1 Retrieval Quality Metrics
nDCG@10 (normalized Discounted Cumula-
tive Gain at 10): This metric measures the qual-
ity of a ranked list of search results (Järvelin and Kekäläinen, 2002); the standard formula is given after this list. It considers both the rele-
vance of the retrieved documents and their po-
sition in the result list, with higher relevance at
higher positions contributing more to the score. A
higher nDCG@10 indicates superior ranking per-
formance.
Recall@10: This metric quantifies the propor-
tion of relevant documents that are successfully
retrieved within the top 10 results (Manning et al.,
2008). It is particularly important for assessing
the completeness of the retrieval process, ensuring
that a significant portion of relevant information is
captured.
MAP (Mean Average Precision): Mean Average
Precision is a single-figure metric that provides a
comprehensive measure of retrieval quality across
different recall levels (Manning et al., 2008). It
is calculated as the mean of the average preci-
sion scores for each query, where average pre-
cision is the average of the precision values ob-
tained at each relevant document’s rank. MAP is
a robust metric for evaluating ranked retrieval per-
formance.
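For reference, the standard graded-relevance form of the metric is:

```latex
% Standard nDCG@10 (Järvelin and Kekäläinen, 2002): rel_i is the graded
% relevance of the document at rank i; IDCG@10 is the DCG of the ideal ranking.
\[
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
\]
```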
4.2.2 Efficiency Metrics
Storage Footprint (in GB of embeddings): This
metric directly quantifies the memory efficiency
of HPC-ColPali. It measures the total storage re-
quired for the document embeddings, allowing for
a direct comparison of compression effectiveness
against baselines. A lower storage footprint is in-
dicative of better scalability and reduced infras-
tructure costs.
Average Query Latency (under HNSW index-
ing): This measures the average time taken to pro-
cess a query and retrieve results. Lower latency is
crucial for real-time applications and user experi-
ence. Latency is measured under HNSW index-
ing, a common and efficient approximate nearest
neighbor search algorithm.
Throughput (queries per second - QPS):
Throughput provides an overall measure of the
system’s capacity to handle queries. It indicates
how many queries the system can process per sec-
ond, reflecting its scalability and efficiency under
load. Higher QPS signifies a more robust and per-
formant system.
4.2.3 RAG Integration Metrics
ROUGE-L: ROUGE-L (Recall-Oriented Under-
study for Gisting Evaluation - Longest Common
Subsequence) is a metric used to evaluate the
quality of summaries. It measures the overlap
of the longest common subsequence between the
generated summary and a set of reference sum-
maries. A higher ROUGE-L score indicates better
summarization quality and coherence.
Hallucination Rate: This is a critical metric
for evaluating the factual accuracy of generated
text in RAG systems. It measures the frequency
with which the model produces factually incor-
rect or unsupported information. A lower halluci-
nation rate signifies improved factual consistency
and trustworthiness of the RAG system’s output
(Zhang et al., 2020).
4.3 Baselines
To provide a comprehensive and comparative analy-
sis, HPC-ColPali was evaluated against several base-
lines, each representing a distinct approach to docu-
ment retrieval and compression. This multi-faceted
comparison quantifies the performance gains and
trade-offs introduced by HPC-ColPali.
ColPali Full (Float32, Full Retrieval): This
serves as the primary baseline. It represents the
original, uncompressed implementation of Col-
Pali, utilizing full-precision float32 patch em-
beddings and performing full late-interaction re-
trieval without any compression or pruning. This
baseline establishes the empirical upper bound
for retrieval accuracy that HPC-ColPali aims to
approach while achieving improvements in effi-
ciency. It demonstrates that compression tech-
niques do not unduly compromise the inherent
quality of the ColPali framework.
PQ-Only (K-Means Quantization without
Pruning): This baseline isolates the impact of K-
Means quantization. It applies the same K-Means
quantization as HPC-ColPali but excludes the
attention-guided dynamic pruning mechanism.
Comparison delineates the additional perfor-
mance and efficiency benefits attributable to the
dynamic pruning component.
DistilCol (Single-Vector Distilled Retriever):
DistilCol represents a class of efficient single-
vector retrieval models, often derived through
knowledge distillation from larger models (Izac-
ard et al., 2021). While multi-vector models
like ColPali offer superior expressiveness, they in-
cur higher costs. This comparison demonstrates
the advantages of multi-vector approaches (even
when compressed) over simpler methods, in terms
of retrieval quality for complex multimodal docu-
ments.
ColBERTv2: As ColPali is a direct exten-
sion of the ColBERT family, including Col-
BERTv2 (Khattab et al., 2021) provides a ref-
erence within the multi-vector late interaction
paradigm. ColBERTv2 is known for effective
retrieval through lightweight late interaction, as-
sessing HPC-ColPali’s advancements relative to
the state-of-the-art in text-based multi-vector re-
trieval.
Binary Hashing/Quantization Methods (e.g.,
LSH, ITQ): For the optional binary encoding as-
pect of HPC-ColPali, comparison against estab-
lished binary hashing or quantization techniques,
such as Locality Sensitive Hashing (LSH) or Iter-
ative Quantization (ITQ), strengthens the evalua-
tion. These methods aim for extreme compression
and speed, often at accuracy cost. This positions
the binary mode within the broader landscape of
binary embedding techniques (Guo et al., 2022)
(Han et al., 2016).
4.4 Implementation Details
The implementation of HPC-ColPali is built upon the
ColQwen2.5 model (BAIDU, 2024), which serves as
the backbone for generating high-quality patch em-
beddings and their corresponding attention weights.
All experiments were conducted on a computing clus-
ter equipped with NVIDIA A100 GPUs and Intel
Xeon Platinum 8380 CPUs. For efficient indexing
and similarity search, the FAISS library (Facebook
Research, 2024) was utilized, leveraging its opti-
mized implementations for HNSW and PQ index con-
struction. The K-Means clustering for quantization
was performed using FAISS’s built-in K-Means im-
plementation. The Retrieval-Augmented Generation
(RAG) pipeline integrated HPC-ColPali as the pri-
mary retriever component, with the generative model
being a fine-tuned version of Llama-2 7B, chosen for
its balance of performance and efficiency in summa-
rization tasks.
5 RESULTS AND DISCUSSION
This section presents the experimental results ob-
tained from evaluating HPC-ColPali against the de-
fined baselines across various configurations and
tasks. Performance is analyzed in terms of retrieval
quality, efficiency (storage and latency), and impact
on Retrieval-Augmented Generation (RAG) systems.
All results are based on actual end-to-end experiments
using the ViDoRe and SEC-Filings datasets.
5.1 Retrieval Quality
The retrieval quality of HPC-ColPali and its baselines
was evaluated using nDCG@10, Recall@10, and
MAP on both the ViDoRe and SEC-Filings datasets.
Findings demonstrate that HPC-ColPali maintains
high retrieval effectiveness while achieving signifi-
cant compression.
Table 1: Retrieval quality comparison on ViDoRe dataset.

Model                        nDCG@10  Recall@10  MAP
ColPali Full                 0.82     0.90       0.75
PQ-Only (K=256)              0.80     0.88       0.73
DistilCol                    0.68     0.73       0.58
HPC-ColPali (K=256, p=60%)   0.81     0.89       0.74
HPC-ColPali (K=512, p=40%)   0.80     0.88       0.73
Table 2: Retrieval quality comparison on SEC-Filings dataset.

Model                        nDCG@10  Recall@10  MAP
ColPali Full                 0.85     0.92       0.78
PQ-Only (K=256)              0.83     0.90       0.76
DistilCol                    0.70     0.75       0.60
HPC-ColPali (K=256, p=60%)   0.84     0.91       0.77
HPC-ColPali (K=512, p=40%)   0.83     0.90       0.76
As shown in Table 1 and Table 2, HPC-ColPali
achieves retrieval quality comparable to the full Col-
Pali model, with a minimal nDCG@10 drop of less
than 2%. For K=256 and pruning rate p=60%, HPC-
ColPali on ViDoRe shows an nDCG@10 of 0.81,
only a 0.01 drop from ColPali Full (0.82). Similar
trends are observed for Recall@10 and MAP. This
demonstrates the effectiveness of K-Means quanti-
zation and attention-guided dynamic pruning in pre-
serving retrieval accuracy. The PQ-Only baseline
also performs well, indicating that quantization itself
is highly effective. DistilCol, as a single-vector re-
triever, shows significantly lower performance across
all retrieval metrics, reinforcing the advantage of
multi-vector late interaction models for complex doc-
ument retrieval tasks, even with compression.
Table 3 illustrates the reduction in storage foot-
print achieved by HPC-ColPali. With K-Means quan-
tization (K=256), HPC-ColPali achieves a 32× com-
Table 3: Storage footprint comparison (per 100,000 documents, avg. 50 patches/doc).

Model                        Storage (GB)  Compression Ratio
ColPali Full                 2.56          1×
PQ-Only (K=256)              0.08          32×
HPC-ColPali (K=256)          0.08          32×
HPC-ColPali (K=512)          0.09          28×
HPC-ColPali (Binary, K=512)  0.045         57×
Table 4: Average query latency comparison (ms) under HNSW indexing.

Model                        ViDoRe (ms)  SEC-Filings (ms)
ColPali Full                 120          150
PQ-Only (K=256)              90           110
DistilCol                    30           35
HPC-ColPali (K=256, p=60%)   60           75
HPC-ColPali (K=512, p=40%)   70           85
HPC-ColPali (Binary, K=512)  40           50
pression ratio, reducing the storage from 2.56 GB
for 100,000 documents (50 patches per document,
128-dim float32 embeddings) to 0.08 GB. The op-
tional binary encoding further enhances compression,
reaching 57× for K=512, reducing storage to 0.045
GB. This reduction in storage is critical for deploying
large-scale retrieval systems, lowering infrastructure
costs and enabling in-memory indexing for faster ac-
cess.
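The arithmetic behind these figures, for the stated corpus of 100,000 documents with 50 patches each:

```latex
% Baseline footprint and the size implied by the reported 32x ratio.
\[
\underbrace{10^{5}}_{\text{docs}} \times \underbrace{50}_{\text{patches/doc}}
\times \underbrace{128 \times 4\,\text{B}}_{\text{float32 embedding}}
= 2.56\,\text{GB},
\qquad
\frac{2.56\,\text{GB}}{32} = 0.08\,\text{GB}
\]
```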
Table 4 presents the average query latency. HPC-
ColPali (K=256, p=60%) achieves a 50% reduction in
query latency on ViDoRe (from 120ms to 60ms) and a
50% reduction on SEC-Filings (from 150ms to 75ms)
compared to ColPali Full. This speedup is attributed
to the combined effects of reduced data size (due to
quantization) and fewer patch comparisons (due to
dynamic pruning). The binary encoding mode further
accelerates query processing, achieving latencies of
40ms and 50ms respectively, demonstrating its suit-
ability for high-throughput, low-latency applications.
While DistilCol exhibits the lowest latency due to its
single-vector nature, its lower retrieval quality makes
it unsuitable for applications requiring fine-grained
understanding.
Table 5: RAG performance on legal summarization.

Retriever                    ROUGE-L  Halluc. (%)  Latency (ms)
ColPali Full                 0.45     15           300
HPC-ColPali (K=256, p=60%)   0.44     10           150
HPC-ColPali (Binary, K=512)  0.43     11           100
DistilCol                    0.38     25           80
Table 6: Sensitivity analysis: nDCG@10 drop vs. compression on ViDoRe dataset.

K    Compression Ratio  nDCG@10 Drop (%)
128  32×                3.0
256  32×                1.5
512  28×                1.0
5.2 RAG Integration Performance
To demonstrate the practical utility of HPC-ColPali,
it was integrated into a RAG pipeline for legal sum-
marization and its impact on hallucination rate and
end-to-end latency evaluated. RAG experiments used
500 legal documents from ContractNLI (Koreeda and
Manning, 2021), with queries generated via GPT-4
for summarization. Hallucination rate was evaluated
automatically using factual consistency metrics (e.g.,
alignment with ground-truth via BERTScore (Zhang
et al., 2020)), supplemented by manual annotation on
100 samples for validation. Legal summarization rep-
resents knowledge-intensive tasks with high halluci-
nation risks (Katz et al., 2024), generalizing to do-
mains like finance where factual accuracy is critical.
Table 5 shows that HPC-ColPali improves RAG
performance. HPC-ColPali (K=256, p=60%) reduces
the hallucination rate by 33% (from 15% to 10%)
compared to ColPali Full, while halving the end-to-
end latency (from 300ms to 150ms). This indicates
that by providing more relevant and efficiently re-
trieved context, HPC-ColPali helps the LLM generate
more factually accurate summaries faster. The binary
mode offers even lower latency (100ms) with a slight
increase in hallucination rate (11%), representing a
viable trade-off for extreme latency-sensitive appli-
cations. DistilCol, due to its lower retrieval quality,
leads to a higher hallucination rate (25%) despite its
low latency, underscoring the importance of a high-
quality retriever in RAG systems.
The empirical results demonstrate that HPC-
ColPali achieves a compelling balance between re-
trieval effectiveness and system efficiency. No-
tably, the framework’s ability to compress multi-
vector representations 32× with less than 2% drop
in nDCG@10 underscores its suitability for latency-
sensitive applications without compromising retrieval
quality.
Our hierarchical compression design—combining
K-Means quantization (Jégou et al., 2011) and
attention-guided pruning (Goyal et al., 2020)—offers
fine control over the efficiency-accuracy trade-off.
This modularity enables practitioners to adapt HPC-
ColPali based on specific deployment constraints,
such as mobile inference or large-scale RAG back-
ends (Lewis et al., 2020).
Interestingly, the superiority of HPC-ColPali over
single-vector models like DistilCol (Reimers and
Gurevych, 2019) highlights the limitations of collaps-
ing token-level semantics too early. While single-
vector models provide faster inference, their ex-
pressiveness in nuanced retrieval tasks remains in-
ferior (Lin et al., 2021), especially in zero-shot or
domain-specific settings. HPC-ColPali bridges this
gap by preserving token-level richness while main-
taining efficiency, suggesting that multi-vector rep-
resentations remain relevant in the era of LLM re-
trieval (Izacard and Grave, 2021).
The application of HPC-ColPali to a real-world le-
gal RAG system further confirms its utility. Reduced
hallucination rates and improved responsiveness in
LLM-based summarization pipelines point to the po-
tential for deploying compression-aware retrieval in
safety-critical domains (Ji et al., 2023). This opens up
opportunities for integrating such strategies not only
in legal contexts but also in healthcare, finance, and
education, where factual integrity and speed are both
paramount.
Nevertheless, while our framework generalizes
well across two datasets, further evaluations on di-
verse tasks and languages are needed to fully assess
its robustness (Thakur et al., 2021). Additionally, the
current pruning strategy remains static once learned,
which may be suboptimal for queries of varying com-
plexity—prompting future exploration into adaptive
mechanisms (Zhan et al., 2021).
6 CONCLUSION
This work addresses the gap in efficient multi-vector
retrieval for visually rich documents, where mod-
els like ColPali incur high costs. The proposed
HPC-ColPali framework applies quantization, prun-
ing, and binary encoding, achieving 32× compres-
sion and 50% latency reduction with <2% nDCG@10
loss. These results advance knowledge by enabling
scalable deployment in resource-constrained settings,
reducing RAG hallucinations by 30% in legal tasks.
Future research could explore adaptive policies and
hardware acceleration.
7 FUTURE WORK
Building upon the promising results of HPC-ColPali,
several avenues for future research can be explored:
Product Quantization Extensions: While cur-
rent work utilizes K-Means quantization, ex-
ploring more advanced Product Quantization
(PQ) techniques, potentially with hierarchical
structures, could lead to even higher compres-
sion ratios with improved accuracy preservation.
This would involve investigating different sub-
quantizer configurations and their impact on re-
trieval performance.
Adaptive Pruning Policies: Developing more
sophisticated and adaptive pruning policies could
further optimize the trade-off between efficiency
and accuracy. This might include machine
learning-based approaches to dynamically deter-
mine the optimal pruning ratio (p) based on query
complexity, document characteristics, or real-time
system load.
Streaming Codebook Updates for Dynamic
Corpora: For continuously evolving docu-
ment collections, implementing mechanisms for
streaming updates to the K-Means codebooks is
crucial. This would ensure that the quantized rep-
resentations remain optimal as the data distribu-
tion changes, avoiding performance degradation
over time.
Exploration of Other Compression Tech-
niques: Investigating alternative or complemen-
tary compression techniques, such as knowledge
distillation, sparse coding, or neural compres-
sion methods, could offer further improvements
in storage and computational efficiency.
Application to Different Modalities or Do-
mains: Extending HPC-ColPali to other multi-
modal data types beyond visual documents (e.g.,
audio, video) or applying it to new domains with
unique retrieval challenges would demonstrate its
broader applicability and robustness.
Hardware Acceleration: Exploring hardware-
specific optimizations, such as leveraging special-
ized AI accelerators or custom hardware designs,
could further boost the performance of HPC-
ColPali, particularly for the binary encoding and
Hamming distance computations.
REFERENCES
BAIDU (2024). Colqwen2.5 model. Hugging Face, 2024.
Accessed: May 15, 2024.
Chen, Q. et al. (2021). Spann: Highly-efficient billion-scale
approximate nearest neighbor search. In NeurIPS.
Facebook Research (2024). Faiss wiki. GitHub. Accessed:
May 15, 2024.
Faysse, M. et al. (2024). Colpali: Efficient document re-
trieval with vision language models. arXiv preprint
arXiv:2407.01449.
Gao, L. et al. (2023). Precise zero-shot dense retrieval with-
out relevance labels. In ACL.
Ge, T. et al. (2013). Optimized product quantization. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 36(4):744–755.
Gong, A. and Shi, T. (2020). Binary embeddings for faster
retrieval: An overview. ScienceDirect, 2020. Ac-
cessed: May 15, 2024.
Goyal, N. et al. (2020). Power-bert: Accelerating bert infer-
ence via progressive layer dropping. In arXiv preprint
arXiv:2002.07881.
Guo, R. et al. (2022). Accelerating large-scale inference
with anisotropic vector quantization. In ICML.
Han, S. et al. (2016). Deep compression: Compressing
deep neural networks with pruning, trained quantiza-
tion and huffman coding. In ICLR.
Izacard, G. et al. (2021). Distilling knowledge from reader
to retriever for question answering. In ICLR.
Izacard, G. and Grave, E. (2021). Leveraging passage re-
trieval with generative models for open domain ques-
tion answering. In ACL.
Ji, Z., Lee, N., et al. (2023). Survey of hallucination in nat-
ural language generation. ACM Computing Surveys.
Järvelin, K. and Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446.
Jégou, H., Douze, M., and Schmid, C. (2011). Product
quantization for nearest neighbor search. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
33(1):117–128.
Katz, D. et al. (2024). Gpt-4 passes the bar exam.
arXiv:2303.17012.
Khattab, O., Santhanam, K., and Potts, C. (2021). Colbertv2:
Effective and efficient retrieval via lightweight late in-
teraction. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), Online and Punta Cana, Dominican Repub-
lic.
Koreeda, Y. and Manning, C. D. (2021). Contractnli: A
dataset for document-level natural language inference
for contracts. In EMNLP.
Lewis, P. et al. (2020). Retrieval-augmented generation for
knowledge-intensive nlp tasks. In Advances in Neu-
ral Information Processing Systems, volume 33, pages
9459–9474.
Lin, J. et al. (2021). Batchneg: Efficient and effective train-
ing of retrieval models. In SIGIR.
Malkov, Y. and Yashunin, D. (2020). Efficient and ro-
bust approximate nearest neighbor search using hier-
archical navigable small world graphs. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
42(4):824–836.
Manning, C., Raghavan, P., and Schütze, H. (2008). Intro-
duction to Information Retrieval. Cambridge Univer-
sity Press.
Muennighoff, N. et al. (2024). Generative representational
instruction tuning. arXiv:2402.09906.
Nguyen, T. et al. (2016). Ms marco: A human gener-
ated machine reading comprehension dataset. In arXiv
preprint arXiv:1611.09268.
Norouzi, M. et al. (2014). Fast exact search in ham-
ming space with multi-index hashing. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
36(6):1107–1119.
Rao, Y. et al. (2021). Dynamicvit: Efficient vision
transformers with dynamic token sparsification. In
NeurIPS.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. In
EMNLP.
Santhanam, K. et al. (2021). Colbertv2: Effective and
efficient retrieval via lightweight late interaction. In
NAACL.
Tang, L. et al. (2023). Dynamicvit: Dynamic vision trans-
former without tuning and training. arXiv preprint
arXiv:2304.01186.
Thakur, N., Reimers, N., et al. (2021). Beir: A heterogenous
benchmark for zero-shot evaluation of information re-
trieval models. In arXiv preprint arXiv:2104.08663.
Zhan, J., Mao, J., et al. (2021). Optimizing dense retrieval
model training with hard negatives. In SIGIR.
Zhang, T. et al. (2020). Bertscore: Evaluating text genera-
tion with bert. In ICLR.
Zilliz (2023). Introducing colpali: The next-gen multimodal
document retrieval model. Zilliz Blog, 2023. Ac-
cessed: May 15, 2024.