Simple Base Frame Guided Residual Network for RAW Burst Image
Super-Resolution
Anderson Nogueira Cotrim¹, Gerson Barbosa²,³, Cid Adinam Nogueira Santos² and Helio Pedrini¹

¹Institute of Computing, University of Campinas, Campinas, SP, 13083-852, Brazil
²Eldorado Research Institute, Campinas, SP, 13083-898, Brazil
³São Paulo State University, Guaratinguetá, SP, 12516-410, Brazil

ORCID: A. N. Cotrim: https://orcid.org/0009-0006-8115-589X; G. Barbosa: https://orcid.org/0000-0002-1147-2519; C. A. N. Santos: https://orcid.org/0000-0002-9278-5356; H. Pedrini: https://orcid.org/0000-0003-0125-630X
Keywords:
Super-Resolution, Deep Learning, RAW Image, Multi-Frame, Burst.
Abstract:
Burst super-resolution or multi-frame super-resolution (MFSR) has gained significant attention in recent years, particularly in the context of mobile photography. With modern handheld devices steadily gaining processing power and the ability to capture multiple images ever faster, the development of robust MFSR algorithms has become increasingly feasible. Furthermore, in contrast to the extensively studied single-image super-resolution (SISR), burst super-resolution mitigates the ill-posed nature of reconstructing high-resolution images from low-resolution ones by merging information from multiple shifted frames. This research introduces a novel and effective deep learning approach, SBFBurst, designed to tackle this challenging problem. Our network takes multiple noisy RAW images as input and generates a denoised, super-resolved RGB image as output. We demonstrate that significant enhancements can be achieved on this problem by incorporating base frame-guided mechanisms through operations such as feature map concatenation and skip connections. Additionally, we highlight the significance of employing mosaicked convolution to improve alignment and thus the overall network performance in super-resolution tasks. These relatively simple improvements underscore the competitiveness of our proposed method when compared to other state-of-the-art approaches.
1 INTRODUCTION
In recent times, smartphones have become the preferred devices for capturing images, surpassing
the popularity of traditional digital cameras (Lafen-
etre et al., 2023). While smartphone cameras have
made significant strides in enhancing image quality
and resolution, they still face limitations when com-
pared to professional DSLR cameras. A primary
limitation concerns the small sensor size, which de-
mands a fine balance between noise levels and resolu-
tion. While increasing resolution using smaller pixel
sensors is possible, it comes at the cost of compro-
mising image quality, particularly in low-light condi-
tions, where noise becomes more prominent. Further-
more, optical zoom capabilities in smartphone cam-
eras remain significantly inferior to those in profes-
sional cameras, primarily due to physical constraints.
Large lenses, which are essential for high-quality op-
tical zoom, would be impractical in portable devices.
Given these constraints inherent to portable cameras, which are increasingly difficult to address through sensor enhancements alone, coupled
with the ongoing evolution of artificial intelligence
and onboard processing capabilities, researchers have
dedicated their efforts to developing deep learning al-
gorithms aimed at improving image quality. These
techniques encompass various domains, including
super-resolution, noise reduction, and high dynamic
range (HDR).
Super-resolution (SR) is an extremely challeng-
ing and relevant task that consists of the generation
of a high-resolution (HR) image from one or several
low-resolution (LR) observations. This advancement
could further reduce the gap in image
quality between small devices and professional cam-
eras, enabling improved zoom capabilities without re-
quiring larger lenses.
In recent years, the SR community has primar-
ily focused on the single-image super-resolution task
(SISR), where an HR image is estimated from a single
LR input. This is a severely ill-posed problem, and methods must hallucinate high-frequency information that is limited to previously learned image priors. Conversely, the multi-frame super-resolution
(MFSR) approach has recently gained significant in-
terest (Figure 1). In MFSR, the objective is to recon-
struct the original HR image using multiple LR im-
ages. These images inherently feature spatial shifts
induced by natural hand movements, and recent re-
search (Wronski et al., 2019) has convincingly shown
that these shifts can produce multiple aliased repre-
sentations of the underlying scene. This phenomenon
allows for the aggregation of subpixel information
from several images of the same scene, thereby mit-
igating the ill-posed nature of single-image super-
resolution.
Figure 1: Number of publications every two years since 2003 in major conferences and journals on the topic of multi-frame or burst super-resolution, according to Web of Science.
This work develops a simple but effective deep-
learning method, denoted as Simple Base Frame
Burst (SBFBurst), to address the burst image super-
resolution problem. This is particularly significant
given the context previously outlined. Additionally,
our emphasis is placed on RAW images, as they con-
tain more information than RGB images already processed by the ISP (Image Signal Processor), potentially leading to
improved accuracy in high-resolution image recon-
struction.
The main contributions of this work are as fol-
lows:
- Proposition of a simple yet effective deep convolutional architecture that benefits from base frame guidance and deformable convolution (Dai et al., 2017) for solving burst image super-resolution.
- Experimental evaluation of the effectiveness of incorporating base frame guidance into the convolutional network flow design, whether through concatenation or skip connections.
- Use of SpyNet (Ranjan and Black, 2017) on a mosaicked convolutional feature map (Cilia et al., 2023) to obtain optical flows between frames, which can guide the deformable convolutions (Dai et al., 2017; Chan et al., 2022) toward better-aligned features.
- Employment of a mixed gradient loss (Lu and Chen, 2019) to guide our network toward better edge reconstruction.
- Contrary to some methods, the proposed method can deal with an arbitrary number of frames in a permutation-invariant way without losing accuracy.
- Our experiments, conducted on both synthetic and real-world datasets, indicate that our approach not only surpasses state-of-the-art methods in both quantitative and qualitative measures but also exhibits efficient inference capabilities (Figure 2).
Figure 2: Comparison of performance and runtime on Syn-
theticBurst dataset (Bhat et al., 2021a). Our method out-
performs others, even when utilizing a smaller number of
frames.
The text is structured as follows. In Section 2,
we provide a concise overview of relevant works in
the field of super-resolution literature, with a particu-
lar focus on multi-frame super-resolution. Following
that, in Section 3, we introduce our proposed method,
providing an in-depth explanation of the architectural
design and loss function. Moving forward to Sec-
tion 4, we describe the adopted experimental setup,
present the results we have obtained, and engage in
a thorough discussion comparing our findings with
other relevant approaches from the literature. Addi-
tionally, we include a brief ablation study to under-
score the effectiveness of our architectural choices.
Finally, in Section 5, we conclude our discussion with
some final remarks and outline some directions for fu-
ture work.
2 RELATED WORK
Single Image Super-Resolution (SISR). SISR has
been a longstanding field of research in computer vi-
sion that focuses on reconstructing a high-resolution
image from its degraded low-resolution version. Sig-
nificant strides have been made in this field thanks to
deep convolutional learning-based approaches. Since
the first work, SRCNN (Dong et al., 2015), numer-
ous other learning-based methods have emerged in an
effort to address this problem.
Some studies have focused on refining archi-
tectural designs (Kim et al., 2016a; Shi et al.,
2016a; Zhang et al., 2018c; Haris et al., 2018; Lim
et al., 2017), including recursive learning (Kim et al.,
2016b; Ahn et al., 2018), progressive reconstruc-
tion (Wang et al., 2015; Lai et al., 2017), atten-
tion mechanisms (Zhang et al., 2018b; Dai et al.,
2019; Zamir et al., 2020), generative adversarial net-
works (Ledig et al., 2017; Wang et al., 2019), vision
transformers (Lu et al., 2022) and different loss func-
tions (Johnson et al., 2016; Lugmayr et al., 2020).
However, even with all these improvements, it remains challenging to recover rich details for real-world images because of the extremely ill-posed nature of this problem.
Multi-Frame Super-Resolution (MFSR). In order
to address the challenging nature of ill-posed prob-
lems in SISR, the concept of MFSR has been intro-
duced. MFSR involves the fusion of pixel data from
multiple images of the same scene, each with some
spatial displacement, thereby providing supplemen-
tary sub-pixel information to enhance the quality of
image reconstruction (Tsai and Huang, 1984; Hardie,
2008).
The journey of MFSR began with (Tsai and
Huang, 1984), who introduced a pioneering fre-
quency domain approach, assuming that the transla-
tions between input images are known. HighRes-
Net (Deudon et al., 2020), on the other hand, was
explicitly tailored for satellite imagery. It implicitly
aligns each frame with a reference frame and employs
recursive fusion techniques to enhance image quality.
DBSR (Bhat et al., 2021a) took a step further
by introducing a weighted-based fusion mechanism,
which predicted element-wise weights between the
base frame and the other frames, allowing for more
precise fusion. They have also introduced a real-
world dataset named BurstSR, which has played a
crucial role in motivating and encouraging new re-
search endeavors in this field. The evolution con-
tinued with MFIR (Bhat et al., 2021b), which ex-
tended this fusion mechanism to operate in a deep feature space, handling both SR and denoising. LKR (Lecouat et al., 2021) advances the field by proposing an end-to-end approach for joint image alignment and super-resolution from raw burst inputs.
Recent developments in the field, such as BIP-
Net (Dudhane et al., 2022), have recognized the im-
portance of comprehensive fusion. BIPNet introduces
a pseudo-burst fusion strategy by fusing temporal fea-
tures channel-by-channel, resulting in a more robust
approach. However, it is worth noting that BIPNet re-
quires fixing the input frame number, which can be a
limitation in specific scenarios.
Another work (Cilia et al., 2023) introduces a
novel convolutional block, namely mosaicked convo-
lution feature extractor (MCFE), in order to improve
feature extraction directly from raw mosaicked sen-
sor data. The key concept behind MCFE is to extract
high-level features while preserving the Bayer color
arrangement. Through a series of experiments, the
authors demonstrated the effectiveness of this block
and presented competitive results.
In addition to fusion strategies, some recent
works (Dudhane et al., 2023; Mehta et al., 2023) have
explored the use of inter-frame attention-based mech-
anisms to enhance feature interaction. These mech-
anisms enable cross-attention on the channel or spa-
tial dimension, leading to better information exchange
among frames. While these approaches hold great
promise, it is essential to acknowledge that they can
be computationally intensive.
3 PROPOSED METHOD
In this section, we present an in-depth explanation
of our burst super-resolution network, SBFBurst. It
takes multiple low-resolution RAW images captured
quickly in a burst as input and performs denoising,
demosaicking, and super-resolution simultaneously,
resulting in a high-quality RGB output. Since burst images exhibit slight misalignments caused by hand motion during capture, they carry complementary sub-pixel information for super-resolution. By effectively combining all the burst
data, our network can better reconstruct the scene,
resulting in a higher-quality output than single-frame
methods.
Figure 3 contains an overview of the proposed architecture. Our network takes a series of RAW images captured in a burst, denoted as $\{b_i\}_{i=1}^{N}$, where $N$ can be of any size. Each image, represented as $b_i \in \mathbb{R}^{W \times H}$, contains the RAW sensor data from the camera.
Before proceeding further, we independently extract feature representations from each unshuffled burst frame $\hat{b}_i$, resulting in $\{p_i\}_{i=1}^{N}$.
Figure 3: The overall architecture of the proposed SBFBurst for RAW burst image super-resolution, which follows a comprehensive pipeline to transform a RAW burst of degraded images into a clean, high-quality RGB image through five main components: the preencoder, alignment module, encoder, fusion, and decoder. The preencoder relies on a series of residual blocks for feature extraction from each burst frame. Subsequently, deformable convolutional alignment is employed to align these frames, aided by SpyNet (Ranjan and Black, 2017) and MCFE (Cilia et al., 2023). Following alignment, each frame is concatenated with the reference frame, and the encoder generates a deeper embedding. All embeddings are then subjected to a fusion process (Bhat et al., 2021a) that reduces them to a single embedding. Finally, the decoder block upsamples the fused embedding, which is concatenated with the base frame feature, while maintaining a skip connection with an upsampled version of the base frame.
Subsequently, we perform alignment and warping of each feature map $p_i$ to align them with the reference frame, denoted as $b_1$. This alignment is achieved using offsets estimated through mosaicked convolution (Cilia et al., 2023), SpyNet (Ranjan and Black, 2017), and deformable convolutions (Dai et al., 2017).

Additionally, we employ a secondary deeper encoder to create deep feature representations, guided by the base frame $p_1$, leading to $\{e_i\}_{i=1}^{N}$.
In the fusion process, we utilize an attention-based
module built upon the approach presented in (Bhat
et al., 2021a). This module predicts fusion weights
at the element level, enabling the network to dynam-
ically select the most relevant information from each
image within the burst. This process produces a uni-
fied feature map denoted as $\hat{e}$.
Finally, the merged feature map $\hat{e}$, along with the
base frame representation $p_1$, is fed into the decoder module. This module gradually upsamples the input feature map, enhanced by a skip connection derived from the pixel shuffle operation applied to the base frame $\hat{b}_1$. This process generates the ultimate RGB image denoted as $y \in \mathbb{R}^{sW \times sH \times 3}$, where $s$ is the super-resolution factor.
The following subsections detail each module of
our proposed method and the loss function.
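To make the data flow concrete before the module-by-module description, the following is a minimal PyTorch sketch of how the five components could be composed, treating the frame index as the batch dimension. All submodule classes passed to the constructor are hypothetical placeholders standing in for the blocks described in Sections 3.1 to 3.5; this is an illustration of the pipeline under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SBFBurstSketch(nn.Module):
    """Data-flow sketch only: the five submodules are hypothetical placeholders."""
    def __init__(self, preencoder, flow_est, align, encoder, fusion, decoder):
        super().__init__()
        self.preencoder = preencoder   # shallow residual feature extractor (Sec. 3.1)
        self.flow_est = flow_est       # MCFE + SpyNet optical flow (Sec. 3.2)
        self.align = align             # warping + deformable refinement (Sec. 3.2)
        self.encoder = encoder         # deeper residual encoder (Sec. 3.3)
        self.fusion = fusion           # attention-based fusion (Sec. 3.4)
        self.decoder = decoder         # pixel-shuffle upsampling decoder (Sec. 3.5)

    def forward(self, burst):                    # burst: (N, 1, H, W) RAW frames
        packed = F.pixel_unshuffle(burst, 2)     # (N, 4, H/2, W/2) packed Bayer
        feats = self.preencoder(packed)          # p_i for every frame
        base_feat = feats[:1]                    # p_1 (base frame features)
        flows = self.flow_est(burst, burst[:1].expand_as(burst))       # f_i
        aligned = self.align(feats, flows, base_feat)                  # aligned p_i
        embeds = self.encoder(torch.cat(
            [aligned, base_feat.expand_as(aligned)], dim=1))           # e_i
        fused = self.fusion(embeds, flows)                             # fused embedding
        return self.decoder(fused, base_feat, packed[:1])              # RGB output
```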
3.1 Preencoder
We begin by arranging the raw Bayer pattern into 2×2 blocks along the channel dimension, yielding a 4-channel image at half the initial resolution ($\hat{b}_i \in \mathbb{R}^{\frac{W}{2} \times \frac{H}{2} \times 4}$). This lower-resolution image $\hat{b}_i$ is subsequently passed through the preencoder, which consists of an initial convolutional layer followed by a sequence of residual blocks, yielding 64 feature maps for each burst frame. This process is represented in Equation 1:

$$p_i = \mathrm{Preencoder}(\hat{b}_i) \qquad (1)$$
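As a minimal sketch of this step, the snippet below packs the Bayer mosaic with a pixel unshuffle and applies a convolution followed by residual blocks. The residual block design, kernel sizes, and block count are assumptions for illustration; only the 4-channel packed input and the 64 output feature maps follow the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Plain two-convolution residual block (an assumed design)."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Preencoder(nn.Module):
    def __init__(self, n_blocks=5, ch=64):
        super().__init__()
        self.head = nn.Conv2d(4, ch, 3, padding=1)   # 4-channel packed Bayer input
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])

    def forward(self, raw):                          # raw: (B, 1, H, W) mosaicked frame
        packed = F.pixel_unshuffle(raw, 2)           # (B, 4, H/2, W/2), input of Eq. (1)
        return self.blocks(self.head(packed))        # p_i: (B, 64, H/2, W/2)
```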
3.2 Alignment
The absence of precise information regarding pixel-level displacements among the images is one of the main problems faced in burst super-resolution. Several factors, such as scene changes and overall camera motion, can cause these displacements. Aligning the information therefore becomes essential for effectively fusing multiple frames. Our approach tackles this challenge using SpyNet (Ranjan and Black, 2017) for optical flow estimation and deformable convolutions (Dai et al., 2017) for pixel alignment refinement.
We decided to build an alignment module that relies on $b_i$ instead of $\hat{b}_i$. This could be preferable because $b_i$ keeps spatial information, which may be important for better optical flow estimation. Firstly, we employ two mosaicked convolutional layers on each flattened RAW burst image $b_i$ to avoid disrupting the Bayer color arrangement, generating 16 maps. These maps are then reduced to 3 channels by 1×1 convolutional layers to obtain a suitable input for the pre-trained SpyNet designed for RGB images.
SpyNet calculates dense pixel-wise optical flow vectors, denoted as $\hat{f}_i \in \mathbb{R}^{W \times H \times 2}$, between each burst image $b_i$ and the base frame. However, as our network warps the half-resolution feature maps $p_i$, $\hat{f}_i$ must be reduced by half. To achieve the desired optical flow output, we apply average pooling followed by element-wise multiplication by a factor of 0.5 on $\hat{f}_i$, since the displacements were halved by the unshuffle operation applied to the burst $b_i$. This results in the final pixel-wise flow information $f_i \in \mathbb{R}^{\frac{W}{2} \times \frac{H}{2} \times 2}$, which captures not only global camera motion but also any object motion within the scene, relative to the base image $\hat{b}_1$.
The estimated flow vectors $f_i$ are subsequently employed to warp the preencoded feature maps $p_i$ into alignment with the base frame. These aligned maps are further refined using a deformable convolutional network (DCN) inspired by BasicVSR++ (Chan et al., 2022), which outputs the final aligned maps denoted by $\hat{p}_i$. This entire process can be expressed in Equations 2 and 3:
$$f_i = \mathrm{FlowEst}(b_i, b_1) \qquad (2)$$
$$\hat{p}_i = \mathrm{DCN}(\mathrm{Warp}(p_i, f_i), f_i, p_1) \qquad (3)$$
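A minimal sketch of the flow rescaling and warping described above is given below: the SpyNet flow at full RAW resolution is average-pooled and scaled by 0.5 to match the half-resolution feature maps, then used to warp $p_i$ toward the base frame with bilinear sampling. The deformable refinement (DCN) of Equation 3 is omitted, and the `flow_to_grid` helper is written here purely for illustration.

```python
import torch
import torch.nn.functional as F

def downscale_flow(flow_full):
    """flow_full: (B, 2, H, W) flow at RAW resolution -> (B, 2, H/2, W/2)."""
    return 0.5 * F.avg_pool2d(flow_full, kernel_size=2)

def flow_to_grid(flow):
    """Convert pixel displacements into a normalized grid for grid_sample."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)    # (2, H, W)
    coords = base[None] + flow                                     # displaced coordinates
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0              # x to [-1, 1]
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0              # y to [-1, 1]
    return coords.permute(0, 2, 3, 1)                              # (B, H, W, 2)

def warp_features(p_i, flow_half):
    """Warp preencoded features p_i toward the base frame (Eq. 3, before the DCN)."""
    grid = flow_to_grid(flow_half)
    return F.grid_sample(p_i, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```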
3.3 Encoder
The encoder takes as input the aligned feature map $\hat{p}_i$ concatenated with the reference feature $p_1$. These concatenated maps subsequently pass through an initial convolutional layer followed by a sequence of residual blocks, yielding 64 feature maps for each burst frame, as expressed in Equation 4:

$$e_i = \mathrm{Encoder}(\hat{p}_i, p_1) \qquad (4)$$
3.4 Fusion
The fusion module's purpose is to combine information from the individual burst embeddings ($\{e_i\}_{i=1}^{N}$) to create a unified feature representation called $\hat{e}$. In this study, we adopt the fusion module introduced by (Bhat et al., 2021a), an attention-based method that adaptively merges information based on factors such as image content and noise levels.

The fusion process relies on the aligned embeddings $e_i$ and flow vectors $f_i$ to estimate attention weights for each embedding. It is worth noting that, unlike the approach outlined in (Bhat et al., 2021a), we chose to employ the embeddings $e_i$ in their raw form without projecting them into a lower dimension, as initially suggested by the authors.
Subsequently, this merged feature map $\hat{e}$ serves as input for the decoder module responsible for producing the final output. The dependencies of $\hat{e}$ follow Equation 5:

$$\hat{e} = \mathrm{Fusion}(\{e_i, f_i\}_{i=1}^{N}) \qquad (5)$$
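The snippet below is a minimal sketch of an attention-based fusion in the spirit of the module described above (after Bhat et al., 2021a): a small network predicts element-wise weights from each embedding and its flow, the weights are normalized across frames with a softmax, and the embeddings are combined by a weighted sum. The weight predictor architecture and the softmax normalization are assumptions, not the exact design of the cited module.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Predicts per-pixel, per-channel weight logits from embedding + 2-channel flow.
        self.weight_predictor = nn.Sequential(
            nn.Conv2d(ch + 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, embeddings, flows):
        # embeddings: list of N tensors (B, C, H, W); flows: list of N tensors (B, 2, H, W)
        logits = [self.weight_predictor(torch.cat([e, f], dim=1))
                  for e, f in zip(embeddings, flows)]
        weights = torch.softmax(torch.stack(logits, dim=0), dim=0)   # normalize over frames
        stacked = torch.stack(embeddings, dim=0)                     # (N, B, C, H, W)
        return (weights * stacked).sum(dim=0)                        # fused embedding
```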
3.5 Decoder
The decoder module is responsible for generating the high-resolution RGB output image from the fused feature map $\hat{e}$. Initially, we concatenate the base feature map $p_1$ with the fused embedding $\hat{e}$, then pass the result through a residual network.

To achieve the desired resolution of $sH \times sW$, we employ sub-pixel convolution (Shi et al., 2016b) for upsampling. This operation is also applied to the base frame image $\hat{b}_1$, and the data is merged using element-wise summation at each appropriate resolution. Lastly, we apply a final convolution operation to obtain the ultimate RGB image output $\hat{Y}$. This whole process can be expressed in Equation 6:
$$\hat{Y} = \mathrm{Decoder}(\hat{e}, p_1, \hat{b}_1) \qquad (6)$$
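As a minimal sketch of the decoder, the code below concatenates the fused embedding with the base frame features, upsamples with sub-pixel convolution (PixelShuffle), and adds a skip connection from the packed base frame. The number of stages, channel widths, and the fact that the skip is summed only once at the final resolution (rather than at every resolution, as described above) are simplifying assumptions; three ×2 stages are used because the features live at half the input resolution and the target factor is ×4.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.trunk = nn.Sequential(                      # residual-style trunk on [e_hat, p_1]
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        # Each stage: conv to 4*ch then PixelShuffle(2) (sub-pixel convolution).
        self.up_feat = nn.Sequential(
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2))
        self.up_base = nn.Sequential(                    # same upsampling on the packed base frame
            nn.Conv2d(4, 4 * ch, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2))
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, fused, p1, b1_packed):
        x = self.trunk(torch.cat([fused, p1], dim=1))    # base-frame guidance via concat
        x = self.up_feat(x) + self.up_base(b1_packed)    # skip connection from base frame
        return self.to_rgb(x)                            # (B, 3, sH, sW)
```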
3.6 Loss Function
In our training process for both the real and synthetic track models, we employ the mixed gradient loss (Lu and Chen, 2019) based on $\ell_1$ to assess the model's prediction errors, defined in Equation 7:

$$\mathrm{MixGE}(Y, \hat{Y}) = \ell_1(Y, \hat{Y}) + \lambda\, \ell_1(G(Y), G(\hat{Y})) \qquad (7)$$

where $\ell_1$ stands for the mean absolute error function and $G$ represents the gradient obtained from the Sobel operator. Additionally, $\lambda$ is a weighting factor applied to the gradient differences, effectively penalizing errors in high-frequency content such as edges, in line with the objective of the super-resolution task.
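A minimal sketch of Equation 7 is shown below, assuming the Sobel operator is applied per channel to both the prediction and the ground truth; the default $\lambda$ follows the value reported later in the experiments (0.01), and the small epsilon inside the square root is an implementation detail added for numerical stability.

```python
import torch
import torch.nn.functional as F

def sobel_gradients(img):
    """Per-channel Sobel gradient magnitude of an image batch (B, C, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def mixge_loss(pred, target, lam=0.01):
    """l1(Y, Y_hat) + lambda * l1(G(Y), G(Y_hat)), as in Eq. (7)."""
    l1 = F.l1_loss(pred, target)
    grad_l1 = F.l1_loss(sobel_gradients(pred), sobel_gradients(target))
    return l1 + lam * grad_l1
```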
4 EXPERIMENTS
This section presents the experimental settings, results, discussion, and an ablation study for the proposed burst image super-resolution method.
4.1 Experimental Settings
In this subsection, we describe the datasets used in the
experiments, training settings and evaluation metrics.
4.1.1 Datasets
As in previous works (Bhat et al., 2021a; Bhat et al.,
2021b; Dudhane et al., 2022), our approach is sub-
jected to comprehensive evaluation using both syn-
thetic and real-world datasets, as provided by the au-
thors of (Bhat et al., 2021a). The synthetic dataset
comprises 46,839 RGB images sourced from the Zurich RAW to RGB dataset (Ignatov et al., 2020), captured with a Canon 5D Mark IV DSLR. This dataset is used to generate sets of low-quality RAW burst images through random translations, rotations, and the introduction of additional noise in the RGB-to-RAW inverse camera pipeline (Brooks et al., 2018). This process yields synthetic low-resolution and noisy burst images.
The real-world dataset, known as BurstSR, in-
cludes 5,405 RAW burst patches captured in real-
world conditions using a Samsung Galaxy S8 smart-
phone, each with dimensions of 160×160 pixels. The
corresponding high-resolution images are obtained
from a Canon DSLR camera.
For our evaluation, we used a validation set con-
sisting of 300 synthetically generated images (sized
at 96×96 pixels) and 882 real-world patches (sized at
160×160 pixels).
4.1.2 Training Settings
In our training and testing procedures, we adhere
to established conventions. Initially, our model un-
dergoes training using the synthetic dataset and sub-
sequently undergoes fine-tuning with the real-world
dataset.
Throughout all experiments, we maintain a fixed
scale factor s set to 4. During both synthetic and
real training phases, we employ the MixGE loss func-
tion with parameter λ set to 0.01, as detailed in Sec-
tion 3.6, to optimize the entire model. When train-
ing with real-world data, as the ground truth images
are not initially aligned with the input data, we in-
corporate an aligned MixGE loss that accounts for
both spatial and color disparities between the low-
resolution (LR) input bursts and high-resolution (HR)
ground truth. This alignment process involves ini-
tially matching the ground truth image with the super-
resolved image using a pre-trained PWC-Net (Sun
et al., 2018). This loss function enables the model to
learn the generation of HR images that closely align
with the ground truth, considering both spatial and
color information.
For training on both datasets, we employ the AdamW optimizer with exponential decay rates of 0.9 and 0.999. During synthetic training, we conduct 500 epochs with a batch size of 16, starting with an initial learning rate of $2 \times 10^{-4}$ and halving it at epochs 80, 120, 280, 350, 410, and 460. To facilitate better convergence, we freeze the weights of the pre-trained SpyNet for the first 130 epochs. In each training batch, HR images are cropped to dimensions of 384×384 pixels, and 14 LR burst image patches (96×96) are randomly synthesized based on the HR image.
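For concreteness, the snippet below is a minimal sketch of an optimizer and learning-rate schedule matching the synthetic-training description above (AdamW, initial rate $2 \times 10^{-4}$, halved at the listed epochs); the model is a placeholder and the use of MultiStepLR is an assumption about tooling, not the authors' exact training script.

```python
import torch

def build_optimizer_and_scheduler(model):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                                  betas=(0.9, 0.999))
    # Halve the learning rate at the epochs listed in the text.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[80, 120, 280, 350, 410, 460], gamma=0.5)
    return optimizer, scheduler
```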
For real-world training, we fine-tune the model,
which was originally pre-trained on synthetic data, for
an additional 80 epochs. This fine-tuning process em-
ploys a batch size of 8, beginning with an initial learning rate of $5 \times 10^{-5}$ and then reducing it by half every 15 epochs.
Our implementation of the proposed method is re-
alized using the PyTorch framework and takes advan-
tage of an NVIDIA Tesla A100 GPU, with comple-
tion times of approximately 3 days for synthetic and
1 day for real-world scenarios.
4.1.3 Evaluation Metrics
We adhere to the established evaluation protocols and
datasets utilized in prior studies (Bhat et al., 2021a;
Bhat et al., 2021b; Dudhane et al., 2022) to assess our
approach.
Our evaluation metrics encompass the peak
signal-to-noise ratio (PSNR), structural similarity in-
dex measure (SSIM) (Wang et al., 2004), and learned
perceptual image patch similarity (LPIPS) (Zhang
et al., 2018a). While these metrics can be directly
employed for synthetic datasets, we employ aligned
PSNR, SSIM and LPIPS for assessing our model on
real-world data, as has been the practice in previous
works (Bhat et al., 2021a; Bhat et al., 2021b; Dud-
hane et al., 2022).
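As an illustration of how these metrics could be computed for the synthetic track (the aligned variants used for the real-world track are not shown), the sketch below relies on scikit-image for PSNR/SSIM and the lpips package for LPIPS; these library choices are assumptions about tooling rather than the authors' exact evaluation code.

```python
import lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # LPIPS network; expects tensors in [-1, 1]

def evaluate(pred, gt):
    """pred, gt: float NumPy arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```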
4.2 Results and Discussion
We compare SBFBurst with state-of-the-art burst
super-resolution approaches including HighRes-
Net (Deudon et al., 2020), DBSR (Bhat et al., 2021a),
LKR (Lecouat et al., 2021), MFIR (Bhat et al.,
2021b) and BipNet (Dudhane et al., 2022).
Table 1 presents the quantitative results for both
testing datasets. When comparing SBFBurst to Bip-
Net, our approach surpasses it by 0.26 dB and 0.38 dB in
terms of PSNR on synthetic and real data, respec-
tively, without substantially increasing the number of
parameters. It is noteworthy that these improvements
are nearly equal on both synthetic and real datasets,
indicating the potential for our approach to generalize
effectively when fine-tuned on diverse datasets.
When dealing with a variable number of in-
put frames, our method also exhibits superior per-
formance compared to the previous ones, as visu-
ally demonstrated in Figure 4. Notably, our SBF-
Burst achieves comparable results using only 4 in-
put frames, matching the performance of DBSR (Bhat
et al., 2021a), which requires 14 frames. More-
over, even with just 10 frames as input, our SBFBurst
achieves performance on par with MFIR (Bhat et al.,
2021b).
Figure 4: Comparison in terms of PSNR with other ap-
proaches when facing variable input frame numbers on Syn-
thetic Dataset (Bhat et al., 2021a). It is worth noting that
BipNet encounters a limitation in handling input of variable
sizes during prediction.
Qualitative results on both synthetic and real-world data are presented in Figures 5 and 6. They illustrate that our SBFBurst method excels in recovering highly detailed textures with better color while exhibiting fewer artifacts compared to alternative approaches.
For instance, in the 1st row of Figure 5, our
method excels in producing a clean and denoised sig-
naling cone, preserving all the crucial details. Simi-
larly, in the 4th row of Figure 5, our method shows
superior edge reconstruction for Venetian blinds, un-
derscoring its capability to effectively handle high-
frequency information. In contrast, all other ap-
proaches failed to handle the noisy details.
Furthermore, our method demonstrates its profi-
ciency in restoring additional information from real-
world burst images, as we can see in Figure 6. In the
third row, SBFBurst practically eliminates ghosting
artifacts on letters, a challenge often unaddressed by
other methods. Moreover, there is an enhancement in
color continuity, as evidenced by the letter “r” when
compared to the MFIR method.
4.3 Ablation Study
In our ablation study, we rigorously evaluate the ef-
fectiveness of our alignment module and the guid-
ance provided by the network’s base frame. Table 2
presents a comprehensive summary of our results as
we progressively incorporate the proposed mecha-
nisms, indicated by checkmarks in Table 2.
Our baseline experiments begin with the architecture excluding base frame guidance, denoted as $p_1$ on the encoder and $p_1$ and $\hat{b}_1$ on the decoder, and with the align-
Table 1: Comparison between SBFBurst and the other approaches. The best results are marked in red and the second best in blue. All results are reported for a 4× super-resolution task.

Methods                           #Parameters   Synthetic: PSNR (dB) / SSIM / LPIPS   Real-World: PSNR (dB) / SSIM / LPIPS
HighResNet (Deudon et al., 2020)  34.78M        37.45 / 0.924 / 0.106                 46.64 / 0.980 / 0.038
DBSR (Bhat et al., 2021a)         13.01M        39.17 / 0.946 / 0.081                 47.70 / 0.984 / 0.029
LKR (Lecouat et al., 2021)        -             41.45 / 0.950 / -                     - / - / -
MFIR (Bhat et al., 2021b)         12.13M        41.55 / 0.964 / 0.045                 48.32 / 0.985 / 0.023
BipNet (Dudhane et al., 2022)     6.66M         41.93 / 0.967 / 0.035                 48.49 / 0.985 / 0.026
SBFBurst (ours)                   7.64M         42.19 / 0.968 / 0.036                 48.87 / 0.987 / 0.022
Figure 5: Qualitative results comparing our approach with other methods on the Synthetic Dataset (Bhat et al., 2021a). Columns, left to right: HR image, base frame, DBSR, MFIR, BIPNet, SBFBurst, ground truth.
ment module applied directly to the unshuffled
image, following prior work (Bhat et al., 2021a).
The introduction of the Base Frame Decoder entails the inclusion of $p_1$ and $\hat{b}_1$ in the decoder, while the Base Frame Encoder involves the addition of $p_1$ to the encoder. Lastly, the MCFE Alignment rep-
resents the incorporation of mosaicked convolutional
features on a flattened image, which serves as a strat-
egy to preserve spatial information during the align-
ment process.
As illustrated in Table 2, it becomes evident that
all the relatively straightforward enhancements con-
tribute incrementally to our baseline method. These
improvements play a crucial role in driving our base-
line method to surpass all the state-of-the-art ap-
proaches, as evidenced in Table 1.
Table 2: Ablation experiments assessing the impact of SBFBurst's contributions. We evaluate PSNR performance on the SyntheticBurst dataset for a super-resolution task. Our findings indicate that simple design decisions, such as base frame guidance and preserving spatial information during raw alignment, can lead to significant improvements.
Improvements          AS1    AS2    AS3    AS Final
Baseline              ✓      ✓      ✓      ✓
Base Frame Decoder           ✓      ✓      ✓
Base Frame Encoder                  ✓      ✓
MCFE Alignment                             ✓
PSNR (dB)             41.34  41.58  42.12  42.19
Figure 6: Qualitative results comparing our approach with other methods on the real-world BurstSR dataset (Bhat et al., 2021a). Columns, left to right: HR image, base frame, DBSR, MFIR, BIPNet, SBFBurst, ground truth.
5 CONCLUSIONS
This research presents a significant advancement in addressing the burst image super-resolution problem, introducing a straightforward yet highly effective deep-learning method. By focusing on RAW images and
leveraging base frame guidance, deformable convolu-
tion, SpyNet, and mixed gradient loss, this work has
made several key contributions to the field. Impor-
tantly, the proposed method demonstrates remarkable
performance versatility by accommodating an arbi-
trary number of frames without sacrificing accuracy.
Through comprehensive experiments on synthetic
and real-world datasets, this approach not only out-
performs existing state-of-the-art methods in both
quantitative and qualitative assessments but also
demonstrates its efficiency in terms of inference
speed. Therefore, this research opens up new pos-
sibilities for improving the quality of high-resolution
image reconstruction, with broad implications for var-
ious applications in image processing, especially on
portable cameras.
While the contributions of this work are indeed
promising, there remain several directions that could
be explored to further enhance burst image super-
resolution techniques. Firstly, scenes featuring fast-moving objects remain a significant hurdle in this field. More-
over, it would be desirable to explore ways to reduce
computational complexity to make it more suitable
for real-time applications on resource-constrained de-
vices. Furthermore, developing new evaluation meth-
ods tailored to real-world scenarios is crucial for a
better assessment of the methods, as aligned versions
of traditional metrics such as PSNR, SSIM, or LPIPS
might be biased for methods that employ aligned loss
on training. Overall, these future efforts aim to push
the boundaries of burst image super-resolution and
unlock even greater potential for its application in var-
ious domains.
ACKNOWLEDGEMENTS
The authors would like to thank Eldorado Re-
search Institute, Brazilian National Council for
Scientific and Technological Development (CNPq
#304836/2022-2) for their support.
REFERENCES
Ahn, N., Kang, B., and Sohn, K.-A. (2018). Fast, Ac-
curate, and Lightweight Super-Resolution with Cas-
cading Residual Network. In 15th European Con-
ference on Computer Vision, page 256–272, Munich,
Germany. Springer-Verlag.
Bhat, G., Danelljan, M., Van Gool, L., and Timofte, R.
(2021a). Deep Burst Super-Resolution. In IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 9209–9218.
Bhat, G., Danelljan, M., Yu, F., Gool, L. V., and Timofte,
R. (2021b). Deep Reparametrization of Multi-Frame
Super-Resolution and Denoising. In IEEE/CVF In-
ternational Conference on Computer Vision, pages
2440–2450, Los Alamitos, CA, USA. IEEE Computer
Society.
Brooks, T., Mildenhall, B., Xue, T., Chen, J., Sharlet,
D., and Barron, J. T. (2018). Unprocessing Images
for Learned Raw Denoising. IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
11028–11037.
Chan, K. K., Zhou, S., Xu, X., and Loy, C. (2022). Ba-
sicVSR++: Improving Video Super-Resolution with
Enhanced Propagation and Alignment. In IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, volume 1, pages 5962–5971, Los Alamitos, CA,
USA. IEEE Computer Society.
Cilia, M., Valsesia, D., Fracastoro, G., and Magli, E. (2023).
Multi-Level Fusion for Burst Super-Resolution with
Deep Permutation-Invariant Conditioning. In IEEE
International Conference on Acoustics, Speech and
Signal Processing, volume 1, pages 1–5.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and
Wei, Y. (2017). Deformable Convolutional Networks.
In IEEE International Conference on Computer Vision.
Dai, T., Cai, J., Zhang, Y., Xia, S.-T., and Zhang, L. (2019).
Second-Order Attention Network for Single Image
Super-Resolution. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, volume 1, pages
11057–11066.
Deudon, M., Kalaitzis, A., Goytom, I., Arefin, M. R.,
Lin, Z., Sankaran, K., Michalski, V., Kahou, S. E.,
Cornebise, J., and Bengio, Y. (2020). HighRes-Net:
Recursive Fusion for Multi-Frame Super-Resolution
of Satellite Imagery.
Dong, C., Loy, C. C., He, K., and Tang, X. (2015). Im-
age Super-Resolution Using Deep Convolutional Net-
works.
Dudhane, A., Zamir, S., Khan, S., Khan, F., and Yang, M.
(2022). Burst Image Restoration and Enhancement. In
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, volume 1, pages 5749–5758, Los
Alamitos, CA, USA. IEEE Computer Society.
Dudhane, A., Zamir, S., Khan, S., Khan, F., and Yang, M.
(2023). Burstormer: Burst Image Restoration and En-
hancement Transformer. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition, volume 1,
pages 5703–5712, Los Alamitos, CA, USA. IEEE
Computer Society.
Hardie, R. (2008). A Fast Image Super-Resolution Algo-
rithm Using an Adaptive Wiener Filter. IEEE Trans-
actions on Image Processing, 16:2953–64.
Haris, M., Shakhnarovich, G., and Ukita, N. (2018). Deep
Back-Projection Networks for Super-Resolution. In
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, volume 1, pages 1664–1673, Los
Alamitos, CA, USA. IEEE Computer Society.
Ignatov, A., Gool, L. V., and Timofte, R. (2020). Replac-
ing Mobile Camera ISP with a Single Deep Learning
Model. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops, volume 1, pages
2275–2285, Los Alamitos, CA, USA. IEEE Computer
Society.
Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Percep-
tual Losses for Real-Time Style Transfer and Super-
Resolution. In Leibe, B., Matas, J., Sebe, N., and
Welling, M., editors, European Conference on Com-
puter Vision, pages 694–711, Cham. Springer Interna-
tional Publishing.
Kim, J., Lee, J. K., and Lee, K. M. (2016a). Accurate Im-
age Super-Resolution Using Very Deep Convolutional
Networks. In IEEE Conference on Computer Vision
and Pattern Recognition, volume 1, pages 1646–1654.
Kim, J., Lee, J. K., and Lee, K. M. (2016b). Deeply-
Recursive Convolutional Network for Image Super-
Resolution. In IEEE Conference on Computer Vision
and Pattern Recognition, volume 1, pages 1637–1645.
Lafenetre, J., Facciolo, G., and Eboli, T. (2023). Imple-
menting Handheld Burst Super-Resolution. Image
Processing On Line, 13:227–257.
Lai, W.-S., Huang, J.-B., Ahuja, N., and Yang, M.-H.
(2017). Deep Laplacian Pyramid Networks for Fast
and Accurate Super-Resolution. In IEEE Conference
on Computer Vision and Pattern Recognition, vol-
ume 1, pages 5835–5843.
Lecouat, B., Ponce, J., and Mairal, J. (2021). Lucas-Kanade
Reloaded: End-to-End Super-Resolution from Raw
Image Bursts. In IEEE/CVF International Conference
on Computer Vision, volume 1, pages 2350–2359, Los
Alamitos, CA, USA. IEEE Computer Society.
Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham,
A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang,
Z., and Shi, W. (2017). Photo-Realistic Single Im-
age Super-Resolution Using a Generative Adversarial
Network. In IEEE Conference on Computer Vision
and Pattern Recognition, volume 1, pages 105–114,
Los Alamitos, CA, USA. IEEE Computer Society.
Lim, B., Son, S., Kim, H., Nah, S., and Lee, K. M. (2017).
Enhanced Deep Residual Networks for Single Image
Super-Resolution. In IEEE Conference on Computer
Vision and Pattern Recognition Workshops, volume 1,
pages 1132–1140.
Lu, Z. and Chen, Y. (2019). Single Image Super Resolution
based on a Modified U-Net with Mixed Gradient Loss.
CoRR, abs/1911.09428.
Lu, Z., Li, J., Liu, H., Huang, C., Zhang, L., and Zeng,
T. (2022). Transformer for Single Image Super-
Resolution. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops, volume 1,
pages 456–465, Los Alamitos, CA, USA. IEEE Com-
puter Society.
Lugmayr, A., Danelljan, M., Van Gool, L., and Timofte,
R. (2020). SRFlow: Learning the Super-Resolution
Space with Normalizing Flow. In Vedaldi, A.,
Bischof, H., Brox, T., and Frahm, J.-M., editors, Euro-
pean Conference on Computer Vision, pages 715–732,
Cham. Springer International Publishing.
Mehta, N., Dudhane, A., Murala, S., Zamir, S., Khan, S.,
and Khan, F. (2023). Gated Multi-Resolution Transfer
Network for Burst Restoration and Enhancement. In
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, volume 1, pages 22201–22210, Los
Alamitos, CA, USA. IEEE Computer Society.
Ranjan, A. and Black, M. J. (2017). Optical Flow Estima-
tion Using a Spatial Pyramid Network. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), volume 1, pages 2720–2729.
Shi, W., Caballero, J., Huszar, F., Totz, J., Aitken, A. P.,
Bishop, R., Rueckert, D., and Wang, Z. (2016a). Real-
Time Single Image and Video Super-Resolution Using
an Efficient Sub-Pixel Convolutional Neural Network.
In IEEE Conference on Computer Vision and Pattern
Recognition, volume 1, pages 1874–1883, Los Alami-
tos, CA, USA. IEEE Computer Society.
Shi, W., Caballero, J., Huszar, F., Totz, J., Aitken, A. P.,
Bishop, R., Rueckert, D., and Wang, Z. (2016b). Real-
Time Single Image and Video Super-Resolution Using
an Efficient Sub-Pixel Convolutional Neural Network.
In IEEE Conference on Computer Vision and Pattern
Recognition, volume 1, pages 1874–1883, Los Alami-
tos, CA, USA. IEEE Computer Society.
Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2018). PWC-
Net: CNNs for Optical Flow using Pyramid, Warping,
and Cost Volume. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 8934–8943.
Tsai, R. Y. and Huang, T. S. (1984). Multiframe Image
Restoration and Registration. Advances in Computer
Vision and Image Processing, 1:317–339.
Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y.,
and Loy, C. C. (2019). ESRGAN: Enhanced Super-
Resolution Generative Adversarial Networks. In Leal-
Taixé, L. and Roth, S., editors, European Conference
on Computer Vision, pages 63–79, Cham. Springer In-
ternational Publishing.
Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. (2004).
Image Quality Assessment: From Error Visibility to
Structural Similarity. IEEE Transactions on Image
Processing, 13(4):600–612.
Wang, Z., Liu, D., Yang, J., Han, W., and Huang, T.
(2015). Deep Networks for Image Super-Resolution
with Sparse Prior. In IEEE International Conference
on Computer Vision, volume 1, pages 370–378.
Wronski, B., Garcia-Dorado, I., Ernst, M., Kelly, D.,
Krainin, M., Liang, C.-K., Levoy, M., and Milanfar,
P. (2019). Handheld Multi-Frame Super-Resolution.
ACM Transactions on Graphics, 38(4).
Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S.,
Yang, M.-H., and Shao, L. (2020). Learning En-
riched Features for Real Image Restoration and En-
hancement. In Vedaldi, A., Bischof, H., Brox, T., and
Frahm, J.-M., editors, European Conference on Com-
puter Vision, pages 492–511, Cham. Springer Interna-
tional Publishing.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018a). The Unreasonable Effectiveness of Deep
Features as a Perceptual Metric. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), volume 1, pages 586–595, Los Alamitos,
CA, USA. IEEE Computer Society.
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y.
(2018b). Image Super-Resolution Using Very Deep
Residual Channel Attention Networks. In Ferrari, V.,
Hebert, M., Sminchisescu, C., and Weiss, Y., edi-
tors, European Conference on Computer Vision, pages
294–310, Cham. Springer International Publishing.
Zhang, Y., Tian, Y., Kong, Y., Zhong, B., and Fu, Y.
(2018c). Residual Dense Network for Image Super-
Resolution. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 2472–2481.