Re-Learning ShiftIR for Super-Resolution of
Carbon Nanotube Images
Yoshiki Kakamu (https://orcid.org/0000-0001-6698-3634), Takahiro Maruyama (https://orcid.org/0000-0003-4559-348X) and Kazuhiro Hotta (https://orcid.org/0000-0002-5675-8713)
Meijo University, 1-501 Shiogamaguchi, Tempaku-ku, Nagoya 468-8502, Japan
Keywords: Super-Resolution, Carbon Nanotube, Shift, SwinIR, Re-Learning.
Abstract: In this study, we perform super-resolution of carbon nanotube images using deep learning. To achieve higher accuracy than the conventional SwinIR, we introduce an encoder-decoder structure that allows larger input images and a Shift mechanism for local feature extraction. In addition, we propose a super-resolution method based on re-training that achieves high accuracy even with a small number of images. Experiments were conducted on the DIV2K, General100, and Set5 datasets and on a carbon nanotube image dataset. The results confirmed that the proposed method is more accurate than the conventional SwinIR and can super-resolve carbon nanotube images. The main contribution is a network model that performs well for super-resolution of carbon nanotube images even when no crisp supervised images are available; the proposed method is well suited to such images. Its effectiveness was demonstrated by experiments on general super-resolution datasets and a carbon nanotube image dataset.
1 INTRODUCTION
Deep learning technology for image processing is currently applied not only to the IT field but also to various other fields such as civil engineering and materials engineering. Within materials engineering, this study focuses on carbon nanotubes (CNTs). CNTs are nanometer-sized tubular materials composed of carbon and are used in a wide variety of applications, such as fuel cells and medical materials. Blurred images are a frequent problem in transmission electron microscope (TEM) observation, the main method for CNT characterization. One solution to this problem is to use deep learning to obtain super-resolved images.
In this study, we use the super-resolution method SwinIR as a baseline. This model is based on the Swin Transformer and has achieved the highest accuracy in many super-resolution tasks. However, the pre-trained SwinIR was trained on datasets of plants, animals, and buildings, not on material images. Therefore, pre-trained models may fail to super-resolve material images. Although it is desirable to train on a large number of clear CNT
images, it is difficult to collect such data. Therefore, we propose a re-training method in which fine-tuning is performed using images that have already been successfully super-resolved by ShiftIR as teacher images. This method can incorporate CNT-specific features while utilizing generic knowledge from pre-trained models.
To achieve higher accuracy than SwinIR, we incorporate an encoder-decoder structure that enables larger input images. We also incorporate the Shift mechanism of ShiftViT, which performs better than the Swin Transformer.
To evaluate the proposed method, we used the DIV2K, General100, and Set5 datasets, which are commonly used in super-resolution experiments, as well as the CNT image dataset that is the subject of this study. In our experiments, we first evaluated the super-resolution performance of the proposed method on the DIV2K, General100, and Set5 datasets. The results show that the proposed method achieves higher PSNR than the conventional SwinIR on these general datasets. Next, we conducted experiments on the CNT image dataset to
evaluate the effectiveness of the proposed method on CNT images and the usefulness of re-learning based super-resolution. As a result, we confirmed that the proposed method achieves higher PSNR than the conventional SwinIR on the CNT image dataset. In addition, we conducted an ablation study that removes the encoder-decoder structure and the Shift mechanism from the proposed method, and the results show that both components contribute to its effectiveness.

Figure 1: Overview of SwinIR.
The structure of this paper is as follows. Section 2
describes related works. Section 3 describes the
proposed method. Section 4 presents experimental
results. Section 5 concludes and discusses future work.
2 RELATED WORKS
We discuss the works related to this study: SwinIR in Section 2.1 and ShiftViT in Section 2.2.
2.1 SwinIR
SwinIR is a super-resolution method based on the Swin Transformer. Figure 1 shows the overview of SwinIR, which consists of three parts: a shallow feature extraction part, a deep feature extraction part, and a high-quality image reconstruction part. The shallow feature extraction part extracts shallow features from an input image using convolution layers. The deep feature extraction part consists of Residual Swin Transformer Blocks (RSTBs), each of which stacks multiple Swin Transformer Layers (STLs); a convolutional layer with a residual connection is added at the end of each block. The high-quality image reconstruction part upsamples the sum of the shallow and deep features into a quadruple-sized image using convolution layers and nearest-neighbor interpolation.
These mechanisms make SwinIR a high-quality super-resolution method by taking advantage of the Swin Transformer.

Figure 2: Shift Mechanism.

Conventional SwinIR consumes a lot of memory and cannot handle large input images. As a result, only small images can be used as input, and only a small amount of the information in the input image can be utilized. To solve this problem, this
paper introduces an encoder-decoder structure that
enables the input of large images without increasing
memory usage.
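To make this three-part structure concrete, the following is a minimal PyTorch sketch of the SwinIR pipeline, not the official implementation; the RSTBPlaceholder below stands in for the actual Swin Transformer Layers, and the channel width, block count, and activation are assumptions.

    import torch
    import torch.nn as nn

    class RSTBPlaceholder(nn.Module):
        # Stand-in for a Residual Swin Transformer Block: several STLs
        # followed by a convolution, wrapped in a residual connection.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Conv2d(channels, channels, 3, padding=1)  # STLs omitted

        def forward(self, x):
            return x + self.body(x)

    class SwinIRSketch(nn.Module):
        # Three parts: shallow feature extraction, deep feature extraction,
        # and high-quality image reconstruction (x4 upscaling).
        def __init__(self, channels=60, num_blocks=6, scale=4):
            super().__init__()
            self.shallow = nn.Conv2d(3, channels, 3, padding=1)
            self.deep = nn.Sequential(*[RSTBPlaceholder(channels)
                                        for _ in range(num_blocks)])
            self.reconstruct = nn.Sequential(
                nn.Upsample(scale_factor=scale, mode="nearest"),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.LeakyReLU(inplace=True),
                nn.Conv2d(channels, 3, 3, padding=1),
            )

        def forward(self, x):
            shallow = self.shallow(x)
            # The sum of shallow and deep features is reconstructed.
            return self.reconstruct(self.deep(shallow) + shallow)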
2.2 ShiftViT
ShiftViT is a variant of the Vision Transformer that replaces the Attention of the Swin Transformer with a Shift mechanism, an operation that swaps some channels between adjacent features. It has the advantage of requiring no arithmetic operations and no parameters. Figure 2 shows the Shift mechanism in ShiftViT. The mechanism first divides a portion of the input channels into four equal parts. Each part is then shifted by one pixel to the left, right, top, or bottom.
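As an illustration, the following is a minimal PyTorch sketch of this shift operation. The fraction of channels that is shifted (ratio) and the zero-filling of vacated border pixels are assumptions; the text above specifies only the four shift directions.

    import torch

    def shift_features(x: torch.Tensor, ratio: float = 1 / 3) -> torch.Tensor:
        # Split a portion of the channels into four equal groups and shift
        # each group by one pixel; vacated pixels are zero-filled (assumption).
        b, c, h, w = x.shape
        g = int(c * ratio) // 4  # channels per shift direction
        out = x.clone()          # remaining channels pass through unchanged
        out[:, 0*g:1*g, :, 1:] = x[:, 0*g:1*g, :, :-1]  # shift right
        out[:, 0*g:1*g, :, :1] = 0
        out[:, 1*g:2*g, :, :-1] = x[:, 1*g:2*g, :, 1:]  # shift left
        out[:, 1*g:2*g, :, -1:] = 0
        out[:, 2*g:3*g, 1:, :] = x[:, 2*g:3*g, :-1, :]  # shift down
        out[:, 2*g:3*g, :1, :] = 0
        out[:, 3*g:4*g, :-1, :] = x[:, 3*g:4*g, 1:, :]  # shift up
        out[:, 3*g:4*g, -1:, :] = 0
        return out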
This operation is a substitute for the Attention in the Swin Transformer. As a result, ShiftViT achieved better performance than the Swin Transformer on tasks such as image classification, object detection, and
segmentation. By replacing the Swin Transformer
Layer (STL) in SwinIR with the ShiftViT Layer
(SVL), this study enables super-resolution with high
accuracy and low memory consumption.
3 PROPOSED METHOD
In this section, we describe the proposed method. Section 3.1 describes the re-learning based super-resolution method, and Section 3.2 describes the structure of the proposed ShiftIR.
3.1 Re-Learning Based
Super-Resolution
In this study, we propose a re-training based super-resolution method. It is based on fine-tuning, using images that have already been successfully super-resolved by a pre-trained model as teacher images. The process of the re-learning based super-resolution method is shown in Figure 4.

Figure 3: Structural diagram of ShiftIR.

Figure 4: Re-learning based super-resolution process.

Figure 5: Images in datasets.

Table 1: Results on general super-resolution datasets.
The CNT images available for this study are blurred. To conduct experiments under these harsh conditions, we apply a ShiftIR model pre-trained on DIV2K to the blurred CNT images. The resulting images are better than the originals, so the outputs are used as pseudo teacher images. However, since the CNT images are very different from the dataset used for pre-training ShiftIR, some images fail to be super-resolved; fine-tuning of ShiftIR is therefore required. This allows us to super-resolve CNT images. Clear CNT images are difficult to collect, so it is difficult to build a sufficient training dataset. In addition, the datasets generally used for super-resolution consist of images of plants, animals, and buildings, which are not similar to CNT images, so a pre-trained model alone may fail to super-resolve material images. By using the images that have been successfully super-resolved by ShiftIR as teacher images, we can create pseudo-clear teachers and obtain high-quality super-resolution. The method can also incorporate CNT-specific features while utilizing the generic knowledge of pre-trained models.
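The following is a minimal sketch of this re-learning procedure, assuming an L1 reconstruction loss and a generic downscale function (e.g., bicubic x1/4) to form training pairs; neither detail is specified above.

    import torch
    import torch.nn.functional as F

    def relearn(model, blurred_images, downscale, optimizer, steps=50_000):
        # Step 1: the model, pre-trained on DIV2K, super-resolves the blurred
        # CNT images; its outputs become pseudo teacher images.
        model.eval()
        with torch.no_grad():
            teachers = [model(img) for img in blurred_images]

        # Step 2: fine-tune the model to map a downscaled pseudo teacher
        # back to the pseudo teacher itself.
        model.train()
        for step in range(steps):
            hr = teachers[step % len(teachers)]
            lr = downscale(hr)               # e.g., bicubic x1/4 (assumption)
            loss = F.l1_loss(model(lr), hr)  # L1 loss is an assumption
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()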
3.2 ShiftIR
Figure 3 shows the overview of the proposed ShiftIR. In the conventional method, shallow feature extraction is performed by a single convolution layer; a single layer has limited feature extraction capability and cannot make good use of the information in the input image. We therefore introduce an encoder-decoder structure into the shallow feature extraction part and the high-quality image reconstruction part, which allows the input image to be four times larger than in the conventional method. Whereas the conventional reconstruction part consists only of two convolution layers and nearest-neighbor upsampling, the proposed method adds, between each layer, a residual connection and a convolution layer that carries information from the shallow feature extraction part. This allows us to take advantage of the features in large images without increasing memory consumption.
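The following is a minimal PyTorch sketch of this encoder-decoder idea. The number of scales, the use of strided and transposed convolutions, and the exact placement of the skip connections are assumptions based on the description above.

    import torch.nn as nn

    class EncoderDecoderSketch(nn.Module):
        # The encoder downsamples a large input before deep feature extraction,
        # so memory use stays close to the original; the decoder upsamples
        # again, with residual/skip connections carrying shallow features.
        def __init__(self, channels, body):
            super().__init__()
            self.enc1 = nn.Conv2d(3, channels, 3, stride=2, padding=1)         # 1/2
            self.enc2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # 1/4
            self.body = body  # deep feature extraction with Shift blocks
            self.dec2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
            self.dec1 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

        def forward(self, x):
            e1 = self.enc1(x)                    # 1/2-scale shallow features
            e2 = self.enc2(e1)                   # 1/4-scale shallow features
            d2 = self.dec2(self.body(e2) + e2)   # residual connection at 1/4 scale
            return self.dec1(d2 + e1)            # skip from the 1/2-scale encoder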
The conventional Swin Transformer uses self-attention within a window. However, attention between distant points in a window may not be useful for super-resolution; we consider local relationships to be more important. For example, when super-resolving the petals in the left image in Figure 5, what matters is the shape of the surrounding petals and the border with the center of the flower, whereas the relationship with the background and leaves is not so important. Thus,
local relationships are more important for super-resolution. Local relationships are also important in the CNT image shown in Figure 5, because separating CNTs from the background is essential and CNTs are formed in a series. We therefore use a Shift layer, which excels at local processing, to achieve more accurate super-resolution. Figure 3 shows the proposed ShiftIR, in which local shifts are applied to the feature maps.

Figure 6: Super-resolution results by ShiftIR.
In summary, the encoder-decoder structure enhances the shallow feature extraction part by taking a large input image and effectively utilizing its information, while the Shift mechanism extracts local features and enhances the deep feature extraction part.
Table 2: Results on the CNT image dataset.
4 EXPERIMENTS
In this section, we present the experimental results.
Section 4.1 describes the carbon nanotube images.
Section 4.2 presents the experimental results on
general super-resolution datasets. Section 4.3 shows the experimental results on the CNT image dataset. Section 4.4 presents an ablation study.
4.1 Carbon Nanotube (CNT) Image
Carbon nanotubes (CNTs) are nanometer-sized tubular materials composed of carbon and are used in a wide variety of applications, including fuel cells and medical materials. However, transmission electron microscopy, the primary method for evaluating CNTs, often produces blurred images. Our solution to this problem is to use ShiftIR to obtain super-resolved images of CNTs. In addition, the CNT images used in this study are unclear and therefore not suitable for use as teacher images, so the proposed re-training based super-resolution method was used to create pseudo teacher images for training.
4.2 Results on General
Super-Resolution Datasets
The experimental results on the general super-resolution datasets (DIV2K, General100, Set5) are shown in Figure 6. In these experiments, we trained each method for 50,000 epochs. Figure 6 shows that the overall quality of the images produced by ShiftIR is improved.
The Loss and PSNR values of SwinIR and ShiftIR on the general super-resolution datasets are shown in Table 1. Both the proposed ShiftIR and the conventional SwinIR reach 40 dB or higher; since 30 dB or higher is generally considered high image quality, all results can be regarded as high quality. In addition, the PSNR of the proposed method is approximately 3.5 dB higher on the DIV2K dataset, 4.3 dB higher on the General100 dataset, and 14.5 dB higher on the Set5 dataset than that of the conventional method.
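For reference, PSNR is defined as 10 log10(MAX^2 / MSE) in dB, where MAX is the maximum pixel value and MSE is the mean squared error between the super-resolved image and the reference. A minimal implementation:

    import torch

    def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
        # PSNR in dB between a super-resolved image and its reference.
        mse = torch.mean((sr - hr) ** 2)
        return (10 * torch.log10(max_val ** 2 / mse)).item()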
From these results, we see that the proposed method achieves more accurate super-resolution. This is because the encoder-decoder structure of ShiftIR extracts the overall features of the larger input image and utilizes them for image reconstruction, thereby enhancing the shallow feature extraction and image reconstruction parts. The deep feature extraction part is also enhanced by replacing the Swin Transformer with the Shift mechanism.
Figure 7: Super-resolution of CNT images by ShiftIR.
Table 3: Ablation Studies.
Figure 8: PSNR for each method on CNT dataset.
4.3 Results on CNT Image Dataset
The usefulness of the proposed method was confirmed by the experiments on the general super-resolution datasets in Section 4.2. We therefore apply the proposed method to CNT images, the main subject of this study, together with the re-learning based super-resolution method, in order to confirm its usefulness. The experimental results on the CNT image dataset are shown in Figure 7, and the Loss and PSNR values of SwinIR and ShiftIR are shown in Table 2. We trained each method for 50,000 epochs. Figure 7 shows that the image produced by ShiftIR has higher quality and that the background noise disappears. In addition, as in Section 4.2, both ShiftIR and SwinIR exceed 40 dB in Table 2, and the PSNR of the proposed method is about 16.7 dB higher than that of the conventional SwinIR.
From these results, we see that the proposed method achieves higher-quality super-resolution, indicating that it is more useful than conventional methods on CNT image datasets. Moreover, the improvement on the CNT image dataset is larger than on the general super-resolution datasets even though no clear teacher images exist, which indicates that the re-training based super-resolution method is useful.
4.4 Ablation Study
An ablation study of the proposed ShiftIR was performed to verify which mechanism is effective, using the CNT image dataset. In each experiment, one of the two mechanisms in the proposed method, the encoder-decoder structure or the Shift mechanism, was removed. The experimental results are shown in Table 3, and the PSNR curves of each method are shown in Figure 8. We again trained each method for 50,000 epochs. Table 3 and Figure 8 show that removing either component yields results that are better than the conventional method but less accurate than the full proposed method.
The PSNR values for the encoder-decoder structure alone and for the Shift mechanism alone are nearly the same. These results indicate that both the encoder-decoder structure and the Shift mechanism are important for ShiftIR, and these experiments confirm the effectiveness of both.
The encoder-decoder structure improved accuracy because it can take advantage of the features of large images. When the height and width of a square image are halved, the number of pixels is reduced to one-fourth; the image information is likewise reduced to one-fourth and, conversely, increases by a factor of four when both sides are doubled. Therefore, an encoder-decoder structure that can handle input images four times larger than conventional methods is an important element of ShiftIR.
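Written out, the pixel-count relation above is (H/2) × (W/2) = HW/4 when both sides are halved, and (2H) × (2W) = 4HW when both sides are doubled; quadrupling the input area thus corresponds to doubling each side.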
The improved accuracy with the Shift mechanism can be attributed to better local feature extraction. Local relationships are more important than relationships between distant locations in an image, because the features around an object are what matter for super-resolution. Local relationships are especially important given the characteristics of CNTs, so the Shift mechanism, with its strong local feature extraction capability, is quite effective. The Shift mechanism of ShiftViT also has fewer parameters than the Attention mechanism of the Swin Transformer and can build deeper models within a limited computational budget, making it superior for spatial feature extraction. Therefore, the Shift mechanism is an important element of ShiftIR because it can extract more features than conventional methods. Based on these factors, we believe that ShiftIR achieves higher-quality super-resolution than conventional methods through the synergy of the encoder-decoder structure and the Shift mechanism.
5 CONCLUSION
In this paper, we proposed ShiftIR, which introduces an encoder-decoder structure and the Shift mechanism into the conventional SwinIR for super-resolution of carbon nanotube images. In addition, we proposed a re-training based learning method that performs super-resolution with high accuracy even with a small number of low-quality images. Experimental results show that ShiftIR performs super-resolution with a maximum accuracy of 59 dB on both general super-resolution datasets and the CNT image dataset, which is higher than the accuracy of conventional methods. The ablation study shows that both the encoder-decoder structure and the Shift mechanism are effective for super-resolution. Beyond CNTs, other fields of materials engineering, such as catalyst imaging, also require super-resolution, and we would like to develop models for those fields.
REFERENCES
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Polosukhin, I., "Attention is all you need.", International Conference on Neural Information Processing Systems, pp. 6000-6010, 2017.

Dong, C., Loy, C. C., He, K., Tang, X., "Learning a deep convolutional network for image super-resolution.", European Conference on Computer Vision, pp. 184-199, 2014.

Yu, C., Xiao, B., Gao, C., Yuan, L., Sang, N., Wang, J., "Lite-HRNet: A lightweight high-resolution network.", IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10440-10450, 2021.

Chun, P. J., Shota, I., Tatsuro, Y., "Automatic detection method of cracks from concrete surface imagery using two-step light gradient boosting machine.", Computer-Aided Civil and Infrastructure Engineering, pp. 61-72, 2021.

Dung, C. V., "Autonomous concrete crack detection using deep fully convolutional neural network.", Automation in Construction, pp. 52-58, 2019.

Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A., Kim, C., "Machine learning in materials informatics: recent applications and prospects.", NPJ Computational Materials, pp. 1-13, 2017.

Kim, C., Chandrasekaran, A., Huan, T. D., Das, D., Ramprasad, R., "Polymer genome: a data-powered polymer informatics platform for property predictions.", The Journal of Physical Chemistry C, pp. 17575-17585, 2018.

Sumio, I., "Carbon Nanotube", Electron Microscope 34-2, pp. 103-105, 1999.

Takahiro, K., Junji, N., "Specificity of electrocatalysts using carbon nanotubes as supports.", Surface Science 32-11, pp. 704-709, 2011.

Heister, E., Brunner, E. W., Dieckmann, G. R., Jurewicz, I., Dalton, A. B., "Are carbon nanotubes a natural solution? Applications in biology and medicine.", ACS Applied Materials & Interfaces, pp. 1870-1891, 2013.

Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R., "SwinIR: Image restoration using Swin Transformer", IEEE International Conference on Computer Vision, pp. 1833-1844, 2021.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., "Swin Transformer: Hierarchical vision transformer using shifted windows", IEEE International Conference on Computer Vision, pp. 10012-10022, 2021.

Mao, X., Shen, C., Yang, Y. B., "Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections.", Advances in Neural Information Processing Systems, pp. 2810-2818, 2016.

Wang, G., Zhao, Y., Tang, C., Luo, C., Zeng, W., "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism.", arXiv preprint arXiv:2201.10801, 2022.

Agustsson, E., Timofte, R., "NTIRE 2017 challenge on single image super-resolution: Dataset and study.", IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126-135, 2017.

Dong, C., Loy, C. C., Tang, X., "Accelerating the super-resolution convolutional neural network.", European Conference on Computer Vision, pp. 391-407, 2016.

Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M. L., "Low-complexity single-image super-resolution based on nonnegative neighbor embedding.", British Machine Vision Conference, pp. 135.1-135.10, 2012.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Houlsby, N., "An image is worth 16x16 words: Transformers for image recognition at scale.", International Conference on Learning Representations, 2021.

Kobako, M., "Electronic Documents Image Compression Guidelines", IM Monthly Vol. 50 No. 5, pp. 21-24, 2011.