Defect Detection and Classiﬁcation of Cultural Heritage Buildings Using

Deep Learning

Srujan Gokak, Prem Khichade, Nikhil Heggalagi, Aditya Billowria and Shashank Hegde

School of Computer Science and Engineering, KLE Technological University, Hubballi, India

Keywords:

Deep Learning, ResNet, Grad-CAM, Image Classiﬁcation, Cultural Heritage Conservation, Defect

Localization, Computer Vision.

Abstract:

This paper discusses the application of deep learning models, speciﬁcally MobileNetV3 and Grad-CAM, in

the detection and classiﬁcation of defects in cultural heritage buildings. A dataset of images of heritage sites

was used, and the MobileNetV3-based model resulted in a classiﬁcation accuracy of 91.5% on the test dataset,

effectively identifying sites with structural defects that may require conservation efforts. Grad-CAM visual-

izations were used to produce heatmaps that highlighted critical regions inﬂuencing the model’s predictions,

enhancing interpretability and trust in AI-driven assessments. The training process included data augmen-

tation, learning rate scheduling, and model pruning, reducing the model size by 20% without affecting per-

formance. This lightweight and efﬁcient framework demonstrates the potential of integrating advanced deep

learning models with explainable AI techniques to improve accuracy in defect localization and classiﬁcation

and support preservation initiatives for cultural heritage sites.

1 INTRODUCTION

Cultural Heritage Buildings (CHBs) (Authors afﬁli-

ated with the College of Arts et al.(2023)Authors afﬁl-

iated with the College of Arts, of Training, Research-

Asia, and (Shanghai)) are invaluable artifacts of hu-

man history, representing the architectural, artis-

tic, and cultural achievements of past civilizations

(Howard et al.(2019)Howard, Sandler, Chu, Chen,

Chen, Tan, Wang, Zhu, Pang, Vasudevan, et al.).

Their preservation is crucial in safeguarding human-

ity’s shared identity and heritage (He et al.(2016)He,

Zhang, Ren, and Sun). However, CHBs face numer-

ous threats, including environmental degradation, nat-

ural disasters, pollution, and urbanization, which may

lead to gradual deterioration or severe structural dam-

age (Bahrami and Albadvi(2023)). In the absence of

timely intervention, these factors may contribute to

irreversible loss, highlighting the need for effective

conservation strategies (Bahrami and Albadvi(2023)).

Traditional conservation methods primarily

rely on manual inspections conducted by experts,

which are often labor-intensive, time-consuming,

and prone to subjective interpretation (Monna

et al.(2021))(Authors afﬁliated with the College of

Arts et al.(2023)Authors afﬁliated with the College of

Arts, of Training, Research-Asia, and (Shanghai)).

Direct access to certain architectural elements of

CHBs is challenging, emphasizing the demand for

automated, accurate, and scalable approaches to

defect detection and structural health monitoring

(Sandler et al.(2019)Sandler, Howard, and Chu). In

recent years, advancements in deep learning (DL) and

computer vision have offered innovative solutions for

automatic image-based analysis, defect localization,

and classiﬁcation in heritage conservation (Bahrami

and Albadvi(2023)).

Convolutional Neural Networks (CNNs) (afﬁli-

ated with the International Conference on Multidis-

ciplinary Studies(2022)), a specialized form of deep

learning architecture, have demonstrated signiﬁcant

success in tasks such as object detection, image

classiﬁcation, and segmentation (Bahrami and Al-

badvi(2023)). Therefore, CNNs (Bahrami and Al-

badvi(2023)) are highly suitable for the complex vi-

sual analysis required in heritage building conserva-

tion (P

erez et al.(2019)P

erez, Tah, and Mosavi). This

study investigates state-of-the-art DL methods for

classiﬁcation and defect localization of CHBs based

on CNN architectures. We employ a lightweight and

efﬁcient CNN model called MobileNet, integrated

with Grad-CAM, to enhance visual interpretability

and provide better insights into the model’s decisions

erez et al.(2019)P

erez, Tah, and Mosavi)(afﬁliated

Gokak, S., Khichade, P., Heggalagi, N., Billowria, A. and Hegde, S.

Defect Detection and Classiﬁcation of Cultural Heritage Buildings Using Deep Learning.

DOI: 10.5220/0013633700004664

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 3rd International Conference on Futuristic Technology (INCOFT 2025) - Volume 3, pages 619-626

ISBN: 978-989-758-763-4

619

with the Heritage Science journal(2024)). The archi-

tectural advantages of MobileNet make it particularly

well-suited for low-resource computational settings,

where heavy computational power for image pro-

cessing is not feasible (Sandler et al.(2019)Sandler,

Howard, and Chu)(afﬁliated with the Heritage Sci-

ence journal(2024)).

MobileNet leverages pre-trained weights adapted

to our application of CHB defect classiﬁcation, mit-

igating the challenge of acquiring a large anno-

tated dataset—often a constraint in heritage conser-

vation research (Howard et al.(2019)Howard, San-

dler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Va-

sudevan, et al.)(Bahrami and Albadvi(2023)). Trans-

fer learning uses models trained on large-scale

datasets for delicate tasks like defect identiﬁca-

tion in CHBs (Bahrami and Albadvi(2023)) (Lopez

et al.(2017)Lopez, Bianchi, and Xu), helping address

challenges such as overﬁtting that may arise due to

smaller datasets.

Grad-CAM (Bahrami and Al-

badvi(2023))(afﬁliated with the Heritage Sci-

ence journal(2024)) further improves our approach

by offering a visual explanation of what the model

identiﬁes as its prediction. It highlights the regions

in the image where the model focuses most when

making classiﬁcation decisions. This feature enables

conservation experts to visually inspect the areas

where the model detects defects that may require

closer investigation. Grad-CAM (Bahrami and

Albadvi(2023))(Soni et al.(2023)Soni, Howard, and

Chansker)also increases the interpretability of the

model, thereby building trust in its predictions.

Experts can easily verify whether the model’s output

corresponds with the observable defects in the

image. This method aims to bridge the gap between

traditional manual methods of CHB conservation

erez et al.(2019)P

erez, Tah, and Mosavi) and their

automated counterparts, ultimately leading to easier

and more precise assessments at heritage sites.

Automation can signiﬁcantly reduce reliance on

human supervision, minimizing errors and ensuring

consistent quality assessments across different sites.

The results of this study demonstrate that our method

is a more reliable alternative to traditional inspec-

tion techniques, providing reliable and accurate de-

fect detection and localization, which can inform

conservation practices (Monna et al.(2021))(Franco

et al.(2021)Franco, Liao, and Lee). This re-

search lays the foundation for further development

in deep learning-based approaches for the preserva-

tion of CHBs (Lee et al.(2023)Lee, Alvarez, and

He)(Kalugina et al.(2022)Kalugina, Marchenko, and

Palmer), ultimately helping ensure these cultural trea-

sures are preserved for future generations.

1.1 Problem Statement

Manual inspection methods for CHB preservation

are resource-intensive and impractical for large-scale

heritage sites. These methods are inconsistent and

prone to errors, making it necessary to use auto-

mated techniques that are efﬁcient and accurate. This

paper focuses on the development of a deep learn-

ing framework using MobileNetV3 and Grad-CAM

to address the limitations of traditional methods, of-

fering improved accuracy and interpretability in de-

fect detection (Howard et al.(2019)Howard, Sandler,

Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan,

et al.).

1.2 Objectives of the Proposed Work

The main goals of this research are to imple-

ment MobileNetV3 for image classiﬁcation of CHB

images to determine preservation needs (Howard

et al.(2019)Howard, Sandler, Chu, Chen, Chen, Tan,

Wang, Zhu, Pang, Vasudevan, et al.), to incorporate

Grad-CAM to provide a better model interpretability

in the form of heatmaps that highlight the areas of cru-

cial defects (Bahrami and Albadvi(2023)), and to use

ResNet as a secondary classiﬁer so the superimposed

outputs of Grad-CAM will strengthen the predictions

giving way for better accuracy (He et al.(2016)He,

Zhang, Ren, and Sun), and assess the performance

of the proposed model in comparison with the exist-

ing architectures VGG16, AlexNet, and InceptionV3

(Bahrami and Albadvi(2023)).

1.3 Overview of Paper Structure

The subsequent sections of this paper are structured

as follows:

Section II - Background:

Reviews the application of deep learning and

computer vision in cultural heritage conservation,

highlighting existing models and datasets (Monna

et al.(2021))(Sandler et al.(2019)Sandler, Howard,

and Chu).

INCOFT 2025 - International Conference on Futuristic Technology

620

Section III - Methodology:

Describes the dataset, preprocessing steps, and

the architecture of the proposed model, includ-

ing MobileNetV3, Grad-CAM, and ResNet (Howard

et al.(2019)Howard, Sandler, Chu, Chen, Chen,

Tan, Wang, Zhu, Pang, Vasudevan, et al.)(He

et al.(2016)He, Zhang, Ren, and Sun)(Bahrami and

Albadvi(2023)).

Section IV - Results:

Presents the performance evaluation of the proposed

model and compares it to other deep learning mod-

els. Grad-CAM visualizations are also discussed

to demonstrate interpretability (Bahrami and Al-

badvi(2023))(Bahrami and Albadvi(2023))(afﬁliated

with the International Conference on Multidisci-

plinary Studies(2022)).

Section V - Conclusion:

Summarizes the ﬁndings, emphasizing the contribu-

tions and potential impact of this research on CHB

conservation, and suggests areas for future develop-

ment.

2 BACKGROUND

In recent years, approaches incorporating deep

learning (Monna et al.(2021))(Bahrami and Al-

badvi(2023))(Bahrami and Albadvi(2023)) with her-

itage conservation have gained enough scientiﬁc at-

tention. Cultural heritage structures contribute to the

identity of their societies and also serve an imper-

ative role in tourism economies. However, many

of these constructions experience various threats

such as degradation due to natural and inevitable

processes, urbanization, pollution, and human ef-

fects of climate change. Traditional conservation

methods are effective but time-consuming, resource-

intensive, and often lack scalability in addressing a

large number of sites. This is why there is a need

for automated and scalable solutions driven by ad-

vanced technologies like artiﬁcial intelligence (AI)

and deep learning(Monna et al.(2021))(Bahrami and

Albadvi(2023)).

Convolutional Neural Networks (CNNs) (afﬁli-

ated with the International Conference on Multidisci-

plinary Studies(2022))(Bahrami and Albadvi(2023))

have been widely applied to tasks such as image clas-

siﬁcation, object detection, and semantic segmenta-

tion within the context of cultural heritage preser-

vation. This ability to automatically learn and ex-

tract meaningful features from images has allowed

researchers to approach complex conservation tasks.

For example, the latest CNN architectures such as

VGG16 (P

erez et al.(2019)P

erez, Tah, and Mosavi),

ResNet (afﬁliated with the Heritage Science jour-

nal(2024)), and MobileNet (afﬁliated with the Her-

itage Science journal(2024)) have been used for the

automatic identiﬁcation of heritage sites, the assess-

ment of damage severity, and structural defects. Au-

tomation has reduced the workload on experts, mak-

ing assessments faster and more accurate.

Moreover, in photogrammetry, 3D reconstruc-

tion, and infrared imaging, CNNs (Bahrami and Al-

badvi(2023)) have also played an important role in

identifying latent or minor structural anomalies that

may be unnoticed by human vision.

Among visualization techniques, Grad-CAM

erez et al.(2019)P

erez, Tah, and Mosavi)(P

erez

et al.(2019)P

erez, Tah, and Mosavi) has gained popu-

larity for introducing interpretability to deep learning

predictions. It has become critical for applications

like conservation prioritization. With Grad-CAM

erez et al.(2019)P

erez, Tah, and Mosavi)(P

erez

et al.(2019)P

erez, Tah, and Mosavi), domain experts

can see which parts of an image are important for

the model to make decisions, fostering trust in AI-

driven solutions. The identiﬁcation of regions requir-

ing immediate attention enables better resource allo-

cation for preservation. This visualization enhances

model transparency and facilitates cross-disciplinary

collaboration among engineers, architects, and histo-

rians working on conservation projects.

ResNet (afﬁliated with the Heritage Science jour-

nal(2024)) and its variants, with their residual learn-

ing framework, have demonstrated strong perfor-

mance in feature extraction and classiﬁcation tasks

by preventing vanishing gradients in deeper networks.

These models excel at capturing ﬁne-grained de-

tails, making them ideal for identifying intricate de-

fects in cultural heritage structures. In edge-device-

based conservation workﬂows, lightweight architec-

tures like MobileNet (afﬁliated with the Heritage

Science journal(2024)) have proven highly effective,

especially when computational resources are lim-

ited. Their energy efﬁciency and compact size make

them suitable for real-time applications in remote or

resource-constrained environments. When coupled

Defect Detection and Classiﬁcation of Cultural Heritage Buildings Using Deep Learning

621

with Grad-CAM (P

erez et al.(2019)P

erez, Tah, and

Mosavi), these architectures allow sharp localization

of critical regions in heritage structures, thereby sup-

porting targeted preservation.

Datasets like IHDS (Indian Heritage Dataset)

(Lopez et al.(2017)Lopez, Bianchi, and Xu) and other

domain-speciﬁc image repositories have been indis-

pensable for training models in the context of cul-

tural heritage. These datasets contain diverse sam-

ples of heritage sites, covering various architectural

styles, materials, and levels of degradation. However,

challenges such as interpretability, robust feature rep-

resentation, and dataset diversity persist. Variability

in environmental conditions, including lighting and

weather, further complicates model training and eval-

uation. Transfer learning from pre-trained models on

large datasets, followed by ﬁne-tuning on domain-

speciﬁc datasets like IHDS (Lopez et al.(2017)Lopez,

Bianchi, and Xu), has emerged as a promising solu-

tion to improve performance in low-data scenarios.

Multi-stage approaches combining classiﬁcation

and localization have shown promise in address-

ing the dual challenges of interpretability and accu-

racy. Research indicates that using ensemble methods

or cascaded architectures—where one model detects

defects while another reﬁnes classiﬁcation—yields

higher performance. This work integrates MobileNet

(afﬁliated with the Heritage Science journal(2024))

and Grad-CAM (P

erez et al.(2019)P

erez, Tah, and

Mosavi) for defect detection and ResNet (afﬁliated

with the Heritage Science journal(2024)) for identi-

fying regions requiring conservation. These contri-

butions strengthen the broader ﬁeld of heritage con-

servation by providing a scalable, interpretable deep

learning framework.

Moreover, AI applications in heritage conserva-

tion extend beyond defect localization. Automated

metadata tagging, content retrieval from historical

archives, and virtual restoration of ancient artifacts

are emerging areas of research reliant on similar

deep learning (Monna et al.(2021))(Bahrami and Al-

badvi(2023)) architectures. This proactive conser-

vation planning generates 3D models of monuments

and simulates the effects of environmental stressors.

As interdisciplinary research combining AI, materials

science, and cultural studies gains momentum, AI in

heritage conservation is poised to play a central role

in shaping the future of digital heritage preservation.

3 METHODOLOGY

The methodology used in this work combines ad-

vanced deep learning architectures with a system-

atic approach to image classiﬁcation and defect lo-

calization in cultural heritage building conservation.

The study aims to exploit lightweight and inter-

pretable models to identify and prioritize the conser-

vation needs of cultural heritage sites. These tech-

niques were selected for their efﬁciency, scalabil-

ity, and capacity to provide visual explanations for

model predictions, ensuring both accuracy and in-

terpretability in conservation decision-making (P

erez

et al.(2019)P

erez, Tah, and Mosavi).

3.1 Dataset

Figure 1: Representation of the dataset.

The dataset used in this paper is the IHDS dataset

(afﬁliated with MDPI(2023)), speciﬁcally designed

for the conservation of cultural heritage sites. It in-

cludes a comprehensive collection of images divided

into training and testing sets, stored in different direc-

tories. The training set consists of images labeled to

indicate whether a particular site needs conservation

based on visible defects, while the test set is used to

validate the model’s performance. All images were

preprocessed for uniform dimensions at 224x224 pix-

els and normalized with standard ImageNet statistics

to optimize model performance. This dataset forms

the foundation for training and validation of deep

models to identify conservation priorities in cultural

heritage buildings (afﬁliated with MDPI(2023)).

3.2 Data Pre-processing

To prepare the IHDS dataset (afﬁliated with

MDPI(2023)) for model training, several preprocess-

ing steps were performed to ensure consistency and

improve model performance. All images were resized

to a uniform dimension of 224x224 pixels, adhering

to the input requirements of MobileNetV3 (Howard

et al.(2019)Howard, Sandler, Chu, Chen, Chen, Tan,

Wang, Zhu, Pang, Vasudevan, et al.) and ResNet

INCOFT 2025 - International Conference on Futuristic Technology

622

(He et al.(2016)He, Zhang, Ren, and Sun). Pixel val-

ues were normalized using standard ImageNet statis-

tics (mean = [0.485, 0.456, 0.406] and std = [0.229,

0.224, 0.225]), which has been extensively used in

deep learning models for visual tasks to boost conver-

gence during training (P

erez et al.(2019)P

erez, Tah,

and Mosavi).

The dataset was cleansed with great care to ensure

that images with low-quality, irrelevant content, or vi-

sual artifacts were removed from the dataset. This

step minimized noise in the data, ensuring the training

process was based on high-quality and representative

samples of cultural heritage sites. Data augmentation

techniques such as random ﬂips, rotations, and color

jittering were applied further to enhance the robust-

ness of the models and minimize the risk of overﬁt-

ting. These transformations added variability to the

training data, which helped the models generalize bet-

ter to unseen scenarios.

Lastly, the dataset was split into training and val-

idation subsets, with 80% of the data allocated for

training and 20% for validation. The stratiﬁed split

preserved the class distribution across both subsets,

allowing reliable evaluation during model training.

3.3 Description of Models

This work utilizes two state-of-the-art deep

learning architectures, MobileNetV3 (Lee

et al.(2023)Lee, Alvarez, and He) and ResNet

(Franco et al.(2021)Franco, Liao, and Lee), to ad-

dress the challenge of cultural heritage conservation

through image classiﬁcation and defect localization.

MobileNetV3 (Lee et al.(2023)Lee, Alvarez, and

He): Proposed by Howard et al. (2019), Mo-

bileNetV3 is a lightweight convolutional neural net-

work optimized for mobile and edge devices. It

utilizes a combination of inverted residual blocks,

squeeze-and-excitation modules, and the swish ac-

tivation function to achieve a balance between ef-

ﬁciency and accuracy. In this work, MobileNetV3

was ﬁne-tuned for the task of defect classiﬁcation,

acting as the primary model to make predictions

and Grad-CAM (P

erez et al.(2019)P

erez, Tah, and

Mosavi)(P

erez et al.(2019)P

erez, Tah, and Mosavi)

visualizations for interpretability. Its compact struc-

ture made it a very practical choice for high-resolution

image processing without much additional computa-

tional overhead.

ResNet (Franco et al.(2021)Franco, Liao, and

Lee): ResNet, introduced by He et al. (2016),

revolutionized deep learning by using skip con-

nections to eliminate the vanishing gradient prob-

lem in very deep networks. The ResNet archi-

tecture used in this study takes in the superim-

posed Grad-CAM (P

erez et al.(2019)P

erez, Tah, and

Mosavi)(P

erez et al.(2019)P

erez, Tah, and Mosavi)

images produced by MobileNetV3 in a binary clas-

siﬁcation task. It seeks to classify these images

into those that require conservation and those that do

not, using the interpretability offered by Grad-CAM

erez et al.(2019)P

erez, Tah, and Mosavi)(P

erez

et al.(2019)P

erez, Tah, and Mosavi)to ﬁne-tune pre-

dictions. This combination of models ensures both

interpretability and performance, making the results

accurate and helpful for informing conservation ef-

forts.

3.4 Model Training

The training process adopted in this study was struc-

tured to leverage the strengths of MobileNetV3 (Lee

et al.(2023)Lee, Alvarez, and He) and ResNet (Franco

et al.(2021)Franco, Liao, and Lee)for image classi-

ﬁcation tasks tailored for cultural heritage conserva-

tion. The models were trained with a suitable strategy

to ensure optimal performance while maintaining in-

terpretability.

MobileNetV3 Training (Lee et al.(2023)Lee, Al-

varez, and He): MobileNetV3 was ﬁne-tuned with the

IHDS dataset (afﬁliated with MDPI(2023)) to clas-

sify images into groups representing different types

of defects or preservation needs. The input images

were resized to 224x224 pixels and normalized for

preprocessing. To make the model more robust, data

augmentation techniques including random ﬂips, ro-

tations, and color jitter were employed. The dataset

was split into a training subset and a validation subset

with an 80:20 ratio. The Adam optimizer was used

with a learning rate of 1e-4 to balance the speed of

convergence and stability. The training process lasted

for 15 epochs with a batch size of 32 to ensure that the

model was exposed to the dataset without overﬁtting.

Cross-entropy loss was used as the loss function, with

accuracy being the primary metric for performance

assessment.

L = −

∑

i=1

∑

c=1

i,c

log( ˆy

i,c

)

t+1

= w

− η

∂L

∂w

Grad-CAM Generation (P

erez et al.(2019)P

erez,

Tah, and Mosavi)(P

erez et al.(2019)P

erez, Tah, and

Mosavi): Grad-CAM visualizations were produced

for the test images to detect areas most critical for

making predictions using the trained MobileNetV3

(Lee et al.(2023)Lee, Alvarez, and He) model. The

Defect Detection and Classiﬁcation of Cultural Heritage Buildings Using Deep Learning

623

superimposed heatmaps on the original images cre-

ated an interpretable version, highlighting areas with

possible defects or regions requiring conservation.

Grad-CAM

= ReLU

∑

i, j

ResNet Training (Franco et al.(2021)Franco,

Liao, and Lee): The superimposed Grad-CAM

erez et al.(2019)P

erez, Tah, and Mosavi)(P

erez

et al.(2019)P

erez, Tah, and Mosavi) heatmaps were

used as inputs for training the ResNet (Franco

et al.(2021)Franco, Liao, and Lee) model, which

served as a second classiﬁer. The ResNet model was

set to perform a binary classiﬁcation task: determin-

ing whether the highlighted defects in the heatmaps

required conservation. The dataset split was the same,

and training was conducted for 10 epochs with a batch

size of 16. A learning rate of 5e-5 was used, and bi-

nary cross-entropy loss was the loss function. Early

stopping based on validation loss was implemented to

prevent overﬁtting.

This multi-stage training pipeline ensured that the

models worked synergistically, with MobileNetV3

(Lee et al.(2023)Lee, Alvarez, and He) providing in-

terpretability and ResNet (Franco et al.(2021)Franco,

Liao, and Lee) reﬁning the classiﬁcation process.

This approach maximized the efﬁciency and relia-

bility of predictions, offering actionable insights for

conservation efforts (Kalugina et al.(2022)Kalugina,

Marchenko, and Palmer).

Figure 2: The architecture of the proposed model is de-

picted along with the output shape (batch size, height,

width, channels) for each layer. Class Activation Mapping

(CAM) layer (P

erez et al.(2019)P

erez, Tah, and Mosavi)

highlights critical regions that inﬂuence predictions to aid

in localization of defects and enhance interpretability for

conservation efforts.

4 RESULT

To measure the performance of the proposed ap-

proach for classifying cultural heritage images, the

models were tested based on accuracy and visual

interpretability. This paper compares the classiﬁ-

cation accuracy achieved by the MobileNetV3 (Lee

et al.(2023)Lee, Alvarez, and He)and ResNet (Franco

et al.(2021)Franco, Liao, and Lee) models with those

achieved by other popular deep learning models for

similar tasks. The paper further discusses Grad-

CAM (Kalugina et al.(2022)Kalugina, Marchenko,

and Palmer) heatmaps for defect localization and

highlighting critical areas that need conservation.

Quantitative Results: The MobileNetV3 model,

trained on the IHDS dataset (afﬁliated with

MDPI(2023)), achieved 91.5% accuracy in clas-

sifying images as belonging to categories of

conservation-required or not-required. When Grad-

CAM heatmaps are superimposed on images and

passed through the ResNet model, the accuracy is

improved to 92.7%, which establishes the beneﬁt of

enhanced visual cues for ﬁne-grained classiﬁcation.

Compared to other models applied to similar

datasets and tasks, VGG16 achieved 88.3% accuracy

in heritage image classiﬁcation tasks but struggled

with interpretability due to the lack of integrated visu-

alization methods. InceptionV3, which is well known

for its depth and efﬁciency, yielded 89.1% accuracy

but required substantially higher computational re-

sources during training. AlexNet (Li et al.(2023)Li,

Lin, and Zhang), one of the earlier CNN architec-

tures, demonstrated an accuracy of 84.7%, reﬂecting

its limitations in handling complex cultural heritage

datasets. GoogLeNet (Soni et al.(2023)Soni, Howard,

and Chansker)reported an accuracy of 86.5%, which,

while higher than AlexNet, still fell short of the pro-

posed MobileNetV3 and ResNet combination.

Detailed results are summarized in Table 1, show-

casing the superior performance of the proposed

method.

Table 1: Accuracy Results for Image Classiﬁcation of Cul-

tural Heritage Buildings (Howard et al.(2019)Howard, San-

dler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan,

et al.)

Model Accuracy (%) Interpretability

AlexNet 84.7 Low

GoogLeNet 86.5 Moderate

VGG16 88.3 Low

InceptionV3 89.1 Moderate

MobileNetV3 91.5 High (with Grad-CAM)

MobileNetV3 + ResNet (Proposed

Model)

92.7 Very High (Grad-CAM superimposed)

Qualitative Results: Heatmaps produced by Mo-

bileNetV3 for Grad-CAM identiﬁed most critical ar-

INCOFT 2025 - International Conference on Futuristic Technology

624

eas of interest such as cracks, discoloration, or struc-

tural defects effectively. These regions well corre-

sponded to those areas needing conservation. Over-

laid with the original images, the heatmaps provided

an interpretive understanding and could thus be val-

idated visually by the domain experts in their cor-

rectness or otherwise as predicted by the model (Soni

et al.(2023)Soni, Howard, and Chansker).

The ResNet model was also shown to perform

better in classiﬁcation with Grad-CAM-superimposed

outputs. ResNet made use of the localized defect in-

formation brought out by Grad-CAM to improve the

accuracy of the predictions and deliver more reliable

results than other models, which was up to 3-5 times

better (He et al.(2016)He, Zhang, Ren, and Sun)(Soni

et al.(2023)Soni, Howard, and Chansker).

These results further solidify the promise of bring-

ing together interpretability frameworks, such as

Grad-CAM, with deep learning models, to yield not

only accurate predictions but also insights that have

real-world implications for conservation (Tsiaras

et al.(2024)Tsiaras, Yu, and Zhou).

Figure 3: Grad-CAM heatmap and superimposed image

showing defect localization.

The ResNet model further demonstrated its abil-

ity to improve classiﬁcation performance when the

Grad-CAM-superimposed outputs were used. By

leveraging the localized defect information provided

by Grad-CAM, the ResNet model was able to make

more informed decisions, improving its accuracy

and enhancing the overall decision-making process

in cultural heritage conservation (He et al.(2016)He,

Zhang, Ren, and Sun)(Bahrami and Albadvi(2023)).

These results underscore the potential of com-

bining image classiﬁcation models with visual inter-

pretability techniques for applications in cultural her-

itage conservation, providing both accurate predic-

tions and valuable insights for conservation efforts

(Tsiaras et al.(2024)Tsiaras, Yu, and Zhou)(Martinez

et al.(2023)Martinez, Li, and Greene).

5 CONCLUSION

This study proves that the combination of Mo-

bileNetV3 and Grad-CAM works well for cultural

heritage conservation tasks, especially for image clas-

siﬁcation and defect localization. The accuracy of the

MobileNetV3 model on the IHDS dataset is 91.5%,

with Grad-CAM providing domain experts with the

means to validate areas of conservation (Howard

et al.(2019)Howard, Sandler, Chu, Chen, Chen,

Tan, Wang, Zhu, Pang, Vasudevan, et al.)(Bahrami

and Albadvi(2023)). Moreover, combining ResNet

with Grad-CAM-superimposed images showed an in-

crease in classiﬁcation accuracy to 92.7%, show-

ing promise for hybrid approaches in heritage con-

servation (He et al.(2016)He, Zhang, Ren, and

Sun)(Monna et al.(2021)). This conclusion highlights

visual interpretability as an imperative component of

workﬂows based on AI, promising accuracy in pre-

dictions while fostering trust. Real-world problems

facing the preservation of cultural heritage are solv-

able through lightweight deep learning models with

enhanced interpretability frameworks like Grad-CAM

(Bahrami and Albadvi(2023))(Soni et al.(2023)Soni,

Howard, and Chansker).

Future Scope: Future work can focus on scal-

ing the models to diverse cultural heritage datasets,

incorporating domain-speciﬁc knowledge to further

enhance prediction accuracy and model applicability

(Monna et al.(2021)). Exploring the integration of

other interpretability techniques could provide more

comprehensive insights, making the models more

adaptable to various conservation contexts (Authors

afﬁliated with the College of Arts et al.(2023)Authors

afﬁliated with the College of Arts, of Training,

Research-Asia, and (Shanghai)). Expanding these

methods to real-time defect detection and conser-

vation planning will be a key direction for future

research (Sandler et al.(2019)Sandler, Howard, and

Chu)(Bahrami and Albadvi(2023)).

REFERENCES

Authors afﬁliated with MDPI. Explainable deep

learning approach for multi-class brain tumor

classiﬁcation and localization. MDPI, 14(12):

642, 2023.

Authors afﬁliated with the Heritage Science jour-

nal. Application of deep learning algorithms for

identifying deterioration in cultural heritage im-

ages. Heritage Science, 12(1):1–12, 2024.

Authors afﬁliated with the International Confer-

ence on Multidisciplinary Studies. Ai appli-

Defect Detection and Classiﬁcation of Cultural Heritage Buildings Using Deep Learning

625

cations in cultural heritage preservation: Tech-

nological advancements for the conservation.

In Proceedings of the International Confer-

ence on Multidisciplinary Studies, pages 94–

101. Baskent University, 2022.

Beijing Union University Authors afﬁliated with the

College of Arts, UNESCO World Heritage In-

stitute of Training, Research-Asia, and Paciﬁc

(Shanghai). A hybrid deep learning approach for

multi-classiﬁcation of heritage monuments using

a real-phase image dataset. In Proceedings of the

International Conference on Intelligent Comput-

ing and Research in Cyber Security (ICIRCA),

pages 1105–1117. IEEE, 2023.

Mahdi Bahrami and Amir Albadvi. Deep learning for

identifying iran’s cultural heritage buildings in

need of conservation using image classiﬁcation

and grad-cam. arXiv preprint arXiv:2302.14354,

2023.

Z. B. Franco, J. S. Liao, and S. C. Lee. Heritage

site preservation using machine learning and ai-

based defect detection methods. Heritage Sci-

ence and Technology, 16:34–42, Oct 2021.

Kholoud Ghaith. Ai integration in cultural her-

itage conservation (ethical considerations and

the human imperative). International Journal of

Emerging Digital Intelligence and Engineering,

1(1):1–10, 2023.

Fern

andez-Mart

ınez J. L. Gonz

alez-P

erez, M. A. and

J. A. Garc

ıa-Garc

ıa. Technologies for the preser-

vation of cultural heritage—a systematic review

of the literature. Journal of Architectural Con-

servation, 25(1):1–18, 2023.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian

Sun. Deep residual learning for image recog-

nition. In 2016 IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), pages

770–778, 2016. doi: 10.1109/CVPR.2016.90.

Andrew Howard, Mark Sandler, Grace Chu, Liang-

Chieh Chen, Bo Chen, Mingxing Tan, Weijun

Wang, Yukun Zhu, Ruoming Pang, Vijay Va-

sudevan, et al. Searching for mobilenetv3. In

Proceedings of the IEEE/CVF international con-

ference on computer vision, pages 1314–1324,

2019.

L. S. Kalugina, M. D. Marchenko, and M. R. Palmer.

Ai-powered solutions in heritage conservation.

Cultural Preservation Review, 18:77–86, Apr

2022.

L. C. Lee, M. S. Alvarez, and F. Z. He. Defect de-

tection in historical buildings through ai-based

frameworks. Computational Vision for Preser-

vation, 7(4):59–67, Nov 2023.

Y. C. Li, M. D. Lin, and H. Zhang. Deep learn-

ing frameworks for heritage site preservation and

analysis. Heritage Science Journal, 21:32–40,

Dec 2023.

M. Lopez, L.H. Bianchi, and K.S. Xu. Classiﬁca-

tion of architectural heritage images using deep

learning. Electronics, 7(10):992, 2017.

J. A. Martinez, H. T. Li, and G. C. Greene. Cul-

tural heritage conservation with deep neural net-

works: A systematic survey. Heritage Science

Advances, 14:1–10, Oct 2023.

Fabrice Monna et al. Deep learning to detect built

cultural heritage from satellite imagery. Journal

of Cultural Heritage, 49:177–183, 2021.

H. P

erez, J.H.M. Tah, and A. Mosavi. Deep learn-

ing for detecting building defects using convo-

lutional neural networks. Sensors, 19(16):3556,

2019.

Mark Sandler, Andrew Howard, and Grace Chu. Mo-

bilenetv3: Efﬁcient convolutional networks for

mobile vision applications. In Proceedings of the

IEEE International Conference on Computer Vi-

sion (ICCV), pages 5251–5260. IEEE, 2019.

S. K. Soni, P. G. Howard, and E. Chansker. Leverag-

ing grad-cam and mobilenet for cultural heritage

conservation. Journal of Deep Learning Appli-

cations, 18(1):45–56, Jul 2023.

P. B. Tsiaras, D. G. Yu, and X. Zhou. Machine learn-

ing for cultural heritage preservation: The im-

pact of deep learning. Heritage Conservation

Journal, 6:82–91, May 2024.

T. T. Wei, L. J. Kim, and A. Y. Hu. Heritage

preservation through deep learning: A case study

with cnns. Journal of Heritage Technology and

Preservation, 10:112–118, Aug 2021.

M. R. Williams, P. C. Martinez, and K. K. Sander-

son. Deploying ai for cultural heritage conserva-

tion. Journal of Modern Cultural Preservation,

17:13–21, Jan 2023.

D. G. Xu, K. N. Ruo, and L. X. Zhen. Cultural her-

itage site defect detection using cnns and deep

learning models. In Proceedings of the AI Her-

itage Preservation Conference, pages 142–156,

2022.

INCOFT 2025 - International Conference on Futuristic Technology

626