Color-Light Multi Cascade Network for Single Image Depth Prediction

on One Perspective Artifact Images

Aufaclav Zatu Kusuma Frisky

1,2

, Simon Brenner

, Sebastian Zambanini

and Robert Sablatnig

Computer Vision Lab, Institute of Visual Computing and Human-Centered Technology,

Faculty of Informatik, TU Wien, Austria

Electronics and Instrumentations Lab, Department of Computer Science and Electronics,

Universitas Gadjah Mada, Yogyakarta, Indonesia

Keywords:

Single Image, Depth Prediction, Color-light, Multi Cascade, One-side Perspective, State-of-the-Art, Roman

Coins, Temple Relief.

Abstract:

Different color material and extreme lighting change pose a problem for single image depth prediction on

archeological artifacts. These conditions can lead to misprediction on the surface of the foreground depth re-

construction. We propose a new method, the Color-Light Multi-Cascade Network, to overcome single image

depth prediction limitations under these inﬂuences. Two feature extractions based on Multi-Cascade Networks

(MCNet) are trained to deal with light and color problems individually for this new approach. By concatenat-

ing both of the features, we create a new architecture capable of reducing both color and light problems. Three

datasets are used to evaluate the method with respect to color and lighting variations. Our experiments show

that the individual Color-MCNet can improve the performance in the presence of color variations and fails to

handle extreme light changes; the Light-MCNet, on the other hand, shows consistent results under changing

lighting conditions but lacks detail. When joining the feature maps of Color-MCNet and Light-MCNet, we

obtain a detailed surface both in the presence of different material colors in relief images, and under differ-

ent lighting conditions. These results prove that our networks outperform state-of-the-art in limited number

dataset. Finally, we also evaluate our joined network on the NYU Depth V2 Dataset to compare it with other

state-of-the-art methods and obtain comparable performance.

1 INTRODUCTION

3D scanners are widely used these days for digital

archiving of objects of cultural heritage (Georgopou-

los et al., 2010); however, 3D scanning of scenes and

multiple objects is time-consuming. The use of these

scanners is subject to high maintenance conﬁgura-

tions, such as the light, distance, and the number of

scans (Georgopoulos et al., 2010). The lack of en-

ergy sources at sites that are difﬁcult to access exac-

erbates this problem. If scanning using high preci-

sion scanners is too expensive, archaeologists search

for time-efﬁcient and robust alternatives (Frisky et al.,

2020). In practical terms, Single Image Reconstruc-

tion (SIR) is more time-efﬁcient when opposed to

Structure-from-Motion and structure light scanning

techniques, as 3D models can be obtained efﬁciently

only using a single image. Distance prediction, re-

ferred to as Single Image Depth Prediction (SIDP),

is one of the crucial steps in the reconstruction phase

that uses in SIR (other than depth to 3D transfer), that

decides the ﬁnal product’s quality (Ming et al., 2021).

Problems can arise in SIDP due to material ob-

ject colors and extreme lighting conditions. The ma-

terial color poses problems in two different scenar-

ios that usually appear in real-life depth prediction

(Frisky et al., 2021c). In the ﬁrst scenario, surfaces at

different depths appear in the same color, leading to

misprediction at depth discontinuities. In the second

scenario, different surface colors appear in regions of

equal depth, leading to noise in the depth predictions.

Light conditions change the appearance of the ar-

tifact in the image. This phenomenon is also present

in outdoor scenes and thus appears in NYU Depth V2

(Silberman et al., 2012). To the best of our knowl-

edge, the handling of extreme light conditions is not

addressed in a speciﬁc manner in previous SIDP re-

search (Frisky et al., 2021b). An investigation in

Frisky et. al (Frisky et al., 2021b) work shows that

the state of the arts cannot handle the extreme change

Frisky, A., Brenner, S., Zambanini, S. and Sablatnig, R.

Color-Light Multi Cascade Network for Single Image Depth Prediction on One Perspective Artifact Images.

DOI: 10.5220/0010801100003124

In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages

909-916

ISBN: 978-989-758-555-5; ISSN: 2184-4321

909

of light. It inspired us to create a new method that

is robust to both color and lighting variations. In

speciﬁc areas, such as cultural heritage applications,

most datasets contain a limited number of samples

(Brenner et al., 2018; Frisky et al., 2021a). This con-

dition also motivates a creation of a system that per-

forms well on small datasets.

The contributions of this paper are summarized as

follows:

1. We improve the architecture of the previous color

robust network (Color-MCNet) by changing the

number of dimensions of the transferred feature

map.

2. We create a new Light-MCNet architecture using

input from Intrinsic Image Decomposition to han-

dle extreme lighting variations.

3. We create a new a combined architecture by join-

ing the color- and Light-MCNet feature maps,

thereby improving both extreme lighting and ma-

terial color robustness of features in the ﬁnal

output. Our system outperforms state-of-the-art

methods in small datasets.

2 RELATED WORK

Historically, single-image 3D reconstruction has

been approached via shape-from-shading (Ruo Zhang

et al., 1999). However, the pure shape from shad-

ing methods make use of only a single depth cue and

are sensitive to color variations and depth discontinu-

ities (Ming et al., 2021). Saxena et al. (Saxena et al.,

2009) estimated depth from a single image by train-

ing a Markov Random Field on local and global im-

age features. Oswald et al. (Oswald et al., 2012) im-

proved the performance using interactive user input

with the same depth estimation problem. In archae-

ology, the reconstruction of cultural objects using 2D

images is used because of its ﬂexibility and efﬁciency

(Frisky et al., 2020). Pollefeys et al. (Pollefeys et al.,

2001) present an approach that obtains virtual mod-

els. Regarding the color in archeology artifacts, two

sub-problems need to be solved (Frisky et al., 2021c).

First, an exemplary system needs to be able to predict

the depth in the presence of different surface colors,

as they appear on relief surfaces. Second, the system

needs to reconstruct surfaces with different depths but

similar material colors. For most artifacts, foreground

and background (e.g. relief and wall) are made of the

same material.

In order to predict the depth and the shape, most

methods use RGB color as an input. In recent work,

Pan et al. (Pan et al., 2018) reconstruct a Borobudur

relief using single image reconstruction based on the

multi-depth approach from Eigen and Fergus (Eigen

and Fergus, 2015). However, the data used in their

paper does not involve different material colors. For

the requirement to solve these problems in the archae-

ological area, Frisky et al. provide a Registered Relief

Depth (RRD) dataset consisting of RGB images and

its corresponding depth on outdoor Borobudur (Frisky

et al., 2021a), and Prambanan reliefs (Frisky et al.,

2021c). Frisky et al. also propose a new method

called MCCNet (Frisky et al., 2021c) that uses the

cascaded color spaces with weight transfer to solve

both mentioned color problems.

Varying lighting situations pose another problem

for SIDP, which can affect performance and are dif-

ﬁcult to control in natural environments. To the best

of our knowledge, no speciﬁc research available in

monocular depth reconstruction mentions this prob-

lem. Most recent research uses available datasets such

as an indoor NYU Depth Dataset, that only use nat-

ural light (Song et al., 2021). However, no speciﬁc

experiment shows that it is robust to extreme changes

in lighting direction. Datasets of imaged under vary-

ing lighting conditions, together with a corresponding

depth map, are commonly used in photometric stereo

research (Brenner et al., 2018). Brenner et al. cre-

ated a dataset of ancient roman coins, each illumi-

nated from 54 directions (Brenner et al., 2018). It is

believed that coin surface are similar to wall-reliefs in

that they are sufﬁciently modelled in 2.5D represen-

tations (Frisky et al., 2021b). Together with a depth

map created via photometric stereo, this Roman Coin

dataset can be used as ground truth for evaluating the

robustness of SIDP approaches with respect to light-

ing direction.

3 PROPOSED METHOD

In this work we propose a single image depth pre-

diction method that is robust against material color

and extreme lighting variations. Furthermore, a sin-

gle network that can extract both properties (robust

to color and light difference) with low computational

time is needed.

First, our work utilizes an improved version of the

previous cascaded network (Color-MCNet) (Frisky

et al., 2021c). Second, we add a new network robust

to lighting variations, called Light-MCNet. Then, a

feature joining mechanism to produce the ﬁnal result.

An overview of the proposed method (Color-Light-

MCNet) is given in Figure 1. In the following, the

building blocks of the approach are described in de-

tail.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

910

IID

Color Mul�-Cascade Network

Light Mul�-Cascade Network

5x5

conv

Upscaling

HSV

YCrCb

Convert

Color

feature

map

Light

feature

map

Joined

feature

map

Pre-processing

Feature extrac�on

Join process

Finaliza�on

Figure 1: Architecture of the proposed Color-Light-MCNet. IID is Intrinsic Image Decomposition. The numbered arrows

beside 32 and 64 indicate the dimensionality of the feature maps.

3.1 Color-MCNet

In the Color-MCNet section (see Figure 2, we use a

modiﬁed version of the MCCNet architecture from

Frisky et al. (Frisky et al., 2021c), which is designed

for SIDP robust to surface color variations and is re-

used in this work to obtain color robust features. The

input image is successively presented to the network

in RGB, YCrCb and HSV color spaces, where fea-

tures from each stage are concatenated with features

from the previous stage. Our networks make use of

the sub-architectures illustrated in Figure 4. In sub-

architecture 1a, we use a 9x9 convolution network, a

stride of 2 and 2x2 pooling. This conﬁguration makes

the feature map size a quarter compared to the input

size, and a 3x3 convolution is applied afterwards. In

sub-architecture 2a, similar to 1a, we used a 9x9 con-

volution network, a stride of 2 and 2x2 pooling. In

sub-architecture 2b, the feature map of the previous

cascade level is concatenated to the current feature

map, and in 2c, similar to 1b, a 3x3 convolution is ap-

plied. The weight transfer in this network is applied

from sub-architecture 1a to 2a and sub-architecture 1b

to 2c. While the transferred feature maps in the orig-

inal architecture were 1-dimensional (Frisky et al.,

2021c), now a 32-dimensional feature map is trans-

ferred in order to allow more information to ﬂow from

one stage of the cascade to the next.

3.2 Light-MCNet

In the Light-MCNet section (see Figure 3), we change

the input and parameters of Color-MCNet: the RGB

Sub

Architecture I

Architecture

YCrCb

HSV

RGB

Feature

map

Feature

map

Feature

map

Sub

Figure 2: Three color-spaces (RGB,YCrCb,HSV) Color-

MCNet feature extraction. The input image is successively

presented to the cascaded network in different color spaces.

Feature maps and weights (w) are propagated to subsequent

cascade levels. Wt denotes the weight transfer process. Sub

Architectures 1 (initial feature map generation) and 2 (fea-

ture map generation with concatenation) are given in Fig-

ure 4.

image is decomposed using the unsupervised intrinsic

image decomposition by Letry et al. (Lettry et al.,

2018), in order to separate the lighting effects from

the original surface reﬂectance (see Figure 5 for an

example). In conjunction with the RGB image on the

ﬁrst level, these two images become an input to the

network.

3.3 Color-Light-MCNet

The two architectures mentioned above, Color and

Light-MCNet, aim to solve their speciﬁc problems,

i.e., material color and extreme lighting problems, re-

Color-Light Multi Cascade Network for Single Image Depth Prediction on One Perspective Artifact Images

911

Architecture I

Architecture

Reﬂectance

Shading

RGB

IID

Feature

map

Feature

map

Feature

map

Sub

Figure 3: Light-MCNet feature extraction. On the ﬁrst cas-

cade level, the original RGB input image is presented to the

network. On the subsquent levels, reﬂectance and shading

components extracted by unsupervised intrinsic image de-

composition (Lettry et al., 2018) are used as inputs. The

rest of the network works analogous to the Color-MCNet

(Figure 2).

Feature

map

input

9x9 conv

2stride

2x2 pool

input

Feature

map

9x9 conv

2stride

2x2 pool

Feature

map

5x5 conv

Feature

map

concatenate

Feature

map

Sub Architecture 1

Sub Architecture 2

5x5 conv

a b

Figure 4: Sub architectures 1 and 2 that are used in Color-

MCNet and Light-MCNet.

RGB

Reflectance

Shading

13° 51° 71°

Figure 5: Example results of the intrinsic image decom-

position for different light elevation angles (Roman Coin

dataset).

spectively. However, a single architecture that can ad-

dress both problems simultaneously would be prefer-

able. Thus, we added a concatenation mechanism

on the output feature map in each architecture. The

integration aims to combine the color and light ro-

bust feature maps from the two previous architectures.

The combined features are passed to a 5x5 convolu-

tion into one feature dimension at the end of this net-

work. Finally, upscaling is carried out to return to

the original resolution. The architecture of the pro-

posed joint network is shown in Figure 1. The ar-

chitecture consists of four parts: pre-processing, fea-

ture extraction, join process, and ﬁnalization. In pre-

processing, conversion into different color-space is

performed for Color-MCNet, and Intrinsic image De-

composition (IID) is done for the Light-MCNet. In

the feature extraction part, color and light robust fea-

ture maps are extracted, which are subsequently com-

bined in the join process to a single feature map.

Lastly, the ﬁnalization part converts the feature map

into the ﬁnal depth prediction.

4 EXPERIMENTS AND RESULTS

The performance of our architecture is individually

tested for robustness to color and lighting variations,

using the dedicated RRD Temple dataset and the

Roman Coin dataset, respectively. For evaluating

the performance in a mixed environment and com-

parison to state-of-the-art methods, the NYU Depth

V2 Dataset is used. Additionally to evaluating our

full network (Color-Light-MCNet), we also test the

performance of its components (Color-MCNet and

Light-MCNet) individually. As a primary error met-

ric, we use the Root Mean Square Error (RMSE):

RMSE =

|N|

∑

i=1

||y

− y

∗

(1)

where y

is groundtruth depth, y

∗

is predicted depth

and N is the number of test points.

4.1 Datasets

Two specialized datasets represent the two main chal-

lenges addressed in this paper: the RRD Temple

dataset represents different materials in a temple re-

lief, and the Roman Coin dataset represents different

lighting conditions in the image. Both RRD temple

and Roman coin datasets have a limited number of

samples. This condition motivates the creation of a

system that can perform well using a small number of

data. Additionally, in order to test the general applica-

bility of our approach and compare it with state of the

art, we also use the public NYU Depth V2 dataset.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

912

These three datasets are described in the following

subsections.

First dataset is The Registered Relief Depth

(RRD) Temple dataset, consists of two relief datasets

acquired at two Indonesian temples: Prambanan and

Borobudur. The RRD Temple Dataset is created by

Frisky et al. (Frisky et al., 2021c; Frisky et al., 2021a)

to accommodate the color difference problem created

by different materials appearing on archaeological re-

liefs and different color by chemical reaction. From

the 41 reliefs of the RRD Prambanan dataset, we

use 21 for training and 20 for testing. In the RRD

Borobudur dataset, 20 were used as training and 10 as

test sets. In total, we obtain 41 training examples and

31 test examples.

Second dataset is the Roman coin, consists of 23

coins, 11 of which are imaged from both sides and

12 from one side only, resulting in a total of 34 coin

sides. The dataset was originally created for photo-

metric stereo reconstruction, using a PhaseOne IQ260

Achromatic camera and a light dome with 54 indi-

vidually controlled LED light sources for illumina-

tion (Brenner et al., 2018).In the training phase, each

image is paired with its corresponding depth map.

From the 34 coin sides (each represented by 54 input

pairs), 26 are used for training and eight for testing.

All the grayscale images are converted into RGB be-

fore processing them in order to ﬁt the network input.

The last dataset is the NYU Depth v2 that offers

images and depth maps for various indoor scenes (Sil-

berman et al., 2012). The dataset includes 120K train-

ing samples and 654 test samples, but we only use a

50K subset to train our network. This dataset is used

to compare our results to the current state-of-the-art

methods, as a majority of publications use it for eval-

uation.

4.1.1 Augmentation

In this work, we augment the training data of the three

datasets in several ways (Eigen and Fergus, 2015):

• Scaling: Input and target images are scaled by s ∈

[1, 1.5], and the depths are divided by s.

• Rotation: Input and target are rotated by r ∈

[−5, 5] degrees.

• Color: Input values are multiplied globally by a

random RGB value c ∈ [0.8, 1.2]

• Flips: Input and target are horizontally ﬂipped

with 0.5 probability

4.2 Conﬁguration

We test three different architectures: the individual

Color-MCNet and Light-MCNet and the combined

Color-Light-MCNet network. Both Color-MCNet

and Light-MCNet use three different inputs on a

three-level multi cascade network. As shown in Fig-

ures 2 and 3 both networks output 32 dimensional

feature maps. In order to obtain the ﬁnal depth es-

timations, these feature maps are passed to an addi-

tional 5x5 convolution and upscaling network, anal-

ogous to the ﬁnalization stage of the Color-Light-

MCNet shown in Figure 1. For our experiments, we

train and test the Color-MCNet, Light-MCNet and

Color-Light-MCNet on all three datasets from scratch

for 100 epochs with a learning rate of 0.001. Addi-

tionally, AdaBins (Bhat et al., 2020) is trained and

tested on the RRD Temple dataset and the Roman

Coin dataset for reference; the performance of Ad-

aBins on the NYU Depth V2 Dataset is given by the

authors (Bhat et al., 2020).

4.3 Results

The three datasets used in this work represent dif-

ferent problems: the RRD Temple dataset represents

color problems, the Roman Coin dataset represents

lighting problems, and the NYU Depth V2 dataset

represents common indoor scenes and allows a com-

parison to state of the art. We thus evaluate our three

architectures on these three datasets in order to assess

their performance with respect to each of their spe-

ciﬁc problems. Results, including a comparison to

AdaBins, are shown in TABLE 1.

On the RRD Temple dataset, the Color-MCNet

performs better than the Light-MCNet. The differ-

ence in the material in the Prambanan temple and the

yellowing color in the Borobudur temple can be ap-

propriately resolved (see Figure 6). Compared to the

two networks, color-Light-MCNet outperformed all

implemented networks, including the AdaBins net-

work. In Figure 6, it can be seen that Color-MCNet

and Color-Light-MCNet perform well to different

materials in relief. On the other hand, Light-MCNet

produces low detail depth, and AdaBins results ex-

hibit erroneous depth differences caused by different

materials.

Regarding the performance on the Roman Coin

dataset, the Color-MCNet architecture cannot resolve

the inﬂuence of different lighting angles (see the sec-

ond row of Figure 7). Within the dataset, this method

performs most adequately with a 51

◦

light elevation

angle; this suggests that the Color-MCNet architec-

ture needs a speciﬁcs light condition for maximum

performance. For the Light-MCNet, we cannot ob-

serve signiﬁcantly better results with respect to abso-

lute RMS errors; for the input images with 51

◦

ele-

vation angle, it performs even worse than the Color-

Color-Light Multi Cascade Network for Single Image Depth Prediction on One Perspective Artifact Images

913

Table 1: Results of the proposed method on three databases (RMSE in mm, lower better). For the Roman Coin dataset, results

are grouped with respect to different light elevation angles.

RRD Temple Roman Coin NYU Depth V2

◦

Color-MCNet 2.53 3.22 1.70 1.06 1.24 1.18 0.557

Light-MCNet 4.43 1.34 1.28 1.25 1.21 1.18 0.720

Adabins (Bhat et al., 2020) 2.43 1.14 0.82 0.95 1.92 2.04 0.364

Color-Light MCNet 2.32 0.89 0.75 0.71 0.68 0.72 0.376

MCNet. However, the results of this architecture are

stable and are not impacted by different angles of in-

cident light. This can also be observed in Figure 7:

the architecture is robust to incident light direction,

but the results lack detail.

AdaBins performs similar to Color-MCNet, but it

produces better details on the results. The depth re-

sults of the Color-Light-MCNet are rich in detail and

robust to varying lighting conditions; even in extreme

light angles (13

◦

and 81

◦

, where the state of the arts

fails to produce consistent outputs (see the two bot-

tom rows of Figure 7 for example). Again, Color-

Light-MCNet outperforms the other tested methods

with respect to RMSE. Given the limited size of the

dataset, these results also suggest that our method is

especially useful in such situations.

(a)

(c)

(b)

(d)

(e)

(f)

Figure 6: Example results for the RRD Temple datasets. a:

RGB image (input), b: ground truth, c: Color-MCNet, d:

Light-MCNet, e: AdaBins, f: Color-Light-MCNet.

Using the NYU Depth V2 dataset, we test the per-

formance of our methods on a public dataset and com-

pare it to multiple states of the art methods. In Ta-

Input Data

Color-MCNet

Light MCNet

AdaBins

Color-Light

MCNet

13 32 51 71 82

° ° ° ° °

Ground truth

Figure 7: Results of the proposed method on the Roman

Coin datasets. Columns represent different elevation angles

of incident light. The ﬁrst row shows the input images, the

subsequent rows show the results obtained from different

architectures.

ble 1, it can be seen that Color-MCNet and Light-

MCNet perform worst, while AdaBins and Color-

Light-MCNet show similar results. TABLE 2 shows a

comparison of Color-Light-MCNet to state of the art

methods on the NYU Depth V2 dataset using several

error metrics, such as RMSE (Equation 1), threshold

(Equation 2), RMSE log (Equation 3), and absolute

Relative Difference (abs.REL) (Equation 4). The state

of the art results given in TABLE 2 are taken from

available scoreboards (Bhat et al., 2020). It can be

observed that on the NYU Depth V2 dataset, our pro-

posed Color-Light-MCNet achieves competitive re-

sults with state of the art in all comparison metrics.

% o f y

s.t. max(

∗

) = δ < thr (2)

RMSE log =

|N|

∑

i=1

||log y

− log y

∗

(3)

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

914

Table 2: Results of the proposed method on NYU Depth V2 compared with other state-of-the-art methods.

Higher better Lower better

δ < 1.25 δ < 1.25

δ < 1.25

RMSE linear RMSE log abs. REL

(Eigen and Fergus, 2015) 77.10% 95.00% 98.80% 0.639 0.215 0.158

(Frisky et al., 2021c) 79.40% 95.50% 99.10% 0.598 0.202 0.145

(Lee et al., 2018) 81.50% 96.30% 99.10% 0.572 0.193 0.139

(Lee and Kim, 2019) 83.70% 97.10% 99.40% 0.538 0.180 0.131

(Lee et al., 2019) 88.50% 97.80% 99.40% 0.392 0.142 0.110

Proposed 92.17% 98.70% 99.50% 0.376 0.098 0.108

(Wu et al., 2019) 93.20% 98.90% 99.70% 0.382 0.050 0.115

(Bhat et al., 2020) 90.30% 98.40% 99.70% 0.364 0.088 0.059

abs. REL =

∑

i=1

− y

∗

(4)

5 CONCLUSION

In this work, the Color-Light-MCNet is proposed

as a new approach for SIDP, speciﬁcally addressing

depth mispredictions arising from variations in mate-

rial color and lighting direction. Evaluations are per-

formed with respect to three datasets: the robustness

of the method to color and light variations is tested

using the RRD Temple dataset and the Roan Coin

dataset, respectively, while a comparison with state

of the art in natural environments is performed using

the public NYU Depth V2 dataset.

Prior to the full Color-Light-MCNet, we test its

two main components individually: the Color-MCNet

designed to produce features robust to surface color

variations, and the Light-MCNet designed to pro-

duce features robust to lighting direction. The Color-

MCNet performs well in the RRD Temple dataset but

fails to resolve the inﬂuence of different lighting an-

gles appearing in the Roman coin dataset; the method

delivers acceptable results only for speciﬁc lighting

directions.

The Light-MCNet introduces intrinsic image de-

composition as a pre-processing step to separate the

input images’ lighting effects and surface reﬂectance.

Together with the original RGB image, the decom-

position results are presented as inputs to the cas-

cade network.This approach proved largely invariant

to lighting direction (Roman Coin dataset), but the re-

sults generally lack detail.

Finally, we combine the two feature maps ob-

tained from Color-MCNet and Light-MCNet into a

single stream to combine the strengths of both ap-

proaches in a single architecture. The resulting Color-

Light-MCNet shows superior results on both the RRD

Temple dataset and the Roman Coins dataset, where

the results exhibit rich details and robustness to varia-

tions in surface color and lighting direction, respec-

tively. For these datasets, our method clearly out-

performs AdaBins (Bhat et al., 2020), the current

state of the art SIDP method. The results make it

evident of the superior performance of our method

with small datasets. On the NYU Depth V2 dataset,

Color-Light-MCNet could not outperform AdaBins

but shows competitive performance with respect to

another state of the art methods. Most of the images

in this work were taken from a frontal view perspec-

tive of the artifact. In the future, more comprehensive

research in SIDP on non-frontal views is needed.

ACKNOWLEDGMENT

This work is funded by a collaboration scheme be-

tween the Ministry of Research and Technology of the

Republic of Indonesia and OeAD-GmbH within the

Indonesian-Austrian Scholarship Program (IASP).

This work is also supported by Type C grant from

Electronics Instrumentation Lab, Universitas Gadjah

Mada.

REFERENCES

Bhat, S. F., Alhashim, I., and Wonka, P. (2020). Ad-

abins: Depth estimation using adaptive bins. Arxiv,

abs/2011.14141.

Brenner, S., Zambanini, S., and Sablatnig, R. (2018). An

investigation of optimal light source setups for photo-

metric stereo reconstruction of historical coins. In Eu-

rographics Workshop on Graphics and Cultural Her-

itage.

Eigen, D. and Fergus, R. (2015). Predicting depth, surface

normals and semantic labels with a common multi-

scale convolutional architecture. In Proceedings of

Color-Light Multi Cascade Network for Single Image Depth Prediction on One Perspective Artifact Images

915

IEEE International Conference on Computer Vision

(ICCV), pages 2650–2658.

Frisky, A. Z. K., Fajri, A., Brenner, S., and Sablatnig, R.

(2020). Acquisition evaluation on outdoor scanning

for archaeological artifact digitalization. In Farinella,

G. M., Radeva, P., and Braz, J., editors, Proceedings

of the 15th International Joint Conference on Com-

puter Vision, Imaging and Computer Graphics The-

ory and Applications, VISIGRAPP 2020, Volume 5:

VISAPP, Valletta, Malta, February 27-29, 2020, pages

792–799. SCITEPRESS.

Frisky, A. Z. K., Harjoko, A., Awaludin, L., Dharmawan,

A., Augoestien, N. G., Candradewi, I., Hujja, R. M.,

Putranto, A., Hartono, T., Suhartono, Y., Zambanini,

S., and Sablatnig, R. (2021a). Registered relief depth

(rrd) borobudur dataset for single-frame depth predic-

tion on one-side artifacts. Data in Brief, 35:106853.

Frisky, A. Z. K., Harjoko, A., Awaludin, L., Zambanini,

S., and Sablatnig, R. (2021b). Investigation of single

image depth prediction under different lighting condi-

tions: A case study of ancient roman coins. J. Comput.

Cult. Herit., 14(4).

Frisky, A. Z. K., Putranto, A., Zambanini, S., and Sablat-

nig, R. (2021c). Mccnet: Multi-color cascade net-

work with weight transfer for single image depth pre-

diction on outdoor relief images. In Del Bimbo, A.,

Cucchiara, R., Sclaroff, S., Farinella, G. M., Mei, T.,

Bertini, M., Escalante, H. J., and Vezzani, R., editors,

Pattern Recognition. ICPR International Workshops

and Challenges, pages 263–278, Cham. Springer In-

ternational Publishing.

Georgopoulos, A., Ioannidis, C., and Valanis, A. (2010).

Assessing the performance of a structured light scan-

ner. International Archives of Photogrammetry,

Remote Sensing and Spatial Information Sciences,

XXXVIII:250–255.

Lee, J., Heo, M., Kim, K., and Kim, C. (2018). Single-

image depth estimation based on fourier domain anal-

ysis. In 2018 IEEE/CVF Conference on Computer Vi-

sion and Pattern Recognition, pages 330–339.

Lee, J. H., Han, M.-K., Ko, D. W., and Suh, I. H. (2019).

From big to small: Multi-scale local planar guid-

ance for monocular depth estimation. arXiv preprint

arXiv:1907.10326.

Lee, J.-H. and Kim, C.-S. (2019). Monocular depth esti-

mation using relative depth maps. In The IEEE Con-

ference on Computer Vision and Pattern Recognition

(CVPR), pages 9729–9738.

Lettry, L., Vanhoey, K., and Van Gool, L. (2018). Unsu-

pervised Deep Single-Image Intrinsic Decomposition

using Illumination-Varying Image Sequences. Com-

puter Graphics Forum, 37(10):409–419.

Ming, Y., Meng, X., Fan, C., and Yu, H. (2021). Deep

learning for monocular depth estimation: A review.

Neurocomputing, 438:14–33.

Oswald, M. R., T

oppe, E., and Cremers, D. (2012). Fast and

globally optimal single view reconstruction of curved

objects. In 2012 IEEE Conference on Computer Vi-

sion and Pattern Recognition, pages 534–541.

Pan, J., Li, L., Yamaguchi, H., Hasegawa, K., Thufail, F. I.,

Mantara, B., and Tanaka, S. (2018). 3D Reconstruc-

tion and Transparent Visualization of Indonesian Cul-

tural Heritage from a Single Image. In Eurographics

Workshop on Graphics and Cultural Heritage, pages

207–210. The Eurographics Association.

Pollefeys, M., Van Gool, L., Vergauwen, M., Cornelis, K.,

Verbiest, F., and Tops, J. (2001). Image-based 3d ac-

quisition of archaeological heritage and applications.

In VAST 01: Proceedings of the 2001 conference on

Virtual Reality, Archeology, and Cultural Heritage,

pages 255–262.

Ruo Zhang, Ping-Sing Tsai, Cryer, J. E., and Shah, M.

(1999). Shape-from-shading: a survey. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence,

21(8):690–706.

Saxena, A., Sun, M., and Ng, A. Y. (2009). Make3d: Learn-

ing 3d scene structure from a single still image. IEEE

Transactions on Pattern Analysis and Machine Intel-

ligence, 31(5):824–840.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012).

Indoor segmentation and support inference from rgbd

images. In Fitzgibbon, A., Lazebnik, S., Perona, P.,

Sato, Y., and Schmid, C., editors, ECCV 2012, pages

746–760, Berlin, Heidelberg. Springer Berlin Heidel-

berg.

Song, M., Lim, S., and Kim, W. (2021). Monocular depth

estimation using laplacian pyramid-based depth resid-

uals. IEEE Transactions on Circuits and Systems for

Video Technology, pages 1–1.

Wu, Y., Boominathan, V., Chen, H., Sankaranarayanan, A.,

and Veeraraghavan, A. (2019). Phasecam3d — learn-

ing phase masks for passive single view depth esti-

mation. In 2019 IEEE International Conference on

Computational Photography (ICCP), pages 1–12.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

916