A Review of Methods for Estimating Depth Based on Various
Application Scenarios
Siqi Huang
Engineering College, Shantou University, Shantou, China
Keywords: Depth Estimation, 2D Image, Stereo Image, SPIdepth, HiMODE.
Abstract: In computer vision, depth estimation is a crucial task: it measures the distance between each pixel
and the camera. The accuracy and efficiency of depth estimation have improved significantly with the rise
of deep learning. Depth estimation serves many distinct application scenarios, and depending on the needs
of each scenario the task can be broadly classified into two categories: 2D image depth estimation and stereo
image depth estimation. In this paper, the 2D image depth estimation methods HR-Depth, HybridDepth, and
SPIdepth and the stereo image depth estimation methods UniFuse, NLFB, PanoFormer, OmniFusion, and
HiMODE are introduced and analysed. In addition, the 2D image datasets NYU-Depth V2 and KITTI and
the stereo image datasets Stanford3D and Matterport3D are introduced in detail, and the effectiveness of
these two kinds of approaches is examined and compared using widely used evaluation indices. It is found
that HybridDepth and SPIdepth perform well in 2D image depth estimation, while NLFB and HiMODE
perform better in stereo image depth estimation. Future research in depth estimation may focus more on
stereo images, because stereo images usually contain richer depth information, which allows for higher
accuracy in depth estimation.
1 INTRODUCTION
Depth estimation becomes a crucial element as
computer vision advances from basic image
recognition to intricate scene comprehension.
Using a mapping relationship from 2D image cues to
3D depth, the objective is to extract depth cues from
the image and calculate each pixel's distance from the
camera. In order to achieve applications in different
scenes, image datasets of different scene types such
as NYU-Depth, KITTI, Stanford3D, Matterport3D,
etc., have been constructed, which have driven the
research on depth estimation. As a result, depth
estimation has found extensive use in fields like
autonomous driving, robotics, VR, AR, 3D
reconstruction, and security monitoring.
As the application scenarios become wider and
wider, the depth estimation task is gradually divided
into two different types: 2D image depth estimation
and stereo image depth estimation. Using 2D images
without direct depth information, 2D image depth
estimation seeks to determine how to create a
mapping relationship between 2D image cues and 3D
depth in order to estimate each pixel's distance from
the camera, whereas stereo image depth estimation is
the study of how to extract effective depth cues from
images containing 360-degree scene information in
the presence of aberrations or distortions, etc., in
order to obtain depth information.
Focusing on the above two different types of
depth estimation research, this paper analyses and
summarises their research progress and current status,
collects different types of research data, and classifies
and analyses them. Finally, it provides an outlook on
the research development and application of depth
estimation.
2 METHOD
2.1 Depth Estimation Methods For 2D
Images
The Eigen team was the first to use CNN for 2D
image depth estimation in 2014, and since then,
numerous researchers have improved it in various
ways based on this. Meanwhile, unsupervised deep
learning methods have emerged, such as training the
network by generating new viewpoint maps or using
left and right views and video information to solve the
depth estimation problem. Supervised means that
there is a high dependence on training data with
accurate depth labels, whereas unsupervised means
that there is no need for depth-labelled data.
In 2020, HR-Depth was presented in a study by
Xiaoyang Lyu et al. This approach, which focuses on
optimizing the depth estimation effect through two
methodologies, is an enhanced deep network for
high-resolution self-supervised monocular depth
estimation. On the one hand, the skip connections in
DepthNet are redesigned, adding intermediate nodes
between the original encoder and decoder nodes to
aggregate features, so that the decoder can obtain
high-resolution features with richer semantic
information and thus predict depth map boundaries
more accurately; on the other hand, the feature fusion
squeeze-and-excitation (fSE) module is proposed,
which fuses features through global average pooling,
fully-connected layers and 1×1 convolution,
improving network performance while reducing the
number of parameters. In addition, a lightweight
network, Lite-HR-Depth, based on MobileNetV3 was constructed
and knowledge distillation techniques were applied to
further improve its accuracy. The benefits of this
approach are substantial: when using ResNet-18 as
the encoder, HR-Depth performs better than prior
state-of-the-art techniques with the fewest parameters
at both high and low resolutions; Lite-HR-Depth
contains only 3.1M parameters, yet achieves
comparable or even better performance than
Monodepth2 at high resolution; by redesigning the
skip connections and introducing the fSE module,
the accuracy of depth estimation in large-gradient
regions is effectively improved and sharper edges can
be predicted; and high-resolution depth estimation is
analysed in depth, which provides a theoretical basis
and practical reference for subsequent studies
(Lyu et al., 2021).
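To make the role of the fSE module more concrete, the following PyTorch-style sketch shows a squeeze-and-excitation fusion step of the kind described above (global average pooling, fully-connected layers and a 1×1 convolution). The module name, channel sizes and exact layer ordering are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FeatureFusionSE(nn.Module):
    """Illustrative squeeze-and-excitation style fusion block (not the official fSE code)."""
    def __init__(self, in_channels: int, out_channels: int, reduction: int = 16):
        super().__init__()
        # 1x1 convolution cheaply fuses the concatenated decoder/skip features
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # squeeze: global average pooling; excitation: two fully-connected layers
        self.excite = nn.Sequential(
            nn.Linear(out_channels, out_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_channels // reduction, out_channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(x)                           # (B, C, H, W)
        squeezed = fused.mean(dim=(2, 3))              # global average pooling -> (B, C)
        weights = self.excite(squeezed)[..., None, None]
        return fused * weights                         # channel-wise re-weighting

# usage: fuse concatenated high- and low-resolution decoder features
features = torch.cat([torch.randn(1, 64, 96, 320), torch.randn(1, 64, 96, 320)], dim=1)
out = FeatureFusionSE(in_channels=128, out_channels=64)(features)
```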
In 2024, HybridDepth was presented in a study by
Ashkan Ganj et al. The method aims to address the
problems of scale ambiguity, hardware heterogeneity
and generalisability in depth estimation by fusing
focused stack information and single-image prior for
robust metric depth estimation. The process logic is
as follows: first, the relative depth map and the metric
depth map are generated using the single-image
relative depth estimator and the DFF metric depth
estimator, respectively; then, the relative depth map
is converted into a global-scale depth map by global
scale and shift alignment, and a dense scale map is
constructed; finally, the global-scale depth map is
corrected at the pixel level using a deep-learning-based
refinement layer that combines the scale map with an
uncertainty map from the DFF module to obtain the
final depth map. This method has significant advantages,
surpassing existing methods on several datasets, such
as on the DDFF12 and NYU Depth V2 datasets, with
significant improvement in metrics such as RMSE
compared to specific SOTA models; strong zero-shot
generalisation capability, with excellent
performance on the ARKitScenes and Mobile Depth
datasets; a compact model structure, with an
inference time of only 20 ms and a size of 240 MB,
which makes it more suitable for mobile-device
deployment than other models while maintaining
accuracy; and it effectively solves the scale ambiguity
problem of single-image depth estimation, with depth
estimates that are more accurate and consistent under
different zoom levels (Ganj et al., 2024).
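The global scale and shift alignment step can be illustrated with a small least-squares sketch: a relative depth map is aligned to metric depth (e.g., from the DFF branch) by solving for a single scale and shift. The variable names and the plain least-squares solve are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def align_scale_shift(relative_depth: np.ndarray,
                      metric_depth: np.ndarray,
                      valid_mask: np.ndarray) -> np.ndarray:
    """Fit metric ~ s * relative + t over valid pixels (least squares), then apply globally."""
    r = relative_depth[valid_mask].ravel()
    m = metric_depth[valid_mask].ravel()
    # design matrix [r, 1] for the two unknowns (scale s, shift t)
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * relative_depth + t

# toy usage: align a synthetic relative map to a noisy metric reference
rel = np.random.rand(240, 320)
met = 3.0 * rel + 0.5 + 0.01 * np.random.randn(240, 320)
mask = np.ones_like(rel, dtype=bool)
aligned = align_scale_shift(rel, met, mask)
```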
In 2024, a study by Mykola Lavreniuk introduced
SPIdepth for self-supervised monocular depth
estimation. The primary flow of this technique,
which aims to increase depth estimation accuracy by
strengthening the pose network, is as follows: use
DepthNet to extract visual characteristics from a
single RGB image, then use the encoder-decoder
framework and the pretrained ConvNeXt backbone to
generate the depth map; in order to align the depth
map and the reference image during view
synthesis, PoseNet is utilized to estimate the
relative poses between the input and the reference
image. The Self Query Layer is used to capture the
geometric cues for depth estimation, compute the
depth intervals and generate the final depth maps
through probabilistic linear combinations; both
DepthNet and PoseNet are optimised for training, and
the interference regions are filtered by an automatic
masking strategy, which is combined with various
loss functions to improve the model performance.
The benefits are significant, demonstrating excellent
performance on datasets such as KITTI, Cityscapes
and Make3D, with accuracy surpassing many other
methods, strong generalisation ability, excellent
performance in dealing with dynamic scenes and
zero-shot evaluations, and high efficiency in
inference using only a single image, with a
lightweight model design that is easy to integrate into
various types of systems (Lavreniuk, 2024).
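The depth-bin idea behind the Self Query Layer can be sketched as follows: per-pixel probabilities over a set of depth bins are combined linearly to produce a continuous depth value. The bin layout and tensor shapes below are illustrative assumptions, not SPIdepth's actual configuration.

```python
import torch

def depth_from_bins(bin_probs: torch.Tensor, bin_centers: torch.Tensor) -> torch.Tensor:
    """
    bin_probs:   (B, K, H, W) softmax probabilities over K depth bins per pixel.
    bin_centers: (B, K) depth value (in metres) represented by each bin.
    Returns a (B, 1, H, W) depth map as the probability-weighted sum of bin centres.
    """
    return torch.einsum("bkhw,bk->bhw", bin_probs, bin_centers).unsqueeze(1)

# toy usage with 64 bins spread between 0.5 m and 80 m
B, K, H, W = 1, 64, 96, 320
logits = torch.randn(B, K, H, W)
probs = logits.softmax(dim=1)
centers = torch.linspace(0.5, 80.0, K).expand(B, K)
depth = depth_from_bins(probs, centers)   # (1, 1, 96, 320)
```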
2.2 Depth Estimation Methods For
Stereo Images
In 2021, UniFuse was proposed by Hualie Jiang et al.
for 360° stereo depth estimation. This method aims
at fusing the features of equirectangular projection
(ERP) and cubemap projection (CMP) to enhance
depth estimation. The process is as follows: firstly,
the input panoramic image is converted to ERP and
CMP representations respectively, and the CMP
features are reprojected to the ERP grid through C2E;
a U-Net is used as the baseline network, and the
processed CMP features are unidirectionally fused
into the ERP features at the skip connections of the
decoding stage. This is achieved by the designed CEE
fusion module, which first applies residual
modulation to the CMP features to reduce their
discontinuities and then uses an SE block to
adaptively adjust the importance of the channels;
finally, an equirectangular depth map is output from
the fused features. This approach has
several noteworthy benefits: State-of-the-art
performance is attained on the four frequently used
datasets, and the benefit is evident on the largest real
dataset Matterport3D; the CEE fusion module is
made to better fuse the two projected features and
reduce the number of parameters; the proposed one-
way fusion framework fuses the CMP features to the
ERP branch only in the decoding stage, which is more
efficient than two-way fusion; the model complexity
is low: UniFuse based on ResNet-18 adds little
complexity but gains significant performance
compared to BiFuse, and maintains real-time, good
performance on MobileNetV2; and generalisation
ability is strong, outperforming BiFuse when
transferring between different datasets and
generating reasonable depths in regions with no
ground-truth depth (Jiang et al., 2021).
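A minimal sketch of the unidirectional fusion idea is given below, assuming the CMP features have already been reprojected onto the ERP grid; the layer choices are illustrative rather than the official CEE module.

```python
import torch
import torch.nn as nn

class UnidirectionalFusion(nn.Module):
    """Illustrative one-way CMP->ERP fusion: residual modulation, SE weighting, then addition."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # residual modulation softens cube-face discontinuities in the reprojected CMP features
        self.modulate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # SE block re-weights the channels of the modulated CMP features
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, erp_feat: torch.Tensor, cmp_feat_on_erp: torch.Tensor) -> torch.Tensor:
        modulated = cmp_feat_on_erp + self.modulate(cmp_feat_on_erp)
        weighted = modulated * self.se(modulated)
        return erp_feat + weighted  # CMP information flows into the ERP branch only

fused = UnidirectionalFusion(64)(torch.randn(1, 64, 64, 128), torch.randn(1, 64, 64, 128))
```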
In 2021, NLFB was proposed by Ilwi Yun et al. to
improve 360° monocular depth estimation. The
method consists of three main parts:
first, proposing a self-supervised learning method
using only gravity-aligned videos, which reduces the
dependence on depth data by mining the relationship
between the depths of consecutive scenes, and
constructing image, depth, and pose consistency loss
to train the model; secondly, using a jointly
supervised and self-supervised learning approach that
uses supervised learning to compensate for the
shortcomings of self-supervised learning in areas
such as reflective surfaces, and self-supervised
learning to improve the supervised learning features
and increase the model's ability to adapt to unseen
data; thirdly, the network can preserve global
information when reconstructing depth by creating a
non-local fusion block (NLFB), which applies non-
local operations to the features entering the fusion
block. The advantages of this method are significant:
Transformer is successfully applied to 360° depth
estimation, which outperforms previous methods on
multiple benchmark datasets and reaches the current
optimal level; the inaccuracy of supervised learning's
predictions because of data scarcity and the erratic
performance of self-supervised learning are both
successfully improved by the joint learning approach;
the non-local fusion block better preserves the global
information and improves the accuracy; the self-
supervised learning part requires only gravity-aligned
videos, which reduces the dependence on depth data
and shows advantages in the comparison with other
self-supervised learning methods (Yun et al., 2022).
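The non-local operation behind such a fusion block can be sketched as a self-attention-like aggregation over all spatial positions (in the style of a standard non-local block); the exact placement inside NLFB and the layer sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NonLocalBlock2d(nn.Module):
    """Illustrative non-local (embedded-Gaussian) block: every position attends to all others."""
    def __init__(self, channels: int, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inner, 1)     # key embedding
        self.g = nn.Conv2d(channels, inner, 1)       # value embedding
        self.out = nn.Conv2d(inner, channels, 1)     # project back to input channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        k = self.phi(x).flatten(2)                        # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)               # pairwise affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                            # residual connection keeps local detail

out = NonLocalBlock2d(64)(torch.randn(1, 64, 32, 64))
```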
In 2022, Shen et al. proposed the PanoFormer
model, which aims to solve the problem of depth
estimation of indoor 360° panoramic images. It
starts by designing a hierarchically structured
network architecture in which the input stems
perform the initial processing of the image, followed
by the encoder and decoder progressively extracting
and reducing the features through multiple
hierarchical stages, where the key lies in the
collaborative work of the positional embedding, PST
blocks and convolutional layers included in each
stage. In terms of feature processing, a pixel-level
patch-dividing method is innovatively adopted to
finely divide the input features and manually create
tokens, which helps the network to capture more
detailed features, differentiating it from the traditional
Vision Transformer processing. In order to cope with
the distortion problem of panoramic images, the
relative position embedding is implemented using the
Spherical Token Locating Model (STLM), which
effectively reduces the effect of distortion on depth
estimation by transforming operations between
different domains. Meanwhile, to improve the
perception of the panoramic geometric structure and
provide important information for depth estimation,
the enhanced panoramic self-attention mechanism
with token flow, which is based on the classic Vision
Transformer block, allows the final position of the
tokens to be determined by the initialization and
learnable flow during the computation of the attention
scores, token flow, and resampled features. In the
supervised part of model training, the objective
function is designed by combining the inverse Huber
(Berhu) loss and gradient loss, and the loss value is
calculated according to a specific formula, which
drives the model to be optimized continuously.
Furthermore, two panorama-specific metrics, the left-
right consistency error (LRCE) and the polar root
mean square error (P-RMSE), are suggested in light
of the features of panoramic images. The P-RMSE
focuses on measuring the depth estimation accuracy
of the polar regions, while the LRCE is used to assess
the depth consistency of the left and right boundaries,
and together, they provide a comprehensive and
tailored evaluation criterion for panoramic depth
estimation to ensure the effectiveness and accuracy of
the model in the panoramic image depth estimation
task (Shen et al., 2022).
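The supervised objective mentioned above combines the inverse Huber (BerHu) loss with a gradient term; a minimal sketch of the BerHu part is given below. The threshold rule of 0.2 times the maximum absolute error is a commonly used choice assumed here, not a value quoted from the paper.

```python
import torch

def berhu_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Inverse Huber (BerHu): L1 for small residuals, scaled L2 for large ones."""
    diff = (pred - target).abs()[mask]
    c = 0.2 * diff.max() + 1e-6               # assumed threshold: 20% of the largest error
    l1 = diff                                  # linear branch, |d| <= c
    l2 = (diff ** 2 + c ** 2) / (2 * c)        # quadratic branch, |d| > c
    return torch.where(diff <= c, l1, l2).mean()

# toy usage on a 256x512 equirectangular depth map with a validity mask
pred = torch.rand(1, 1, 256, 512) * 10
gt = torch.rand(1, 1, 256, 512) * 10
loss = berhu_loss(pred, gt, gt > 0)
```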
In 2022, Li et al. proposed the OmniFusion
method for 360-degree monocular depth estimation.
Given that spherical distortion in 360° images affects
depth estimation, the method first converts the ERP
input image into a set of tangent images using
spherical projection; since these tangent images are
free of distortion, their depth maps can be predicted
with a conventional CNN architecture. At the same
time, considering that the independence of tangent
image prediction of depth can cause problems, a
geometric embedding network is introduced to
combine pixel sphere coordinates with image centre
coordinates to provide geometric information, which
is fused with image features at an early stage of the
encoder to enhance depth consistency. To
compensate for the lack of overall information
brought by decomposing ERP, Transformer is used to
globally aggregate the patch features: after
convolutional dimensionality reduction and
flattening, positional embeddings are added and the
features are adjusted by self-attention. In addition,
the network predicts confidence maps and merges the
per-patch depths in a weighted-average manner, and
a regression layer is added to the decoder to transform
the domain. In order to enhance the quality of depth
estimation and successfully address the issue of 360-
image depth estimation, an iterative depth refinement
approach is also created to enhance the geometric
embedding based on the iteratively updated depth
information (Li et al., 2022).
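A confidence-weighted merge of overlapping per-patch predictions can be sketched as below; accumulating depth times confidence and normalising by the accumulated confidence is an assumed, generic formulation rather than OmniFusion's exact implementation.

```python
import numpy as np

def merge_depths(depths: np.ndarray, confidences: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """
    depths:      (N, H, W) per-patch depth predictions reprojected onto a common ERP grid
                 (invalid pixels set to 0).
    confidences: (N, H, W) per-pixel confidence for each prediction (0 where invalid).
    Returns the confidence-weighted average depth on the ERP grid.
    """
    weighted_sum = (depths * confidences).sum(axis=0)
    total_conf = confidences.sum(axis=0)
    return weighted_sum / (total_conf + eps)

# toy usage: merge 18 tangent-patch predictions on a 256x512 ERP grid
d = np.random.rand(18, 256, 512) * 8
c = np.random.rand(18, 256, 512)
fused = merge_depths(d, c)
```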
In 2022, HiMODE was proposed by Masum Shah
Junayed et al. as an innovative method for depth
estimation of 360-degree panoramic images. The
technique, which is based on a CNN+Transformer
hybrid architecture, attempts to efficiently address the
issues of data loss and distortion in the depth
estimation of panoramic images. Architecturally, it
employs a depthwise separable convolution CNN
backbone network combined with a feature pyramid
network that is capable of extracting high-resolution
features near the edges, thereby reducing image
distortion and artefacts. The Transformer module, on
the other hand, plays an important role, with an
encoder that enhances the ability to encode depth
features by capturing the relationships between pixels
and global information in the image through self-
attention and cross-attention mechanisms; and a
spatial and temporal patch (STP) and a multi-head
self-attention (MHSA) layer in the decoder, which
can process and recover encoded features to generate
accurate depth maps. In addition, the method
introduces linear projection and positional coding,
where the feature maps extracted by the CNN are
appropriately processed to fit the input requirements
of the Transformer, and positional coding is utilised
to enhance the understanding of the image features.
By lowering the number of parameters and
computational expenses, the spatial residual block
(SRB) setting contributes to the system's increased
stability and performance. Conversely, the contextual
adjustment layer makes up for the lack of depth data
and increases the accuracy of depth estimation by
combining the feature maps that the CNN retrieved
with the depth maps that the Transformer produced.
The advantages of HiMODE include the following:
firstly, it achieves excellent performance on multiple
datasets and is able to estimate the depth map
accurately, especially in recovering surface details
and processing complex scenes. Second, the
technique is very flexible, robust to the size of the
input image, and can be used and trained on datasets
of various sizes. In addition, HiMODE's hybrid
architecture combines the advantages of CNN and
Transformer to provide an effective solution for depth
estimation of 360-degree panoramic images with a
wide range of applications (Junayed et al., 2022).
3 EXPERIMENTS
3.1 Datasets
NYU-Depth V2 is a classical dataset focusing on
indoor scene understanding, with simultaneous
acquisition of colour (RGB) images and depth
(distance) information by Microsoft Kinect sensors,
containing 1449 pairs of densely annotated RGB and
depth images (with each object annotated with a
category and an instance number), as well as more
than 400,000 frames of raw, unlabelled video
covering 464 diversified indoor scenes (e.g., homes,
offices) from three cities. The dataset is
divided into three parts: Labeled (preprocessed
labelled data with complete depth information), Raw
(raw sensor data), and a supporting toolkit, which is
widely used to train models for depth estimation,
semantic segmentation, 3D reconstruction, etc., and
is an important research foundation for robotics,
AR/VR, and other fields.
KITTI is a reputable dataset in the field of
autonomous driving that was jointly launched by the
Karlsruhe Institute of Technology (KIT) in Germany
and the Toyota Technological Institute at Chicago
(TTIC). It contains rich data gathered by a range of
sensors (such as RGB cameras, stereo cameras, and
3D LiDAR) in real traffic scenarios, covering
complex environments such as cities and rural roads.
Although its raw
data does not provide fine annotations for semantic
segmentation (e.g., pixel-level classification), the
researchers manually annotated some of the images
for different tasks (e.g., road detection, object
recognition) involving categories such as roads,
vehicles, pedestrians, and sky. This dataset has
become a core test benchmark for autonomous
driving algorithm development (e.g., environment
perception, depth estimation, visual localisation) due
to its realistic scenarios and diverse data.
The Stanford3D dataset is an important 3D
computer vision dataset, which is mainly collected
from indoor environments, covering several different
scenes such as offices, bedrooms, and so on. The
dataset contains rich data content, including a large
number of 3D models in the form of triangular
meshes, accompanied by information such as
materials and colours, and the corresponding RGB
images and depth images. At the same time, the
dataset also provides detailed annotation information,
including object category annotation, geometric
annotation (e.g., 3D coordinates, dimensions, shapes,
etc.) and semantic annotation (e.g., object functions,
uses, etc.). This annotation information provides
important support for tasks such as 3D model
reconstruction, object recognition and classification,
and scene understanding and interaction, making the
Stanford3D dataset widely valuable for research and
uses in the domain of computer vision.
A sizable RGB-D dataset for interior scene
comprehension is called Matterport3D. It contains
10,800 panoramic views built from 194,400 RGB-D
images of 90 real building-scale scenes. Each scene
is a residential building with several rooms and
floors, annotated with surface reconstructions,
camera poses, and semantic segmentation.
3.2 Evaluation Metrics
Indicators of the error or deviation of a prediction
from the true value include Absolute Relative Error
(Abs Rel), Squared Relative Error (Sq Rel), Root
Mean Square Error (RMSE), and threshold accuracy
(δ < 1.25, δ < 1.25², δ < 1.25³). Abs Rel is the average
of the absolute relative errors between the predicted
and true values; the lower the value, the better the
model performs and the closer the prediction is to the
true value. Sq Rel is the average of the squared
relative errors between the predicted and true values;
again, the smaller the value, the better the model
performance. RMSE is the square root of the mean
squared difference between the predicted and true
values; the smaller the value, the closer the prediction
is to the true value and the better the model's
predictive ability. Threshold accuracy quantifies how
far the predicted depth deviates from the true depth:
it is the percentage of pixels for which the ratio
between the predicted and true depth falls within a
given threshold; the higher the percentage, the more
accurate the depth estimate under that threshold.
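For concreteness, a minimal sketch of how these metrics are typically computed over valid pixels is shown below; masking and averaging conventions vary slightly between papers, so this is an illustrative implementation rather than the evaluation code used by the methods surveyed here.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> dict:
    """Compute Abs Rel, Sq Rel, RMSE and threshold accuracies over valid (masked) pixels."""
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    sq_rel = np.mean(((p - g) ** 2) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)          # symmetric ratio used for threshold accuracy
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }

# toy usage
gt = np.random.rand(240, 320) * 10 + 0.5
pred = gt * (1 + 0.05 * np.random.randn(240, 320))
print(depth_metrics(pred, gt, gt > 0))
```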
3.3 Experiment Data
Table 1: Experimental data on depth estimation methods for 2D images
Dataset        Method          Abs Rel   RMSE    δ<1.25   δ<1.25²   δ<1.25³
NYU-Depth V2   HybridDepth     0.026     0.128   0.988    1.000     1.000
               Metric3Dv2      0.047     0.183   0.989    0.998     1.000
               Marigold        0.055     0.224   0.964    0.991     0.998
               Depth Anything  0.056     0.206   0.984    0.998     1.000
               UniDepth        0.058     0.201   0.984    0.997     0.999
KITTI          Monodepth2      0.115     4.701   0.879    0.961     0.982
               PackNet-SfM     0.107     4.538   0.889    0.962     0.981
               HR-Depth        0.104     4.410   0.894    0.966     0.984
               GEDepth         0.048     2.044   0.976    0.997     0.999
               EVP             0.048     2.015   0.980    0.998     1.000
               SQLdepth        0.043     1.698   0.983    0.998     0.999
               LightedDepth    0.041     1.748   0.989    0.998     0.999
               SPIdepth        0.029     1.394   0.990    0.999     1.000
As shown in Table 1, for each dataset the lowest Abs
Rel and RMSE values and the highest threshold
accuracies indicate that, among current 2D image
depth estimation methods, HybridDepth and SPIdepth
perform best and give the highest prediction accuracy.
Table 2: Experimental data on depth estimation methods for stereo images
Dataset        Method        Abs Rel   Sq Rel   RMSE     δ<1.25   δ<1.25²   δ<1.25³
Stanford3D     HoHoNet       0.0901    0.0593   0.4132   0.9047   0.9762    0.9933
               Omnidepth     0.1009    0.0522   0.3835   0.9114   0.9855    0.9958
               BiFuse        0.1214    0.1019   0.5396   0.8568   0.9599    0.9880
               SvSyn         0.1003    0.0492   0.3614   0.9096   0.9822    0.9949
               NLFB          0.0649    0.0240   0.2776   0.9665   0.9948    0.9983
               PanoFormer    0.0405    -        0.3083   0.9394   0.9838    0.9941
               UniFuse       0.1114    -        0.3691   0.8711   0.9664    0.9882
               OmniFusion    0.0950    0.0491   0.3474   0.8988   0.9769    0.9924
               HiMODE        0.0532    0.0207   0.2619   0.9711   0.9965    0.9989
Matterport3D   SvSyn         0.1063    0.0599   0.4062   0.8984   0.9773    0.9934
               Omnidepth     0.1136    0.0671   0.4438   0.8795   0.9795    0.9950
               HoHoNet       0.0671    0.0417   0.3416   0.9415   0.9838    0.9942
               BiFuse        0.1330    0.1359   0.6277   0.8381   0.9444    0.9815
               NLFB          0.0700    0.0287   0.3032   0.9599   0.9938    0.9982
               PanoFormer    -         -        0.3635   0.9184   0.9804    0.9916
               UniFuse       0.1063    -        0.4941   0.8897   0.9623    0.9831
               OmniFusion    0.0900    0.0552   0.4261   0.9189   0.9797    0.9931
               HiMODE        0.0658    0.0245   0.3067   0.9608   0.9940    0.9985
As shown in Table 2, for each dataset the lowest Abs
Rel, Sq Rel and RMSE values and the highest
threshold accuracies indicate that HiMODE, NLFB,
and PanoFormer perform better than the other recent
stereo image depth estimation techniques, with higher
model prediction accuracy.
4 DISCUSSION
Both 2D image depth estimation and stereo image
depth estimation have their own shortcomings. 2D
image depth estimation faces difficulties in data
acquisition and annotation; models handle complex
scenes, lighting changes and object occlusion poorly;
and feature extraction is often insufficient, while
model complexity and computational cost remain
high. Future research directions
include data enhancement, model optimisation, and
application scenario expansion, such as synthetic
data, innovative neural network structure, and fusion
of multimodal information. Stereo image depth
estimation, on the other hand, is more complex in data
acquisition and processing, and the model has
challenges in edge and detail capture, global and local
information fusion, and special scene processing, as
well as high model complexity and computational
cost. Future research can be carried out by optimising
the dataset, improving the model structure, fusing
multi-scale and multi-modal information, and
improving the real-time performance for better
application in augmented reality, virtual reality,
intelligent robotics and autonomous driving.
5 CONCLUSION
A crucial task in computer vision, depth estimation
measures the distance between each pixel in an image
and the camera. This information is vital for many
applications, including robotics, autonomous driving,
VR, AR, 3D reconstruction, security monitoring, and
more. It also has significant research implications.
The accuracy and efficiency of depth estimation have
improved significantly with the advent of deep
learning, and it is currently primarily separated into
two categories: stereo image depth estimation and 2D
image depth estimation. In 2D image depth
estimation, HR-Depth optimises depth estimation by
redesigning jump connections and introducing fSE
modules, HybridDepth fuses focused stack
information and single-image priors to solve scale
ambiguity, and SPIdepth strengthens the pose
network to improve accuracy, which have shown
excellent performance on multiple datasets. In stereo
image depth estimation, UniFuse fuses different
projection features, NLFB combines self-supervised
learning with non-local fusion blocks, PanoFormer
adopts innovative network architectures and
strategies to deal with distortion, OmniFusion solves
the problem of spherical distortion by using spherical
projection and Transformer, and HiMODE reduces
distortion and data loss by using a hybrid CNN +
Transformer architecture. Each of these approaches
has its own advantages in depth
estimation of complex scenes. However, at present,
2D image depth estimation suffers from the problems
of difficult data acquisition and annotation, limited
ability of the model to deal with complex scenes,
insufficient feature extraction, and high model
complexity and high computational cost, etc. Stereo
image depth estimation also faces the challenges of
data acquisition and processing, edge detail capture,
global and local information fusion, special scene
response, and model complexity and computational
cost. In the future, depth estimation research can be
carried out in various aspects, such as at the data level,
solving data-related problems by synthesising data
and optimizing datasets; at the model level,
innovating neural network structures, simplifying
models, and fusing multi-scale and multi-modal
information; at the application level, further
expanding to augmented reality, virtual reality,
intelligent robotics, and automated driving scenarios,
and improving real-time performance and accuracy,
in order to promote the wider use and development of
depth estimation technology in more fields.
Meanwhile, since
stereo images usually contain richer depth
information, such methods may become more and
more popular in the research of depth estimation.
REFERENCES
Cheng, B., Yu, Y., Zhang, L., et al., 2024. Depth estimation
of self-supervised monocular dynamic scene based on
deep learning. Journal of Remote Sensing, 28(9).
Ganj, A., Su, H., Guo, T., 2024. HybridDepth: Robust
Metric Depth Fusion by Leveraging Depth from Focus
and Single-Image Priors. arXiv preprint
arXiv:2407.18443.
Godard, C., Mac Aodha, O., Firman, M., et al., 2019.
Digging into self-supervised monocular depth
estimation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pp.3828-3838.
Guizilini, V., Ambrus, R., Pillai, S., et al., 2019. Packnet-
sfm: 3d packing for self-supervised monocular depth
estimation. arXiv preprint arXiv:1905.02693.
Hu, M., Yin, W., Zhang, C., et al., 2024. Metric3d v2: A
versatile monocular geometric foundation model for
zero-shot metric depth and surface normal estimation.
IEEE Transactions on Pattern Analysis and Machine
Intelligence.
Jiang, H., Sheng, Z., Zhu, S., et al., 2021. Unifuse:
Unidirectional fusion for 360 panorama depth
estimation. IEEE Robotics and Automation Letters,
6(2), pp.1519-1526.
Junayed, M.S., Sadeghzadeh, A., Islam, M.B., et al., 2022.
HiMODE: A hybrid monocular omnidirectional depth
estimation model. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition, pp.5212-5221.
Ke, B., Obukhov, A., Huang, S., et al., 2024. Repurposing
diffusion-based image generators for monocular depth
estimation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition, pp.9492-9502.
Lavreniuk, M., 2024. SPIdepth: Strengthened Pose
Information for Self-supervised Monocular Depth
Estimation. arXiv preprint arXiv:2404.12501.
Lavreniuk, M., Bhat, S.F., Müller, M., et al., 2023. EVP:
Enhanced visual perception using inverse multi-
attentive feature refinement and regularized image-text
alignment. arXiv preprint arXiv:2312.08548.
Li, Y., Guo, Y., Yan, Z., et al., 2022. Omnifusion: 360
monocular depth estimation via geometry-aware
fusion. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp.2801-
2810.
Lyu, X., Liu, L., Wang, M., et al., 2021. Hr-depth: High
resolution self-supervised monocular depth estimation.
In Proceedings of the AAAI Conference on Artificial
Intelligence, 35(3), pp.2294-2301.
Piccinelli, L., Yang, Y.H., Sakaridis, C., et al., 2024.
UniDepth: Universal monocular metric depth
estimation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition, pp.10106-10116.
Shen, Z., Lin, C., Liao, K., et al., 2022. PanoFormer:
Panorama transformer for indoor 360 depth
estimation. In European Conference on Computer
Vision, Cham: Springer Nature Switzerland, pp.195-
211.
Sun, C., Sun, M., Chen, H.T., 2021. Hohonet: 360 indoor
holistic understanding with latent horizontal features. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp.2573-
2582.
Wang, F.E., Yeh, Y.H., Sun, M., et al., 2020. Bifuse:
Monocular 360 depth estimation via bi-projection
fusion. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp.462-471.
Wang, Y., Liang, Y., Xu, H., et al., 2024. Sqldepth:
Generalizable self-supervised fine-structured
monocular depth estimation. In Proceedings of the
AAAI Conference on Artificial Intelligence, 38(6),
pp.5713-5721.
Yang, L., Kang, B., Huang, Z., et al., 2024. Depth anything:
Unleashing the power of large-scale unlabeled data. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp.10371-
10381.
Yang, X., Ma, Z., Ji, Z., et al., 2023. Gedepth: Ground
embedding for monocular depth estimation. In
Proceedings of the IEEE/CVF International Conference
on Computer Vision, pp.12719-12727.
Yun, I., Lee, H.J., Rhee, C.E., 2022. Improving 360
monocular depth estimation via non-local dense
prediction transformer and joint supervised and self-
supervised learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, 36(3), pp.3224-
3233.
Zioulis, N., Karakottas, A., Zarpalas, D., et al., 2019.
Spherical view synthesis for self-supervised 360 depth
estimation. In 2019 International Conference on 3D
Vision (3DV), IEEE, pp.690-699.
Zhu, S., Liu, X., 2023. LightedDepth: Video depth
estimation in light of limited inference view angles. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp.5003-
5012.