4 DISCUSSION
Both 2D image depth estimation and stereo image depth estimation have their own shortcomings. For 2D image depth estimation, data acquisition and annotation are difficult; models handle complex scenes, lighting changes, and object occlusion poorly; feature extraction is often insufficient; and model complexity and computational cost are high. Future research directions include data augmentation (e.g., synthetic data), model optimisation (e.g., innovative neural network structures), and expansion of application scenarios (e.g., fusion of multimodal information). Stereo image depth estimation, on the other hand, involves more complex data acquisition and processing, and its models face challenges in capturing edges and fine details, fusing global and local information, and handling special scenes, again at high model complexity and computational cost. Future research can optimise datasets, improve model structures, fuse multi-scale and multi-modal information, and improve real-time performance, enabling better application in augmented reality, virtual reality, intelligent robotics, and autonomous driving.
5 CONCLUSION
A crucial task in computer vision, depth estimation measures the distance between each pixel in an image and the camera. This information is vital for many applications, including robotics, autonomous driving, VR, AR, 3D reconstruction, and security monitoring, and the task is of significant research interest.
With the advent of deep learning, the accuracy and efficiency of depth estimation have improved significantly, and current work primarily falls into two categories: 2D image depth estimation and stereo image depth estimation.
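For the stereo category, the quantity that classical two-view methods recover is disparity, which pinhole geometry converts to depth via Z = f × B / d (focal length times baseline over disparity). The following minimal Python sketch makes this relation concrete; the function name and parameters are illustrative and not taken from any of the surveyed models:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Pinhole stereo relation: depth Z = f * B / d.

    disparity:  per-pixel disparity in pixels (array)
    focal_px:   focal length in pixels
    baseline_m: distance between the two cameras in metres
    """
    d = np.maximum(disparity, eps)  # avoid division by zero
    return focal_px * baseline_m / d
```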
In 2D image depth estimation, HR-Depth optimises depth estimation by redesigning skip connections and introducing fSE modules, HybridDepth fuses focal-stack information with single-image priors to resolve scale ambiguity, and SPIdepth strengthens the pose network to improve accuracy; these methods have shown excellent performance on multiple datasets.
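To make the scale-ambiguity problem that HybridDepth targets concrete, a common generic remedy is to fit a global scale and shift that align a relative (up-to-scale) depth map to metric measurements, such as those obtained from depth from focus. The sketch below shows this least-squares alignment under assumed array inputs; it illustrates the general idea only and is not HybridDepth's actual fusion pipeline:

```python
import numpy as np

def align_scale_shift(d_rel, d_metric, mask):
    """Fit scale s and shift t (least squares) so that
    s * d_rel + t matches d_metric on pixels where mask is True,
    then apply them to the whole relative depth map."""
    x = d_rel[mask].astype(np.float64)
    y = d_metric[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [N, 2]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_rel + t
```

A global fit is of course cruder than a learned fusion, but it shows why even sparse metric anchors are enough to resolve the ambiguity in a relative depth map.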
In stereo image depth estimation, UniFuse fuses features from different projections, NLFB combines self-supervised learning with non-local fusion blocks, PanoFormer adopts innovative network architectures and strategies to deal with distortion, OmniFusion addresses spherical distortion by combining spherical projection with a Transformer, and HiMODE reduces distortion and data loss with a hybrid CNN+Transformer architecture. Each of these approaches has its own advantages for depth estimation in complex scenes.
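A recurring ingredient in these panoramic pipelines is fusing feature maps that come from two projections of the same scene (e.g., equirectangular and cubemap). As a minimal PyTorch sketch, assuming both feature maps have already been re-projected to the same resolution, fusion can be as simple as channel concatenation followed by a convolution; the module below is illustrative and does not reproduce UniFuse's actual unidirectional fusion block:

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    """Fuse equirectangular features with features re-projected from a
    second representation (e.g., a cubemap) at the same resolution."""

    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, equi_feat, reproj_feat):
        # Concatenate along the channel axis and learn a joint map.
        return self.fuse(torch.cat([equi_feat, reproj_feat], dim=1))
```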
However, 2D image depth estimation currently suffers from difficult data acquisition and annotation, limited ability to handle complex scenes, insufficient feature extraction, and high model complexity and computational cost. Stereo image depth estimation likewise faces challenges in data acquisition and processing, edge and detail capture, fusion of global and local information, handling of special scenes, and model complexity and computational cost. Future depth estimation research can therefore proceed on several levels: at the data level, addressing data-related problems by synthesising data and optimising datasets; at the model level, innovating neural network structures, simplifying models, and fusing multi-scale and multi-modal information; and at the application level, expanding further into augmented reality, virtual reality, intelligent robotics, and autonomous driving while improving real-time performance and accuracy, so that depth estimation technology can be widely used and developed in more fields. Meanwhile, since stereo images usually contain richer depth information, stereo methods may become increasingly popular in depth estimation research.
REFERENCES
Cheng, B., Yu, Y., Zhang, L., et al., 2024. Depth estimation
of self-supervised monocular dynamic scene based on
deep learning. Journal of Remote Sensing, 28(9).
Ganji, A., Su, H., Guo, T., 2024. HybridDepth: Robust
Metric Depth Fusion by Leveraging Depth from Focus
and Single-Image Priors. arXiv preprint
arXiv:2407.18443.
Godard, C., Mac Aodha, O., Firman, M., et al., 2019.
Digging into self-supervised monocular depth
estimation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pp.3828-3838.
Guizilini, V., Ambrus, R., Pillai, S., et al., 2019. PackNet-SfM: 3D packing for self-supervised monocular depth estimation. arXiv preprint arXiv:1905.02693.
Hu, M., Yin, W., Zhang, C., et al., 2024. Metric3D v2: A
versatile monocular geometric foundation model for
zero-shot metric depth and surface normal estimation.
IEEE Transactions on Pattern Analysis and Machine
Intelligence.
Jiang, H., Sheng, Z., Zhu, S., et al., 2021. UniFuse: Unidirectional fusion for 360° panorama depth