DA-NET: Monocular Depth Estimation using Disparity Maps Awareness NETwork

Antoine Billy¹,² and Pascal Desbarats¹

¹Laboratoire Bordelais de Recherches en Informatique, Université de Bordeaux, France
²Innovative Imaging Solutions, Pessac, France
Keywords: Monocular Depth Estimation, Disparity Images, Stereo Vision, Convolutional Networks, U-NET.
Abstract: Estimating depth from 2D images has become an active field of study in autonomous driving, scene reconstruction, 3D object recognition, segmentation, and detection. The best performing methods are based on Convolutional Neural Networks and, as building an appropriate dataset requires a tremendous amount of work, almost all of them rely on the same benchmark to compete with each other: the KITTI benchmark. However, most of them use the ground truth generated by the LiDAR sensor, which produces a very sparse depth map, sometimes covering less than 5% of the image, and they ignore the second image that is provided for stereo estimation. Recent approaches have shown that using both input images, available in most depth estimation datasets, significantly improves the generated results. This paper is in line with this idea: we developed a simple yet efficient model based on the U-NET architecture that uses both stereo images in the training process. We demonstrate the effectiveness of our approach and show high quality results comparable to state-of-the-art methods on the KITTI benchmark.
1 INTRODUCTION
The problem of estimating depth in an image has been addressed by many authors as a crucial factor in scene understanding applications such as, for instance, autonomous driving (Urmson et al., 2008), robot assistance (Cunha et al., 2011), object detection (Volkhardt et al., 2013) or augmented reality systems (Ong and Nee, 2013). Being able to generate a dense depth map from a 2D image is strongly appealing, as current solutions require specific equipment: LiDAR (Schwarz, 2010), whose cost may be prohibitive, RGB-D cameras (Henry et al., 2014), which are only suitable for indoor scenarios, or stereo vision systems (Bertozzi and Broggi, 1998), whose precision decreases for large-scale scenes. Predicting depth from an image is relatively easy for the human brain, thanks to perspective and the known size of ordinary objects, but it can still be deceived by very simple optical illusions. Monocular depth estimation by a computer has thus proven to be a key challenge in recent studies, and several benchmarks¹ have appeared during the last decade, requiring ever more precision and accuracy from incoming methods.

¹http://www.cvlibs.net/datasets/kitti/eval_depth.php
Figure 1: Example output from our model. From Top to
Bottom: Original image and its predicted depth.
Thanks to these benchmarks, it is clear that all of the best performing methods (Fu et al.; Díaz; Ren et al., 2019) make use of deep learning approaches, and more specifically of Convolutional Neural Networks. A large amount of data is therefore required to train the models. Even with self-supervised methods (Godard et al., 2017; Zhu et al., 2018; Zhong et al., 2017), either a stereo system or a strict acquisition process with strongly overlapping frames is needed. Building such a database is a colossal job, which is why most
modern methods use already existing datasets. This moreover allows these methods to be compared with each other on the same scenarios. These datasets provide several sequences of images with their corresponding depth as ground truth. There is a limited number of ways for the dataset builder to acquire this information. Synthetic datasets (Mayer et al., 2016; Billy et al., 2019) have been developed to ensure perfect ground truth generation. These images however lack realism and fail to depict the real-world artifacts that regularly occur in outdoor acquisitions. The use of an RGB-D camera such as the Kinect (Zhang, 2012) has been exploited in (Silberman et al., 2012); this sensor is suitable for close-range indoor scenarios but cannot be applied to large outdoor scenes. LiDAR sensors appear to be the best candidates for our application but come with high entry costs. Moreover, a LiDAR sensor generates a sparse depth map that requires both realignment and interpolation. Finally, a stereo system can be embedded to generate dense depth maps. This layout is not sufficient on its own, however, because its precision decreases with scene depth. (Geiger et al., 2012) took advantage of the last two solutions, combining the strength of the LiDAR and the consistency of stereo vision, to build a widely used dataset: the so-called KITTI benchmark. Despite the fact that the KITTI data allow us to use both LiDAR and stereo acquisitions, almost all existing methods only use the LiDAR depth map for training. The goal of this paper is to show that, by taking both stereo images into account and building their corresponding disparity map, we can obtain state-of-the-art results with a very simple network architecture.
2 RELATED WORKS
Depth prediction from a single view has gained increasing attention in the computer vision community thanks to recent advances in deep learning. Simultaneous Localization and Mapping (Thrun, 2008) and Structure from Motion (Ullman, 1979) methods have turned out to be powerful tools for this task, but they require several images of the same scene with an important overlap between consecutive frames in order to work efficiently, and therefore cannot be classified as single image depth estimation. Classic depth prediction approaches employ hand-crafted features and probabilistic graphical models (Delage et al., 2007), (Mutimbu and Robles-Kelly, 2013), (Saxena et al., 2007) to yield regularized depth maps, usually making strong assumptions on the scene geometry. Moreover, formulating depth estimation as a Markov Random Field (MRF) learning problem as in (Schwarz, 2010) raises some issues. As exact MRF learning and inference are intractable in general, most of these approaches employ approximation methods such as multi-conditional learning (MCL) or particle belief propagation (PBP). These approximations require a complex modeling process and limit their application to very specific scenarios. Furthermore, recently developed deep convolutional architectures significantly outperform previous methods in terms of depth estimation accuracy, speed and versatility.
An exhaustive review of CNN single-image depth estimators has been made by (Koch et al.), together with a standardized evaluation protocol. One of the first truly competitive CNNs for single-image depth prediction was developed by (Eigen et al., 2014) using a two-scale deep network. Unlike most previous work in single image depth estimation, they do not rely on hand-crafted features or an initial over-segmentation; instead, they learn a representation directly from the raw pixel values. Several works have built upon the success of this approach, improving the overall precision of the estimation. These approaches however rely on having high quality, pixel-aligned ground truth depth at training time, and are only based on LiDAR Velodyne data. (Zhu et al., 2018; Zhong et al., 2017) proposed self-supervised methods whose training phase does not require ground truth data. Similarly, (Garg et al., 2016) offered an unsupervised CNN for depth prediction. They train a network for monocular depth estimation using an image reconstruction loss. However, their image formation model is not fully differentiable, so they have to perform a Taylor approximation to linearize their loss, resulting in an objective that is more challenging to optimize and that, again, does not take full advantage of the stereo data.
This principle has been exploited with the DispNet architecture (Mayer et al., 2016), later improved by (Godard et al., 2017). They integrate disparity into their loss function by taking into account the synthetic depth maps developed by (Mayer et al., 2016) and the left-right consistency employed in (Godard et al., 2017). Our approach builds on unsupervised cues such as those of (Godard et al., 2017) and likewise uses both images of the stereo pair to define our loss function, but it uses the strength of supervised methods to enhance the generated results. A quantitative comparison between our method and (Godard et al., 2017) is detailed in a later section, showing that we outperform their prediction in almost all metrics.
3 THE DA-NET MODEL
Figure 2: Example data given by KITTI. Top to Bottom:
original RGB image, LiDAR depth map, Computed dispar-
ity map.
This section describes our single image depth prediction network. We introduce a new depth estimation training loss that includes a Disparity Map awareness term. This allows us to train our model with fully dense images. Indeed, as shown in Figure 2, the depth image generated by LiDAR acquisition is very sparse and needs strong interpolation to produce a final dense image. This interpolation process generates a rough interpretation of the acquired scene and can easily miss fine details or even small objects such as road signs, trees or, far away, even pedestrians or cars.
On the other hand, a disparity map generated by a stereo vision algorithm directly provides a depth value for every pixel in the image, resulting in a denser depth estimation. A thorough study of stereo vision algorithms has been presented by (Scharstein et al., 2001) and several approaches can be used to generate the finest disparity map from rectified images. Semi-Global Matching (SGBM) (Hirschmüller, 2008) is known to produce more accurate depths and is an integral part of many state-of-the-art stereo algorithms. We train a CNN to model the complex non-linear transformation which converts an image to a depth map. The loss we use for learning this CNN is the photometric difference between the input image and the inverse-warped target image (the other image in the stereo pair). This loss is highly correlated with the prediction error, as it can accurately rank two different depth maps even without using ground-truth labels.
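As an illustration, such a dense disparity map can be produced, for instance, with OpenCV's SGBM implementation. The sketch below is a minimal example; the parameter values and file names are illustrative assumptions rather than the settings used in this work.

```python
import cv2

# Rectified stereo pair; file names are placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,        # search range, must be a multiple of 16
    blockSize=5,
    P1=8 * 5 * 5,              # smoothness penalties for small / large
    P2=32 * 5 * 5,             # disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16; invalid pixels
# (e.g. the unmatched left border) are negative and can be masked out.
disparity = sgbm.compute(left, right).astype("float32") / 16.0
```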
Training Loss. Similar to (Godard et al., 2017), we
also formulate our problem as the minimization of a
photometric reprojection error at training time. Our
final loss L (4) is computed as a combination of three
main terms.
Photometric Loss. Following (Pillai et al., 2019), the similarity between the target image $I_t$ and the synthesized target image $\hat{I}_t$ is computed using the Structural Similarity (SSIM) (Pillai et al., 2019) term combined with an L1 norm, as shown in (1):

$$L_p = \alpha_1 \, \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha_1) \, \lVert I_t - \hat{I}_t \rVert_1 \tag{1}$$
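For concreteness, the following is a minimal PyTorch sketch of Eq. (1). The 3×3 SSIM window and the value α₁ = 0.85 are assumptions for illustration, not settings reported in this paper.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM computed over 3x3 neighbourhoods (the window size is an
    # assumption, not a setting reported in the paper).
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def photometric_loss(target, synthesized, alpha_1=0.85):
    # L_p of Eq. (1): SSIM term blended with an L1 term.
    ssim_term = torch.clamp((1.0 - ssim(target, synthesized)) / 2.0, 0.0, 1.0)
    l1_term = torch.abs(target - synthesized)
    return (alpha_1 * ssim_term + (1.0 - alpha_1) * l1_term).mean()
```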
Smoothness Loss. In order to regularize the disparities in textureless, low-image-gradient regions, we incorporate an edge-aware term based on the generated disparity map. The effect of each pyramid level is decayed by a factor of 2 on downsampling, starting with a weight of 1 for the 0th pyramid level.

$$L_s = \lvert \partial_x d_t \rvert \, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y d_t \rvert \, e^{-\lvert \partial_y I_t \rvert} \tag{2}$$
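A sketch of the edge-aware term of Eq. (2) at a single scale is given below; the per-scale weighting described above would be applied on top of it, and the (batch, channels, height, width) tensor layout is an assumption.

```python
import torch

def smoothness_loss(disp, img):
    # L_s of Eq. (2) at a single scale: disparity gradients are down-weighted
    # where the image itself has strong gradients (edges).
    d_disp_x = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    d_disp_y = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    d_img_x = torch.mean(torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]), 1, keepdim=True)
    d_img_y = torch.mean(torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]), 1, keepdim=True)
    return (d_disp_x * torch.exp(-d_img_x)).mean() + \
           (d_disp_y * torch.exp(-d_img_y)).mean()
```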
Disparity Awareness Loss. Finally, our main contribution is the integration of our Disparity Awareness loss. This loss is extremely similar to the first criterion; however, instead of comparing the generated image $I_t$ with the LiDAR output $\hat{I}_t$, we compute its difference with the generated disparity map $D_t$:

$$L_d = \alpha_2 \, \frac{1 - \mathrm{SSIM}(I_t, D_t)}{2} + (1 - \alpha_2) \, \lVert I_t - D_t \rVert \tag{3}$$
Final Loss. Ultimately, a binary mask is applied to both the LiDAR and disparity images. This mask ensures that irrelevant pixels (LiDAR holes or the left-most pixels of the disparity map) do not interfere in the final loss computation.

$$L = \alpha L_p + \beta L_d + \gamma L_s \tag{4}$$

Whereas $L_p$ forces the reconstructed image to be similar to the LiDAR input and $L_d$ to the disparity map, $L_s$ smoothes the generated depth estimation.
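Putting the three terms together, a hedged sketch of Eq. (4) could look as follows. The weights α, β, γ and the exact masking strategy are placeholders, since their values are not specified here; `photometric_loss` and `smoothness_loss` are the sketches given above.

```python
import torch

def total_loss(pred_depth, target_img, synth_img, disparity,
               lidar_mask, disp_mask, alpha=1.0, beta=1.0, gamma=0.1):
    # Eq. (4): weighted sum of the photometric, disparity-awareness and
    # smoothness terms. Weights and masking are illustrative assumptions.
    l_p = photometric_loss(target_img * lidar_mask, synth_img * lidar_mask)
    # Disparity-awareness term of Eq. (3): compare against the SGBM disparity
    # map, ignoring its unmatched left border via disp_mask.
    l_d = photometric_loss(pred_depth * disp_mask, disparity * disp_mask)
    l_s = smoothness_loss(pred_depth, target_img)
    return alpha * l_p + beta * l_d + gamma * l_s
```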
3.1 Network Architecture
Our depth estimation network is based on the general U-Net architecture (Ronneberger et al., 2015). As shown in Figure 3, this architecture makes it possible to represent the input image at different scales, while taking
advantage of high-level features at each step. Although this architecture was introduced for medical image segmentation, we demonstrate hereafter how it can be efficiently used for monocular depth prediction when trained using our proposed loss function.
Figure 3: Our network is based on the general U-Net archi-
tecture.
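For concreteness, the sketch below shows a strongly reduced U-Net-style encoder-decoder with skip connections in PyTorch. The number of levels, channel widths and output activation are illustrative assumptions and do not reproduce the exact DA-NET configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    # Reduced U-Net-style encoder-decoder: 3 levels and small channel widths,
    # chosen purely for illustration.
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2, self.dec2 = nn.ConvTranspose2d(128, 64, 2, stride=2), conv_block(128, 64)
        self.up1, self.dec1 = nn.ConvTranspose2d(64, 32, 2, stride=2), conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, 1)  # single-channel depth/disparity output

    def forward(self, x):
        e1 = self.enc1(x)                                  # full resolution
        e2 = self.enc2(self.pool(e1))                      # 1/2 resolution
        e3 = self.enc3(self.pool(e2))                      # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(e3), e2], 1))   # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], 1))   # skip connection from e1
        return torch.sigmoid(self.head(d1))
```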
4 EXPERIMENTS AND RESULTS
In this section, we compare the performance of our approach to several state-of-the-art methods. We applied straightforward data augmentation by horizontally flipping the images and by performing random gamma, brightness, and color shifts, sampled from uniform distributions for each color channel separately. We compare two different variants of our method, with (Ours) and without (Ours No DA) Disparity Awareness, to evaluate its impact. We evaluate our models on the Eigen split of KITTI to allow comparison with previously published monocular methods.
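A possible implementation of this augmentation step is sketched below; the sampling ranges, the left/right swap on flipping and the [0, 1] image range are assumptions, not settings reported in this paper.

```python
import numpy as np

def augment(left, right, rng=np.random):
    # Random horizontal flip plus per-channel gamma, brightness and colour
    # jitter. Images are assumed to be float arrays of shape (H, W, 3) in [0, 1].
    if rng.rand() < 0.5:
        # Flipping a rectified stereo pair also swaps the roles of left/right.
        left, right = np.fliplr(right).copy(), np.fliplr(left).copy()
    gamma = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(0.8, 1.2)
    colour = rng.uniform(0.9, 1.1, size=3)   # one factor per colour channel
    out = []
    for img in (left, right):
        out.append(np.clip((img ** gamma) * brightness * colour, 0.0, 1.0))
    return out[0], out[1]
```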
4.1 The KITTI Dataset
The KITTI dataset by (Geiger et al., 2012) was introduced in IJRR in 2013. Instead of utilizing the entire KITTI dataset, it is common to follow the Eigen split. This split contains 23,488 images from 32 scenes for training and 697 images from 29 scenes for testing. We use the data split of (Eigen et al., 2014). Except in ablation experiments, we follow Zhou et al.'s pre-processing to remove static frames for training. This results in 39,810 monocular images for training and 4,424 for validation. We use the same intrinsic parameters for all images, setting the principal point of the camera to the image center and the focal length to the average of all the focal lengths in KITTI. For a fair comparison with state-of-the-art single view depth prediction, we evaluate our results on the same cropped region of interest.
4.2 Evaluation Protocol
In addition to the dataset, a development kit is provided by KITTI to evaluate our methods. The following metrics are computed in this kit:
RMSE: $\sqrt{\frac{1}{|T|} \sum_{i \in T} \lVert d_i - d_i^{gt} \rVert^2}$

RMSE log: $\sqrt{\frac{1}{|T|} \sum_{i \in T} \lVert \log(d_i) - \log(d_i^{gt}) \rVert^2}$

Sq. relative: $\frac{1}{|T|} \sum_{i \in T} \frac{\lVert d_i - d_i^{gt} \rVert^2}{d_i^{gt}}$

Abs. relative: $\frac{1}{|T|} \sum_{i \in T} \frac{\lvert d_i - d_i^{gt} \rvert}{d_i^{gt}}$

Accuracies: % of $d_i$ s.t. $\max\left(\frac{d_i}{d_i^{gt}}, \frac{d_i^{gt}}{d_i}\right) = \delta < thr$
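These metrics can be computed in a few lines of NumPy; the sketch below assumes pred and gt are arrays containing only valid, positive depth values.

```python
import numpy as np

def depth_metrics(pred, gt):
    # Standard KITTI-style error and accuracy metrics listed above.
    thresh = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(thresh < 1.25 ** k) for k in (1, 2, 3)]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "a1": acc[0], "a2": acc[1], "a3": acc[2]}
```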
Even though these statistics are good indicators of the general quality of the predicted depth maps, they can be misleading. In particular, the standard metrics are not able to directly assess the planarity of planar surfaces or the correctness of estimated plane orientations. Furthermore, it is highly relevant that depth discontinuities are precisely located, which is not reflected by the standard metrics. These metrics nevertheless allow us to easily compare our results with other approaches, as shown in Tables 1 and 2.
Table 1: Quantitative results. Comparison of our method to existing methods on the KITTI Eigen split. Best results are shown
in bold. Except for (Eigen et al., 2014), all the results have been directly taken from their papers, as we use the exact same
split for evaluation. This table shows that our method outperforms state of the art approaches in almost all metrics.
Method                  Abs Rel  Sq Rel  RMSE   RMSE log  δ<1.25  δ<1.25²  δ<1.25³
                        (lower is better)                 (higher is better)
Train Set Mean          0.361    4.826   8.102  0.377     0.638   0.804    0.894
(Eigen et al., 2014)    0.214    1.605   6.563  0.292     0.673   0.884    0.957
(Yang et al., 2018)     0.198    1.202   5.977  0.266     0.72    0.901    0.932
(Godard et al., 2017)   0.148    1.344   5.927  0.247     0.803   0.922    0.964
Ours No DA              0.166    1.401   6.321  0.284     0.701   0.899    0.918
Ours                    0.112    0.961   5.641  0.223     0.786   0.944    0.971
Table 2: Qualitative results on KITTI. Our method produces superior visual results on distinct scenarios thanks to its Disparity
Awareness. Smaller foreground objects such as trees or road signs are clearly better detected in our method.
Qualitative results on the KITTI Eigen split, for three scenarios. Rows: input image; (Hirschmüller, 2008); (Godard et al., 2017); Ours. [Images not reproduced here.]
4.3 Results Analysis
Quantitative results in Table 1 unambiguously high-
light the competitiveness of our method with respect
to state-of-the-art approaches.
Our scores outperform all the tested methods on every pixel-wise metric that averages the offset between the prediction and the ground truth. This shows that our proposed model provides more accurate depth predictions than the other methods. This is confirmed by the accuracy evaluation, where our method gives better scores for both δ < 1.25² and δ < 1.25³. However, for δ < 1.25, (Godard et al., 2017) produces better results. This is due to the fact that the depth maps generated by SGBM tend to be less precise as the depth increases. Indeed, the depth value of a pixel is directly correlated with the offset from its correspondence in the second image. As this offset is a discrete value in pixel units, pixels corresponding to faraway objects only have a limited set of possible values. In contrast, foreground pixels are significantly easier to distinguish, allowing a visible improvement over state-of-the-art results, as illustrated in Table 2.
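To make this effect concrete, recall the standard stereo relation between depth $z$, focal length $f$, baseline $B$ and disparity $d$:

$$z = \frac{f \cdot B}{d}$$

With approximate KITTI-like values ($B \approx 0.54$ m, $f \approx 721$ px, given here purely as illustrative figures), a one-pixel disparity error at $d = 60$ px shifts the depth by roughly 0.1 m around $z \approx 6.5$ m, whereas at $d = 5$ px the same one-pixel error shifts the depth by more than 10 m around $z \approx 78$ m.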
In Table 2, we can see that SGBM tends to create artifacts in the reconstruction, as the consistency of the disparity map lacks regularization. The method presented in (Godard et al., 2017) succeeds in generating smoother results; however, fine structures such as tree trunks or sign posts may disappear in the reconstruction. Our approach achieves both smooth predictions with no artifacts, thanks to the combination of photometric loss and disparity awareness, and fine structure recovery, as the encoder-decoder architecture of the network is able to handle a large variety of structure scales during prediction.
This behaviour stands out even more in Figure 4, where a comparison is made between (Godard et al., 2017) and our method. It clearly points out the advantage of using disparity maps. The tree is clearly visible in the input image, but its outline is partially erased by (Godard et al., 2017), cutting the tree in two halves. Thanks to the Disparity Awareness, our method successfully segments the contours and the predicted depth fits the real-world scenario.
(a) Input image (b) Godard et al. (c) Our method
Figure 4: Qualitative results on KITTI. These zoomed depth
maps show that our approach visually outperforms (Godard
et al., 2017) for the foreground objects segmentation, pro-
ducing better results thanks to its Disparity Awareness.
5 CONCLUSION
In this paper we have presented a simple yet efficient deep neural network for monocular depth estimation. We proposed a novel photometric loss that takes advantage of disparity map consistency while ensuring the regularity of the predicted depth image. Along with the proposed loss, we showed that using both stereo images to generate a dense disparity map, instead of only the depth given by the LiDAR sensor, is a strong tool for predicting satisfying dense depth maps that outperform state-of-the-art methods both qualitatively and quantitatively. Future work will
include evaluation on other datasets as well as stud-
ies on temporal consistency to apply our methods to
videos.
REFERENCES
Bertozzi, M. and Broggi, A. (1998). GOLD: A parallel real-
time stereo vision system for generic obstacle and lane
detection. IEEE Transactions on Image Processing,
7(1):62–81.
Billy, A., Pouteau, S., Desbarats, P., Chaumette, S., and
Domenger, J. P. (2019). Adaptive slam with synthetic
stereo dataset generation for real-time dense 3d re-
construction. In VISIGRAPP 2019 - Proceedings of
the 14th International Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and
Applications, volume 5, pages 840–848.
Cunha, J., Pedrosa, E., Cruz, C., Neves, A., and Lau, N.
(2011). Using a depth camera for indoor robot local-
ization and navigation. In DETI/IEETA-University of
Aveiro.
Delage, E., Lee, H., and Ng, A. Y. (2007). Automatic
single-image 3d reconstructions of indoor Manhattan
world scenes. Springer Tracts in Advanced Robotics,
28.
Díaz, R. Soft Labels for Ordinal Regression. Technical
report.
Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map
prediction from a single image using a multi-scale
deep network. In Advances in Neural Information
Processing Systems, volume 3, pages 2366–2374.
Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao,
D. Deep Ordinal Regression Network for Monocular
Depth Estimation. Technical report.
Garg, R., Vijay Kumar, B. G., Carneiro, G., and Reid, I.
(2016). Unsupervised CNN for single view depth es-
timation: Geometry to the rescue. In Lecture Notes in
Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioin-
formatics), volume 9912 LNCS, pages 740–756.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the KITTI vision benchmark
suite. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion, pages 3354–3361.
Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Un-
supervised monocular depth estimation with left-right
consistency. In Proceedings - 30th IEEE Conference
on Computer Vision and Pattern Recognition, CVPR
2017, volume 2017-Janua, pages 6602–6611.
Henry, P., Krainin, M., Herbst, E., Ren, X., and Fox,
D. (2014). RGB-D mapping: Using depth cameras
for dense 3D modeling of indoor environments. In
Springer Tracts in Advanced Robotics, volume 79,
pages 477–491. Springer Verlag.
Hirschmüller, H. (2008). Stereo processing by semiglobal
matching and mutual information. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
30(2):328–341.
Koch, T., Liebel, L., Fraundorfer, F., and Körner, M. Eval-
uation of CNN-based Single-Image Depth Estimation
Methods. Technical report.
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D.,
Dosovitskiy, A., and Brox, T. (2016). A Large Dataset
to Train Convolutional Networks for Disparity, Opti-
cal Flow, and Scene Flow Estimation. In Proceedings
of the IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition, volume 2016-
Decem, pages 4040–4048.
Mutimbu, L. and Robles-Kelly, A. (2013). A relaxed fac-
torial Markov random field for colour and depth es-
timation from a single foggy image. In 2013 IEEE
International Conference on Image Processing, ICIP
2013 - Proceedings, pages 355–359.
Ong, S. and Nee, A. (2013). Virtual and augmented reality
applications in manufacturing.
Pillai, S., Ambruş, R., and Gaidon, A. (2019). SuperDepth:
Self-supervised, super-resolved monocular depth esti-
mation. In Proceedings - IEEE International Confer-
ence on Robotics and Automation, volume 2019-May,
pages 9250–9256.
Ren, H., El-khamy, M., and Lee, J. (2019). Deep Robust
Single Image Depth Estimation Neural Network Us-
ing Scene Understanding.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
net: Convolutional networks for biomedical image
segmentation. In Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intel-
ligence and Lecture Notes in Bioinformatics), volume
9351, pages 234–241. Springer Verlag.
Saxena, A., Sun, M., and Ng, A. Y. (2007). Learning 3-D
scene structure from a single still image. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision.
Scharstein, D., Szeliski, R., and Zabih, R. (2001). A
taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. In Proceedings - IEEE
Workshop on Stereo and Multi-Baseline Vision, SMBV
2001, pages 131–140.
Schwarz, B. (2010). LIDAR: Mapping the world in 3D.
Technical report.
Silberman, N., Hoiem, D., Kohli, P., and Fergus, R.
(2012). Indoor segmentation and support inference
from RGBD images. In Lecture Notes in Computer
Science (including subseries Lecture Notes in Artifi-
cial Intelligence and Lecture Notes in Bioinformatics),
volume 7576 LNCS, pages 746–760.
Thrun, S. (2008). Simultaneous localization and mapping.
Ullman, S. (1979). The interpretation of structure from mo-
tion. Proceedings of the Royal Society of London. Se-
ries B, Containing papers of a Biological character.
Royal Society (Great Britain), 203(1153):405–426.
Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner,
R., Clark, M. N., Dolan, J., Duggins, D., Galatali,
T., Geyer, C., Gittleman, M., Harbaugh, S., Hebert,
M., Howard, T. M., Kolski, S., Kelly, A., Likhachev,
M., McNaughton, M., Miller, N., Peterson, K., Pil-
nick, B., Rajkumar, R., Rybski, P., Salesky, B., Seo,
Y.-W., Singh, S., Snider, J., Stentz, A., Whittaker,
W. R., Wolkowicki, Z., Ziglar, J., Bae, H., Brown, T.,
Demitrish, D., Litkouhi, B., Nickolaou, J., Sadekar,
V., Zhang, W., Struble, J., Taylor, M., Darms, M., and
Ferguson, D. (2008). Autonomous driving in urban
environments. Journal of Field Robotics, 25(8):425–
466.
Volkhardt, M., Schneemann, F., and Gross, H. M. (2013).
Fallen person detection for mobile robots using 3D
depth data. In Proceedings - 2013 IEEE International
Conference on Systems, Man, and Cybernetics, SMC
2013, pages 3573–3578.
Yang, N., Wang, R., Stuckler, J., and Cremers, D. (2018).
Deep Virtual Stereo Odometry : Monocular Direct
Sparse Odometry. European Conference on Computer
Vision, pages 1–17.
Zhang, Z. (2012). Microsoft kinect sensor and its effect.
Zhong, Y., Dai, Y., and Li, H. (2017). Self-Supervised
Learning for Stereo Matching with Self-Improving
Ability.
Zhu, A., Yuan, L., Chaney, K., and Daniilidis, K. (2018).
EV-FlowNet: Self-Supervised Optical Flow Estima-
tion for Event-based Cameras. Robotics: Science and
Systems Foundation.