Machine Learning is the Solution Also for Foveated Path Tracing Reconstruction

Atro Lotvonen (https://orcid.org/0000-0002-2985-4339), Matias Koskela (https://orcid.org/0000-0002-5839-7640)
and Pekka Jääskeläinen (https://orcid.org/0000-0001-5707-8544)

Tampere University, Finland
Keywords: Foveated Rendering, Path Tracing, Neural Network.

Abstract: Real-time photorealistic rendering requires a lot of computational power. Foveated rendering reduces the work
by focusing the effort to where the user is looking, but the very sparse sampling in the periphery requires fast
reconstruction algorithms with good quality. The problem is even more complicated in the field of foveated
path tracing where the sparse samples are also noisy. In this position paper we argue that machine learning
and data-driven methods play an important role in the future of real-time foveated rendering. In order to
provide initial evidence to support this opinion, we propose a preliminary machine learning-based method which is
able to improve the reconstruction quality of a foveated path traced image by using spatio-temporal input data.
Moreover, the method is able to run in the same reduced foveated resolution as the path tracing setup. The
reconstruction using the preliminary network is about 2.9ms per 658 × 960 frame on a GeForce RTX 2080 Ti
GPU.
1 INTRODUCTION
High quality real-time graphics rendering remains a
major challenge in producing an immersive Mixed
Reality (MR) experience with contemporary comput-
ing platforms. A significant part of the computa-
tional cost of photorealistic rendering can be reduced
by means of foveated rendering where the quality is
reduced in the peripheral parts of the human vision
(Guenter et al., 2012). However, most of the foveated
rendering work currently focuses merely on reduc-
ing the resolution, although peripheral vision also has
other understudied aspects which could be utilized
for foveated rendering (Albert et al., 2019). There-
fore, we believe there is room for research in the area
of better foveated reconstruction filters.
Conventionally, the final frame from foveated
sparse samples has been reconstructed with linear in-
terpolation. Machine learning-based methods are an
interesting future direction for reconstruction because
they can take factors other than the resolution reduction,
such as the temporal stability of the content,
into account. However, current machine
learning methods suffer from a slow inference run-
time and need a loss function that accurately mod-
els human vision. These problems are likely going
to be solved in the future by increasingly fast inference
hardware and the development of faster
deep learning layers with similar performance (Shi
et al., 2016). In addition, recent work shows that
the loss function can be a combination of multiple
simpler loss functions and it does not have to include
a single accurate model of the human visual system
(Kaplanyan et al., 2019).
In this position paper we argue that, in the fu-
ture, machine learning algorithms are going to play
a critical role in reconstructing foveated rendering re-
sults. We also argue that the reconstruction network
could be done only in the reduced foveated resolu-
tion. We also show preliminary evidence that machine
learning-based reconstruction can be fast enough for
wider adaption in the field of foveated rendering.
2 PATH TRACING IN REAL TIME
There are many ways to render content for MR de-
vices. Path tracing (Kajiya, 1986) is an interesting
candidate because it naturally supports many effects
such as reflections, refractions, and caustics (Pharr
et al., 2016) which are harder to achieve with other
methods.
[Figure 1 image: pipeline diagram showing 1 spp path tracing, an analytical denoiser, a CNN luma upsample (5 × 5 and 3 × 3 convolution layers with 64, 32 and 4 feature maps, ReLU and tanh activations, and a sub-pixel upscale), bilinear chroma upsampling, direct sampling of the fovea, frames n and n−1, the distance to fovea, and the mapping to screen space.]
Figure 1: An example of a foveated rendering pipeline similar to Koskela et al. (2019b). In this position paper we claim that
the mapping stage should be done with machine learning. A similar mapping could be used with other rendering pipelines
that render foveated samples in continuous space like Friston et al. (2019). However, direct sampling of the fovea would be
in the center of the input image.
Due to its extreme computational requirements,
however, real-time path tracing was long considered
infeasible. Recently, dedicated hardware acceleration
features for ray traversal, the primitive operation
of path tracing, have started to appear in
state-of-the-art desktop GPUs (Kilgariff et al., 2018).
However, even with the hardware acceleration,
the Monte Carlo integration of path tracing can
only generate a very noisy estimate of the final image
(Barré-Brisebois, 2018). This noise can be removed
in real time with analytical filters such as regression-
based methods (Koskela et al., 2019a) or fast edge
preserving blurs (Schied et al., 2017).
Analytical denoising of path tracing has also been
used in foveated rendering (Koskela et al., 2019b). In
this method the path tracing and denoising are done
in a polar space which is modified to distribute the
samples according to a visual acuity model of the hu-
man eye. Another option instead of polar space would
be to use magnified Cartesian screen space render-
ing (Friston et al., 2019). Typically, reconstruction
of the final screen space frame is done with simple
filters like linear interpolation. However, the human
visual system can detect the presence of a pattern
before resolving it, and therefore it is a good idea to add
contrast to the peripheral parts of the foveated ren-
dering (Patney et al., 2016). Also, an analytical filter
can simulate other rendering effects such as depth of
field (Weier et al., 2018). However, since data-driven
methods have come to dominate other tasks of filling
in missing image data (Hays and Efros, 2007), and as
commercial processors increasingly add hardware
support for inference, in this paper we argue that machine
learning can also be the solution to foveated rendering
reconstruction.
3 MACHINE LEARNING RECONSTRUCTION
Multiple machine learning based methods have been
proposed previously for removing path tracing noise,
but their run-time has so far been too slow for real-
time applications (Chaitanya et al., 2017; Vogels
et al., 2018; Bako et al., 2017). Deep learning has
also produced state-of-the-art results for Single
Image Super Resolution (SISR) (Ledig et al., 2017).
The usual means for creating high resolution images
from low resolution ones with deep learning has been
to first upsample the low resolution image with an
interpolation filter such as bicubic interpolation and
then use Convolutional Neural Networks (CNN) to
improve the result (Dong et al., 2015). However, Shi
et al. (2016) suggest that analytical upscaling does not
provide additional information for solving the prob-
lem. Instead, the solution can be found by doing
the nonlinear convolutions only in the low resolution
space, and finally upscaling the image with an effi-
cient sub-pixel convolutional layer. This way the net-
work learns the upscaling filters within the network,
which reduces the computational cost.
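As an illustration of this idea, the following minimal TensorFlow sketch runs a convolution in the low resolution space and upscales with depth_to_space, which implements the periodic shuffling of an efficient sub-pixel convolutional layer; the tensor shapes here are illustrative assumptions only.

import tensorflow as tf

r = 2                                              # assumed upscaling factor
low_res = tf.random.normal([1, 480, 329, 1])       # NHWC, half of 960 x 658
# The last convolution produces r*r feature maps per output channel...
features = tf.keras.layers.Conv2D(r * r, 3, padding="same")(low_res)
# ...which the periodic shuffle rearranges into an r-times larger image.
high_res = tf.nn.depth_to_space(features, r)       # shape (1, 960, 658, 1)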
Furthermore, for real-time video super resolution,
Caballero et al. (2017) found that including
temporal data with motion compensation improves
the quality of the result, which can be traded for a
network with lower computational cost and similar
quality. For temporal denoising and reconstruction of
path traced images, Bako et al. (2017) argue that using
temporal data shifted with motion vectors is better
than using recurrent layers like those proposed by
Chaitanya et al. (2017).
In addition, different strategies for fusing temporal
data in video streams have been tested (Karpathy et al.,
2014; Caballero et al., 2017). In early fusing, all of the
input frames are given together to the first layer
of the network. With slow fusing, differ-
ent input CNN layers handle different input frames
and are collapsed later in the network. It was found
that early fusing works better for shallower networks
while in deeper networks the slow fusing is more ben-
eficial.
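The difference between the two strategies can be made concrete with a small Keras sketch; the layer sizes are arbitrary assumptions, and only the wiring of the three single-channel input frames matters here.

import tensorflow as tf
from tensorflow.keras import layers

frames = [tf.keras.Input(shape=(None, None, 1)) for _ in range(3)]

# Early fusing: all frames are concatenated on the channel axis and
# handled together by the first convolution layer.
early = layers.Conv2D(64, 5, padding="same", activation="relu")(
    layers.Concatenate(axis=-1)(frames))

# Slow fusing: each frame gets its own convolution tower, and the
# towers are collapsed later in the network.
towers = [layers.Conv2D(64, 5, padding="same", activation="relu")(f)
          for f in frames]
slow = layers.Conv2D(64, 3, padding="same", activation="relu")(
    layers.Concatenate(axis=-1)(towers))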
In this position paper, based on these previous
works, we show initial evidence for our argument
by proposing a preliminary machine learning-based
method for reconstructing foveated path tracing. An
idea for the architecture of the CNN is to use the
efficient sub-pixel convolutional layer. In that case
the reconstruction layers are computed in the lower
foveated resolution and therefore the network learns
upscaling filters instead of just improving the quality
of an image that is nearest neighbor upsampled. As a
result, we can decrease the spatial and temporal alias-
ing in the lower resolution rendered periphery, and
also take advantage of the lower computational re-
quirements. Furthermore, the proposed method uses
temporal data shifted with motion vectors which is
fused early in the network because the network archi-
tecture is shallow.
4 PRELIMINARY METHOD
In this section, we propose a preliminary machine
learning-based method for reconstructing foveated
path tracing. We focus on the periphery part of
the foveated path traced result since we assume the
foveal region already has sufficient quality in the
described system. Also, as in previous work, we are
focusing only on the luminance components of the in-
put images because the human eye is more sensitive to
changes in the luminance channel in the YCbCr color
space (Shi et al., 2016).
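For concreteness, extracting the luma plane can be as simple as the following NumPy sketch; the ITU-R BT.601 weights are the conventional choice behind YCbCr, and the assumption here is an RGB image with values in [0, 1].

import numpy as np

def rgb_to_luma(rgb):
    # Y = 0.299 R + 0.587 G + 0.114 B (BT.601 luma weights)
    return rgb @ np.array([0.299, 0.587, 0.114], dtype=rgb.dtype)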
4.1 Path Tracing Setup
In our example setup, the 1 spp path tracing is done
the same way as described in Koskela et al. (2019b);
we follow the visual acuity function by sampling in
the Visual-Polar space. We also use the same foveated
resolution reduction of 61% so that the full resolu-
tion of a single eye in the used head mounted display
(1280 × 1440) can be rendered in a foveated resolu-
tion of 853 × 960. The used path tracing sample dis-
tribution mechanism could also be similar to (Friston
et al., 2019), but using the Visual-Polar space allowed
us to easily consider only the periphery part of the
rendered image: In the case of the Visual-Polar trans-
formation, the fovea is always mapped to the same
location in the image. Thus, to focus on the periphery
part of the rendering, it is enough to take a slice of the
modified polar image on the horizontal ρ axis.
For denoising we can use state-of-the-art fast an-
alytical path tracing denoisers such as Schied et al.
(2017) or Koskela et al. (2019a) directly in the re-
duced foveated Visual-Polar space and then feed the
denoised image to our reconstruction CNN.
4.2 Foveated Input and Output
For the input image we focus only on the periphery
part of the foveated path traced image. In order to
help the convolutional layers account for the spatially
varying sample distribution, one of the inputs
to the network is the distance to the fovea. In the
case of the Visual-Polar space, the distance changes
only on the horizontal ρ axis.
Only the luminance channel of the foveated path
traced images is considered. This decreases the com-
putation time of the network compared to full RGB
values for the current and previous frames. In the
case of the Visual-Polar space, the edges of the image
are mapped to the middle of the final Cartesian screen
space image. Therefore, basic padding, such as zero
padding, in the Visual-Polar space leads to poorer
quality in the middle of the screen space image. This can
be handled with a custom wrap-around padding.
Temporal data for the current frame is calculated
using the previous frame and motion vectors. The
data that is not present in the previous frame is filled with
a value outside of the nominal luminance range. In this
case we used a floating point value of -1.0. The idea
is that the network learns not to use this data.
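A NumPy sketch of this reprojection step is given below; it assumes per-pixel motion vectors that point from the current frame back to the previous one and uses nearest-neighbor gathering, both of which are illustrative simplifications.

import numpy as np

def reproject(prev_luma, motion):
    h, w = prev_luma.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the motion vectors back to the previous frame.
    src_x = np.rint(xs + motion[..., 0]).astype(int)
    src_y = np.rint(ys + motion[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Data not present in the previous frame is filled with -1.0,
    # a value outside the nominal luminance range.
    out = np.full_like(prev_luma, -1.0)
    out[valid] = prev_luma[src_y[valid], src_x[valid]]
    return out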
We use a path tracing resolution of 853 × 960.
For the machine learning reconstruction, we take a
slice from the periphery of 658 × 960 with a cus-
tom padding of 4 on each border, which is fed to the
network as an input. The fovea part of the Visual-Polar
frame is directly sampled to screen space with
anisotropic sampling.
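The slicing and the custom wrap-around padding described above can be sketched as follows; which end of the ρ axis holds the fovea and the edge padding on the radial axis are assumptions made for illustration.

import numpy as np

visual_polar = np.zeros((960, 853), dtype=np.float32)  # H x W, rho horizontal
periphery = visual_polar[:, -658:]   # drop the foveal columns (assumed side)
# Wrap-around padding on the angular axis, so the seam that maps to the
# middle of the screen is padded with its true neighbors.
padded = np.pad(periphery, ((4, 4), (0, 0)), mode="wrap")
padded = np.pad(padded, ((0, 0), (4, 4)), mode="edge")  # radial axis
# padded.shape == (968, 666), the network input size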
4.3 Convolutional Neural Network Architecture
For our use case we use an architecture similar to
that of Shi et al. (2016). The foveated
CNN is shown fully in Figure 1 with an example in-
put and an output. We use a hyperbolic tangent (tanh)
activation layer only in the last layer of the network
and a Rectified Linear Unit (ReLU) in the first two,
which we found to work better with our loss function.
For each low resolution foveated denoised input we
have a fully converged 4096 spp foveated 2× resolu-
tion image. Examples of the foveated input and the
reference output can be seen in Figure 2. We also em-
ployed early fusion for the input data, in which all
three inputs are collapsed at the beginning of the
network, which has been demonstrated to improve qual-
ity in shallow networks (Caballero et al., 2017; Karpa-
thy et al., 2014).
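A minimal Keras sketch that is consistent with Figure 1 is given below: three single-channel inputs are fused early, followed by 5 × 5 and 3 × 3 convolutions with 64 and 32 feature maps, ReLU activations on the first two layers, tanh on the last, and a 2× sub-pixel upscale. Hyperparameters beyond what the figure shows are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

luma = tf.keras.Input(shape=(None, None, 1))        # current denoised luma
prev_luma = tf.keras.Input(shape=(None, None, 1))   # reprojected previous luma
fovea_dist = tf.keras.Input(shape=(None, None, 1))  # distance to fovea

x = layers.Concatenate(axis=-1)([luma, prev_luma, fovea_dist])  # early fusion
x = layers.Conv2D(64, 5, padding="same", activation="relu")(x)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(4, 3, padding="same", activation="tanh")(x)   # 4 = 2*2 luma
hires = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)  # 2x upscale

model = tf.keras.Model([luma, prev_luma, fovea_dist], hires)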
Given Visual-Polar space images in resolution
W × H, the used pixel-wise Mean Absolute Error
(MAE) loss function can be described as
\sum_{y=0}^{H} \sum_{x=0}^{W} \left| I^{HR}_{x,y} - f_{x,y}(I^{LR}) \right| , (1)
where I^{HR} is the high resolution version of the image, I^{LR} is the low resolution image, and f is the reconstruction network. L1 is a robust loss function and close to perceptual difference (Bako et al., 2017), and Chaitanya et al. (2017) also found it to be effective against outliers.
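Written as a Keras-compatible loss, Equation (1) is simply the following; Keras averages instead of summing over pixels, which only rescales the loss and does not change its minimizer.

import tensorflow as tf

def mae_loss(y_true, y_pred):
    # mean over all pixels of |I_HR - f(I_LR)|
    return tf.reduce_mean(tf.abs(y_true - y_pred))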
[Figure 2 image: input sets of luminance, previous luminance and distance to fovea, with their 4096 spp reference targets.]
Figure 2: Teaching setup used in our experimental machine learning-based approach. Frames with the same border color are one set of inputs and their corresponding target. The missing data in the previous frame is visualized in red pixel values.
5 EXPERIMENT
For training the network for the experiment, 3200
training samples were collected from various scenes,
including the ones used in the experiments. Using
separate training frames from the experiment scenes is not
a problem in this application, since it is difficult for
a small network to learn scene-specific parameters,
and similar training could also be used in the final
use case. The network was trained using Keras with
its TensorFlow backend. The network was trained
for 40 epochs with the Adam optimizer (Kingma
and Ba, 2014) and a learning rate of 0.001, after which
no improvement in the cost function was observed.
Furthermore, in this experiment we used Spatiotemporal
Variance-Guided Filtering (SVGF) (Schied et al.,
2017) as the denoiser in the pipeline, and the training
data for the network was collected with the same
denoiser.
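The training configuration above corresponds to the following sketch, where model is the network sketched in Section 4.3, the data arrays stand in for the 3200 collected samples and their references, and the batch size is an assumption not stated in the text.

import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mae")
model.fit([luma_in, prev_luma_in, fovea_dist_in], reference_out,
          epochs=40, batch_size=8)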
[Figure 3 image: side-by-side comparison of the reference, the proposed CNN and the anisotropic reconstruction.]
Figure 3: Comparison of different methods. The gaze point is marked with a purple dot. The reference is a 4096 spp Cartesian space rendering. This shows how even a simple network learns to reduce some of the errors produced by a real-time denoiser.
For the experiment, four different scenes were
tested. From each scene, a short camera path of a few
seconds was recorded. For each camera path
two foveated renderings were compared with a fully
converged 4096 spp path traced reference. The first
foveated rendering was an anisotropic sampling from
Visual-Polar space to Cartesian space. The second
compared rendering was the upscaled output of the
proposed foveated CNN. Examples of the compared
renderings can be seen in Figure 3. The results seen
in Table 1 show that the proposed CNN improved the
peak signal-to-noise ratio (PSNR), structural
similarity (SSIM) and video multi-method assessment
fusion (VMAF) scores. The improvement in quality
could potentially be translated into a further
resolution reduction in the foveated rendering space and
therefore faster inference.
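Two of the reported metrics can be reproduced with scikit-image as sketched below; the image arrays are placeholders, and VMAF is typically computed with external tools such as libvmaf, so it is omitted here.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

reference = np.random.rand(960, 658).astype(np.float32)      # placeholder
reconstructed = np.random.rand(960, 658).astype(np.float32)  # placeholder
print(peak_signal_noise_ratio(reference, reconstructed, data_range=1.0))
print(structural_similarity(reference, reconstructed, data_range=1.0))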
Table 1: Results for the recorded camera paths in four path traced scenes.

Metric  Scene       Anisotropic  Proposed
PSNR    Classroom   26.1         26.6
        Fireplace   31.8         32.1
        Sponza      22.1         22.4
        SanMiguel   22.0         22.1
SSIM    Classroom   0.826        0.835
        Fireplace   0.899        0.902
        Sponza      0.695        0.704
        SanMiguel   0.613        0.625
VMAF    Classroom   26.7         37.7
        Fireplace   37.6         47.3
        Sponza      11.4         22.0
        SanMiguel   10.9         16.5
Running a 658 × 960 frame with the proposed
CNN with TensorRT acceleration (the arithmetic was
quantized to 16-bit half floating point precision to fit
the Tensor Core units) takes on average 2.9 ms per
frame on a single GeForce RTX 2080 Ti GPU. The
timing is likely fast enough, since it was measured
on a single contemporary GPU. This means that either
the quality or the speed, or both, can be improved sig-
nificantly on the next generations as inference hard-
ware acceleration support evolves. Furthermore, in
this experiment the network used 32-bit floating-point
precision. Using reduced precision would likely make
the network faster without a significant reduction in
quality (Venkatesh et al., 2017).
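A simple way to measure such per-frame cost is sketched below, assuming the Keras model from Section 4.3; it neither exports to TensorRT nor quantizes to FP16, so it only illustrates the measurement itself.

import time
import numpy as np

inputs = [np.zeros((1, 968, 666, 1), dtype=np.float32)] * 3  # padded inputs
model.predict(inputs)                     # warm-up run, excluded from timing
start = time.perf_counter()
for _ in range(100):
    model.predict(inputs)
print((time.perf_counter() - start) / 100 * 1000, "ms per frame")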
6 CONCLUSIONS
In this position paper we argued that machine learn-
ing algorithms are going to play a critical role in re-
constructing foveated rendering. Specifically, an in-
teresting option is to perform machine learning-based
reconstruction in the reduced foveated resolution to
reduce the computational complexity. To provide
initial evidence, we built a preliminary spatio-temporal
super-resolving CNN that reconstructs the
periphery part of a foveated rendering pipeline and
runs in the reduced foveated resolution. The
method was tested with four different scenes in which
it improved the PSNR, SSIM and VMAF scores com-
pared to anisotropic filtering. In our opinion, af-
ter a more comprehensive architecture and parame-
ter search, a similar idea could be used to greatly re-
duce the computational complexity and improve the
result of the reconstruction in foveated photorealistic
rendering, making it usable in realistic mixed reality
setups of the future.
We believe that the improved quality of the machine
learning-based reconstruction can be translated
into a further reduced foveated resolution and, therefore,
a lower overall rendering workload, but this remains
to be proven with more extensive user studies.
ACKNOWLEDGEMENTS
This work was supported by ECSEL JU project
FitOptiVis (project number 783162), the Tampere
University of Technology Graduate School, the Nokia
Foundation and the Emil Aaltonen Foundation.
REFERENCES
Albert, R., Godinez, A., and Luebke, D. (2019). Reading
speed decreases for fast readers under gaze-contingent
rendering. In Proceedings of the Symposium on Ap-
plied Perception.
Bako, S., Vogels, T., McWilliams, B., Meyer, M., Novák,
J., Harvill, A., Sen, P., DeRose, T., and Rousselle, F.
(2017). Kernel-predicting convolutional networks for
denoising Monte Carlo renderings. ACM Transactions
on Graphics (TOG), 36(4).
Barré-Brisebois, C. (2018). Game ray tracing: State-of-the-
art and open problems. High Performance Graphics
Keynote.
Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J.,
Wang, Z., and Shi, W. (2017). Real-time video super-
resolution with spatio-temporal networks and motion
compensation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition.
Chaitanya, C. R. A., Kaplanyan, A. S., Schied, C., Salvi,
M., Lefohn, A., Nowrouzezahrai, D., and Aila, T.
(2017). Interactive reconstruction of Monte Carlo im-
age sequences using a recurrent denoising autoen-
coder. ACM Transactions on Graphics (TOG), 36(4).
Dong, C., Loy, C. C., He, K., and Tang, X. (2015). Image
super-resolution using deep convolutional networks.
IEEE transactions on pattern analysis and machine
intelligence, 38(2):295–307.
Friston, S., Ritschel, T., and Steed, A. (2019). Perceptual
rasterization for head-mounted display image synthe-
sis. ACM Transactions on Graphics (TOG), 38(4).
Guenter, B., Finch, M., Drucker, S., Tan, D., and Snyder, J.
(2012). Foveated 3d graphics. ACM Transactions on
Graphics (TOG), 31(6).
Hays, J. and Efros, A. A. (2007). Scene completion us-
ing millions of photographs. ACM Transactions on
Graphics (TOG), 26(3).
Kajiya, J. (1986). The rendering equation. SIGGRAPH
Computer Graphics, 20(4).
Kaplanyan, A., Sochenov, A., Leimkühler, T., Okunev, M.,
Goodall, T., and Rufo, G. (2019). Deepfovea: Neu-
ral reconstruction for foveated rendering and video
compression using learned statistics of natural videos.
ACM Transactions on Graphics (TOG), 38(4).
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Suk-
thankar, R., and Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In
Proceedings of the IEEE conference on Computer Vi-
sion and Pattern Recognition.
Kilgariff, E., Moreton, H., Stam, N., and Bell, B. (2018).
NVIDIA Turing Architecture In-Depth.
https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/,
accessed: 2020-01-06.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Koskela, M., Immonen, K., Mäkitalo, M., Foi, A., Viitanen,
T., Jääskeläinen, P., Kultala, H., and Takala, J.
(2019a). Blockwise multi-order feature regression for
real-time path-tracing reconstruction. ACM Transac-
tions on Graphics (TOG), 38(5).
Koskela, M., Lotvonen, A., Mäkitalo, M., Kivi, P., Viitanen,
T., and Jääskeläinen, P. (2019b). Foveated real-time
path tracing in visual-polar space. In Proceedings of
the Eurographics Symposium on Rendering.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham,
A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang,
Z., et al. (2017). Photo-realistic single image super-
resolution using a generative adversarial network. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition.
Patney, A., Salvi, M., Kim, J., Kaplanyan, A., Wyman, C.,
Benty, N., Luebke, D., and Lefohn, A. (2016). To-
wards foveated rendering for gaze-tracked virtual re-
ality. Transactions on Graphics, 35(6).
Pharr, M., Jakob, W., and Humphreys, G. (2016). Phys-
ically based rendering: From theory to implementa-
tion. Morgan Kaufmann.
Schied, C., Kaplanyan, A., Wyman, C., Patney, A., Chai-
tanya, C. R. A., Burgess, J., Liu, S., Dachsbacher,
C., Lefohn, A., and Salvi, M. (2017). Spatiotempo-
ral variance-guided filtering: real-time reconstruction
for path-traced global illumination. In Proceedings of
High Performance Graphics.
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P.,
Bishop, R., Rueckert, D., and Wang, Z. (2016). Real-
time single image and video super-resolution using an
efficient sub-pixel convolutional neural network. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition.
Venkatesh, G., Nurvitadhi, E., and Marr, D. (2017). Ac-
celerating deep convolutional networks using low-
precision and sparsity. In IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP).
Vogels, T., Rousselle, F., McWilliams, B., Röthlin, G.,
Harvill, A., Adler, D., Meyer, M., and Novák, J.
(2018). Denoising with kernel prediction and asym-
metric loss functions. ACM Transactions on Graphics
(TOG), 37(4).
Weier, M., Roth, T., Hinkenjann, A., and Slusallek, P.
(2018). Foveated depth-of-field filtering in head-
mounted displays. Transactions on Applied Percep-
tion, 15(4).