REAL-TIME AMBIENT OCCLUSION ON THE PLAYSTATION3
Dominic Goulding
1
, Richard Smith
1
, Lee Clark
1
, Gary Ushaw
2
and Graham Morgan
2
1
CCP Games, Gateshead, U.K.
2
School of Computing Science, Newcastle University, Newcastle, U.K.
Keywords:
Ambient Occlusion, Playstation3, Graphics.
Abstract:
This paper describes how to implement ambient occlusion effects on the Playstation3 (PS3) while alleviating
processing demands on the GPU. The solutions proposed here are implementations that utilize the parallel
processing available on the PS3’s synergistic processing units (SPUs). Two well-known ambient occlusion
techniques are evaluated as candidate solutions for PS3 SPU implementations.
1 INTRODUCTION
Ambient occlusion (AO) is a technique for enhanc-
ing the perception of three-dimensional space in com-
puter graphics. The technique enhances an image via
the shadowing of ambient light. The accentuating of
small surface details and the provision of spatial clues
via contact shadows provide an increased degree of
realism (Hoberock and Jia, 2008) (McGuire et al.,
2011). This makes ambient occlusion a popular tech-
nique in the context of film and video games (Loos
and Sloan, 2010).
To achieve ambient occlusion in real-time an ap-
proach based on approximation is required. Such
techniques are convincing on the latest graphics cards
and allow the modern PC gamer to enjoy the height-
ened realism afforded by ambient occlusion. How-
ever, as current console graphic card technology is
dated the ability to achieve convincing ambient oc-
clusion in console games is difficult.
The Playstation3 (PS3) does provide an opportu-
nity to move some graphics calculations away from
the graphics card and onto its Cell Architecture
(which consists of 8 processing units). However, the
Cell Architecture affords quite a different processing
style than a GPU. This requires a different approach to
implementation and re-integration to a graphics scene
(generated by the GPU).
In this paper we describe an engineering approach
to achieving ambient occlusion on the PS3. As such,
we are not proposing a new technique in ambient oc-
clusion but are proposing an implementation suitable
for deployment on the Cell Architecture.
2 BACKGROUND
2.1 Playstation3 Architecture
The Playstation3’s CPU architecture is cell-based,
consisting of six Synergistic Processing Units (SPUs)
around the central processor (plus a further two which
are not accessible to the developer). These cells have
a limited amount of memory (256k) for combined
program and data, and the DMA access to this mem-
ory is comparatively slow. Efficient programming of
SPUs is therefore reliant on identifying jobs which
can run independently within that memory, with in-
frequent calls on main memory. The SPU processors
are single instruction multiple data (SIMD) devices.
2.2 Ambient Occlusion
Ambient occlusion is defined as the amount of am-
bient light that is able to reach a point, which is not
occluded by other points (i.e. it simulates the shad-
owing caused by nearby objects from indirect light).
This can be achieved by casting ‘rays’ from the point,
and determining if these rays are obstructed. AO is
then calculated as the integral of a visibility function
over a unit hemisphere (Loos and Sloan, 2010).
Screen Space Ambient Occlusion (SSAO), is a
technique used to approximate the obscurance inte-
gral (Shanmugam and Arikan, 2007). Implementa-
tions of SSAO in games use a point sampling tech-
nique to approximate the occlusion integral. This in-
volves computing the obscurance for each pixel on
screen by taking samples around the pixel. The cor-
responding depth information from the depth buffer
295
Goulding D., Smith R., Clark L., Ushaw G. and Morgan G..
REAL-TIME AMBIENT OCCLUSION ON THE PLAYSTATION3.
DOI: 10.5220/0003820202950298
In Proceedings of the International Conference on Computer Graphics Theory and Applications (GRAPP-2012), pages 295-298
ISBN: 978-989-8565-02-0
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
is then used to compute how much of a surrounding
neighbourhood of the point in the scene is obscured
by objects.
Whilst making the assumption that the falloff
function is constant allows for efficient calculations
of the obscurance integral, it is also possible to select
a specific falloff function for an efficient implemen-
tation that maintains the complexity of the full radio-
metric model (McGuire et al., 2011).
2.3 Contribution of Paper
This paper shows that it is possible to move ambi-
ent occlusion calculations to the Playstation3’s SPUs,
freeing up processing time on the GPU, without a no-
ticeable reduction in quality. The paper introduces
a method for distributing full-screen ambient occlu-
sion into ”SPU-sized” chunks of calculation. Two
techniques for achieving ambient occlusion are im-
plemented and compared - line-sampling, and taking
a specific fall-off function - both shown to be viable
approaches on the Sony hardware. A number of opti-
misations, taking advantage of the Playstation3 archi-
tecture, are also presented.
3 IMPLEMENTATION
Both the line-sampling technique (Loos and Sloan,
2010) and (Ownby, 2010), and the technique of tak-
ing the specific falloff function (McGuire et al., 2011)
were implemented. The line sampling algorithm
(which only requires the depth buffer values), is ad-
vantageous due to the limited local memory of each
SPU. Whilst this technique boasts reduced sample
counts in comparison to point sampling, further re-
ductions in the sample count can be made by using
the fall-off function.
3.1 Performing Calculations on the
SPUs
A GPU based implementation uses a fullscreen 32bit
depth buffer for SSAO calculations; at 1280 × 720
screen resolution, this requires approximately 3.5MB.
However, each SPU on the Playstation3 has 256kB
of local memory and can only access external mem-
ory through direct memory access (DMA), which can
have a significant delay between requests and com-
pletion (Engstad, 2010).
Splitting the screen into sections and performing
SSAO calculations on a block at a time is not a vi-
able solution, as pixels at the edge of a block will
not have access to the required depth buffer sam-
ples. This issue also occurs at the edge of the screen
in a fullscreen implementation, however this can ei-
ther be solved by rendering to a slightly larger im-
age and cropping (McGuire et al., 2011), or ensuring
that samples outside of the screen return a very large
depth value, meaning they never contribute to occlu-
sion (Filion and McNaughton, 2008). Storing the full
screen depth buffer in main memory and then using
DMA calls for each sample when it is needed is also
inefficient due to the large number of DMA transfers.
3.1.1 Arranging the Input
Whilst the Playstation3 allows access to the depth
buffer, the data are stored in a specific tiled format
that is not suitable for our SSAO calculations. Before
reading the information for the SPU tasks this tiled
depth buffer must be reordered into a linear buffer.
This was performed in a pre-pass rendering stage,
storing the detiled depth buffer in main memory.
The next step is to arrange the data for con-
currently running SPUs. The screen was split into
64 × 64 pixel tiles, with each SPU calculating occlu-
sion values on a single tile at a time. A 128 × 128 tile
of depth information was read from the pixels sur-
rounding and including the inner 64 × 64 tile. We
therefore restricted samples to a maximum of 32 pix-
els away from the target pixel. While this does cause
a slight loss of accuracy for occlusion values, partic-
ularly with objects very near to the camera, for the
majority of cases this restriction was not noticeable
(indeed, this problem was further reduced by using
downsampled buffers, described below). Depth val-
ues overlapping the screen edges were set to very
large depths, as in many full-screen implementations.
This approach means that there is an overlap of reads
from the depth buffer, but no overlap when writing to
the buffer that stores the occlusion values.
Figure 1: Arranging the depth information for input.
Using this configuration each SPU was required to
store a 128 × 128 buffer of 4 byte depth values, and a
64 × 64 buffer of single byte occlusion values. Both
of these were ‘double buffered’ (see below), mean-
ing that approximately 140kB of local memory was
needed. This is comfortably within the 256kB local
GRAPP 2012 - International Conference on Computer Graphics Theory and Applications
296
memory available to a SPU. For any DMA transfer
of more than 16 bytes the size of the transfer must
be a multiple of 16 bytes, and must be aligned on a
16 byte boundary (Augonnet, 2007). The size of tiles
used ensures safe DMA transfer requests for each tile.
3.1.2 Combining with Lighting
The occlusion values must now be combined with the
current scene lighting. The occlusion values from
each of the SPUs’ local memory is written to a sin-
gle external buffer in the RSX graphics memory. This
is a little slower than writing to main memory, but
only the GPU requires access to these occlusion val-
ues, so storing the values in RSX memory reduces
main memory usage whiles providing fast access to
the occlusion values during the lighting stage.
The occlusion values are stored as a single byte
for each pixel. They are read as values from zero to
one and combined with the scene lighting in a pixel
shader by multiplying each pixel’s colour value by the
corresponding occlusion value. This has the effect of
dimming the lighting where AO occurs.
3.2 Optimizations
3.2.1 Downsampling
The first optimization was to perform calculations on
a
1
4
sized buffer. The depth buffer was downsampled
while it was detiled, so the reduced buffer was stored
in main memory. SSAO calculations were then per-
formed at this reduced resolution and output to a tex-
ture of the same resolution. This reduced the memory
used, and also significantly reduced calculation times.
Downsampling also had the advantage of increas-
ing our sampling range in screen space. Each SPU
still performs calculations on the same number of pix-
els, but with a downsampled buffer these pixels cor-
respond to a larger amount of screen space. Taking
samples from a maximum of 32 pixels away is equiv-
alent to taking samples up to 64 pixels away in a full
resolution buffer, allowing for a wider ambient effect
and improving the accuracy of occlusion results for
nearby objects in the scene.
Using a reduced resolution buffer is a common
way to increase the performance of SSAO algorithms,
providing a significant performance increase with
only minimal loss of detail. The decision whether to
use a fullscreen or downsampled buffer is a compro-
mise between performance, memory and visual qual-
ity, and will be application specific.
3.2.2 Double Buffering
A further performance increase came from double
buffering, allowing each SPU to perform occlusion
calculations on its current tile at the same time as
sending its previous tile’s occlusion results and re-
trieving the next tile’s depth information.
A pair of input buffers (for the depth values) and a
pair of output buffers (for the occlusion values) were
created, which were then alternated so that at any spe-
cific time, one of each is being used for memory trans-
fer while the other is being accessed by the AO cal-
culations. Each SPU requests a DMA transfer of the
previous tile’s occlusion values to main memory, and
a further DMA transfer to fill the free input buffer
with the next tile’s depth information. While these
transfers are occurring, the SPU uses the current depth
information to calculate the occlusion values. In this
way a SPU task does not have to wait for DMA re-
quests to complete before being able to perform cal-
culations on the current tile, improving performance.
3.2.3 Single Instruction, Multiple Data
The SPUs are capable of single instruction multi-
ple data (SIMD) operations (Gschwind, 2006). As
the SSAO algorithms perform the same operations
on each pixel in turn, the code was adapted to per-
form SSAO calculations on four pixels at a time using
SIMD. Downsampling the depth and normal buffers,
and blurring results were also performed using SIMD
instructions. As SIMD instructions were included
throughout the SPU code there was an improvement
in the calculation time by approximately four times.
3.3 Bilateral Filter
To smooth the results a 2D Gaussian blur was applied
which split the buffer into horizontal strips, followed
by vertical strips, allowing a single SPU to perform
the blur calculations one strip at a time.
While a Gaussian blur successfully smooths the
results, it causes artifacts (known as ‘halos’) in the
image. This effect occurs due to occlusion values
bleeding across the edges of objects (Filion and Mc-
Naughton, 2008). This lightens areas that should be
dark due to occlusion and darkens edges that should
not be occluded. To reduce the halo effect we used
a bilateral filter, where every sample is replaced by
a weighted average of its neighbours’ depth values
(Elad, 2002).
REAL-TIME AMBIENT OCCLUSION ON THE PLAYSTATION3
297
4 RESULTS AND EVALUATION
The two SSAO techniques were implemented using
the Playstation 3’s SPU architecture.We also imple-
mented fullscreen and downsampled versions for both
approaches. We found that using 12 samples for the
volumetric obscurance method and 6 samples for the
falloff function method provided similar performance
and both produced a good quality image.
4.1 Visual Quality
Visually comparing the two implementations at both
fullscreen and downsampled resolution shows that all
versions are of a high quality. In the game Uncharted
2 (Hable, 2010), where SSAO was also implemented
on the SPUs, the two visual downfalls were a visible
cross pattern of occlusion, and halos around objects.
Neither of our implementations suffer from a cross
pattern, whilst halos have been significantly reduced.
The volumetric obscurance method gives defined
occlusion values with little noise, and appears to be
of a similar quality to that seen in Toy Story 3: The
Game (Ownby, 2010). However this method suf-
fers from creating occlusion values that are focused
around the edges of objects, not providing wide area
results as seen in the falloff function method.
The falloff function method provides the best re-
sults for wide area ambience, capturing a greater
sense of the overall geometry of the scene. As this
method uses the surface normals, it is also able to
highlight details from the normal maps of the objects’
surfaces. The falloff function method can however
suffer from objects causing self-occlusion (this is con-
sistent with the findings of (McGuire et al., 2011)),
and the image as a whole is more noisy than that seen
in the volumetric occlusion method.
4.2 Performance
In our implementation, both SSAO techniques can run
on as many SPUs as desired. Whilst performance ob-
viously improves with more SPUs, it is unlikely to be
possible to allocate all 6 SPUs to perform SSAO cal-
culations. Table 1 shows performance results, mea-
sured in total SPU time. In our fastest implementa-
tion (using the falloff function at a downsampled res-
olution), occlusion results were calculated in 46.8ms,
if all six SPUs are assigned to this task therefore, the
time it takes for this task to complete is 7.8ms.
Table 1: Performance figures showing total SPU time.
Volumetric Obsc. Falloff Fn.
Downsampled 55.2ms 46.8ms
Fullscreen 209ms 166ms
We also created a version using only the PPU, no
tiling was required (much like a GPU implementa-
tion) and we tested on a fullscreen resolution buffer.
This was significantly slower, increasing frame render
times by over 1500ms.
Whilst all four of the SPU implementation re-
sults allow for fully interactive frame rates in test lev-
els, the large increase in speed seen in the downsam-
pled methods make them much more desirable than
the fullscreen equivalent. Our fastest result remains
slower than GPU implementations however. Despite
this,our implementation has the advantage of not im-
pacting GPU performance, giving a trade-off between
2.3ms of GPU time and 7.8ms of CPU time.
5 CONCLUSIONS
We have successfully achieved high quality SSAO ef-
fects using the Playstation3’s SPUs, confirming that
this technique can be used as an alternative to GPU-
based implementations. Our techniques are currently
viable in an application that is hindered by GPU per-
formance, but not with CPU performance.
REFERENCES
Augonnet, C. (2006-2007). An introduction to IBM cell
processor.
Elad, M. (2002). Algorithms for noise removal and the bi-
lateral filter.
Engstad, P.-K. (2010). Introduction to SPU optimizations.
Naughty Dog.
Filion, D. and McNaughton, R. (2008). Starcraft 2 effects
and techniques. In Advances in Real-Time Rendering
in 3D Graphics and Games Course, SIGGRAPH.
Gschwind, M. (2006). The cell broadband engine: Exploit-
ing multiple levels of parallelism in a chip multipro-
cessor. Technical report, IBM Research Division.
Hable, J. (2010). Uncharted 2: HDR lighting. Game Devel-
opers Conference.
Hoberock, J. and Jia, Y. (2008). High-quality ambient oc-
clusion. In GPU Gems 3. Addison-Wesley.
Loos, B. J. and Sloan, P.-P. (2010). Volumetric obscurance.
McGuire, M., Osman, B., Bukowski, M., and Hennessy,
P. (2011). The alchemy screen-space ambient ob-
scurance algorithm. In High-Performance Graphics
2011.
Ownby, J.-P. (2010). Toy Story 3: The video game render-
ing techniques. In Advances in Real-Time Rendering
Course, SIGGRAPH.
Shanmugam, P. and Arikan, O. (2007). Hardware acceler-
ated ambient occlusion techniques on GPUs. In Pro-
ceedings of the 2007 Symposium on Interactive 3D
Graphics and Games, I3D ’07, pages 73–80. ACM.
GRAPP 2012 - International Conference on Computer Graphics Theory and Applications
298