ROBUST GRAYSCALE CONVERSION FOR VISION-SUBSTITUTION SYSTEMS

Codruta Orniana Ancuti, Cosmin Ancuti and Philippe Bekaert

Hasselt University - tUL - IBBT, Expertise Center for Digital Media, Belgium

Keywords: Color-to-Gray, Visual saliency, Auditory substitution systems.

Abstract:

Substitution systems have shown considerable potential for mobility assistance for visually disabled persons. In particular, proficient users of auditory-vision substitution are able to identify and reconstruct visual targets. The content of the non-visual image is simplified in order to minimize the cognitive effort needed for recognition and to reduce the duration of the sound patterns. For these reasons, many existing substitution systems discard the color information and work with grayscale images. This paper presents a robust and effective color-to-gray transformation that preserves both the color contrast and the saliency of the original image. The study builds on the hypothesis that visually salient areas are tightly connected with visual attention. We show that an appropriate translation allows a more accurate rendering of the important image regions and also creates a better mental representation of the environment.

1 INTRODUCTION

Human orientation and mobility are highly correlated with the capacity to build mental maps of spaces and of the possible navigation paths in the environment. Much of this information is gathered through the visual channel. Visually disabled persons lack this crucial information and consequently face great difficulties orienting themselves in novel environments, since they cannot readily create mental maps of spaces. Recent technological advances have fostered the development of portable, non-invasive substitution systems (Meijer, 1992; Capelle C., 1998; Pun et al., 2007) designed for visually disabled persons. These systems aim to provide assistance at the perceptual level by compensating for the deficiency in the visual sense. Their main task is to translate the acquired image and make it available to other senses such as hearing, touch or smell. For haptic perception, the white cane provides low-resolution information about the nearby surroundings, the feet estimate the characteristics of the navigation surface, while the palms and fingers provide high-resolution information that allows fine recognition of object textures and forms (Pun et al., 2007). Complementarily, the auditory channel usually supplies information about events (e.g. the presence of a person), scene distances and a rough interpretation of the environment (Hill, 1993).

The work presented in this paper focuses on image-to-sound substitution systems, with the goal of generating a more appropriate map representation of the scene. The aim of such systems is to induce representations or mental images in visually disabled users (in general, proficient users) through an imagery process. Investigations in the field of neural rehabilitation explain these phenomena through cross-modal brain plasticity, where large areas of the brain cortex of visually disabled persons are recruited to process non-visual tasks (Auvray M., 2005; Capelle C., 1998; Cronly-Dillon and Persaud, 1999). Most standard systems employ a fast transformation scheme that selects only the intensity channel and modulates the signal amplitude in proportion to the pixel intensity value. However, this straightforward technique may fail to render the scene appearance in a perceptually proper way because the color information is not considered. Taken separately, color does not provide enough information about object forms or scene geometry. Nevertheless, by modeling the gray translation with color information we can implicitly identify color cues (mostly corresponding to texture or to particular classes such as sky, grass or flowers) besides those cues that are provided by intensity variations. Additionally, a good rendering of the most salient regions compensates for the missing information about the most attractive areas in the scene and helps to focus attention on the most interesting regions. With this approach the chances of identifying objects and persons in the scene are substantially increased (see figure 1).

Figure 1: Salient region detection. (a) Salient regions (intensity and color): the left-hand images show the salient regions detected by the algorithm of Itti et al. (Itti and E., 1998) when both color and intensity are considered. (b) Salient regions (only intensity): the right-hand images show the salient regions detected when only intensity is used as a discriminating feature.

Ancuti C., Ancuti C. and Bekaert P. (2010).
ROBUST GRAYSCALE CONVERSION FOR VISION-SUBSTITUTION SYSTEMS.
In Proceedings of the International Conference on Computer Vision Theory and Applications, pages 130-135
DOI: 10.5220/0002846701300135
SciTePress

We propose an effective color-to-gray method that preserves the high-contrast appearance of salient regions. We aim to develop a decolorization method that enhances the contrast of the grayscale image so that it better reflects the chromatic contrast of the original color image. Additionally, we aim to reduce the loss of visual information in the converted image. The utility of this approach has been tested with the well-known sound substitution system vOICe (Meijer, 1992; Meijer, 1998). The experiments indicate that our decolorized images preserve the saliency of the original color scene better than the standard grayscale conversion and other specialized techniques.

2 RELATED WORK

While the history of vision-substitution systems stretches back to the 1960s and 1970s (y Rita, 1967; Fish, 1976), more recent approaches have also been introduced. The common stages performed by auditory-vision substitution systems are: image acquisition, image processing (e.g. grayscale conversion, detail visibility enhancement, gamma correction, automatic label/object recognition, etc.) and finally image translation into sound frequencies.

Several well-known approaches, including vOICe (Meijer, 1992; Meijer, 1998) and PSVA (Prosthesis for Substitution of Vision by Audition) (Capelle C., 1998), deal only with grayscale images. The rendering operation scans the image from left to right and computes, per pixel, the audio amplitudes and frequencies (by various schemes) that are finally rendered to the user. In the vOICe approach (Meijer, 1998) the pitch is given by the vertical position in the visual pattern and the loudness is proportional to the luminance; therefore white is played loudly and black silently. The PSVA (Capelle C., 1998) is based on a rough model of the primary visual system with two resolution levels, one that corresponds to an artificial central retina and one that corresponds to a simulated peripheral retina. The Vibe approach (Auvray M., 2005) splits the image into configurable, distributed receptive fields, each of which takes the mean value of the gray levels in its allocated area. The basic components of the sound are sinusoids produced by virtually placed sources.

Image simplification is employed in the approach of Cronly-Dillon and Persaud (Cronly-Dillon and Persaud, 1999). This system reduces the image information by selectively permitting, at the user's choice, the separate extraction of horizontal or oblique lines. The main idea is that feature extraction can segment the image before translation and helps in recognizing patterns (squares, circles, polygons). After this step, image-to-sound rendering follows a scheme in which the pixels in a column define a chord and the horizontal lines are played sequentially as a melody.

Recently, several color-to-gray algorithms have been introduced in the literature in order to overcome the problems of the standard grayscale conversion, which employs only the luminance channel. Although the results are quite promising, the computational complexity of most of the proposed techniques is still an important bottleneck. The general goal of the transformation is to generate an image that preserves the image appearance rather than simply recording light intensities (as in the standard approach). Color plays a significant role in scene interpretation in terms of visual perception. In general, the distribution of the color contrast is obtained by evaluating the color differences between image pixels. Gooch et al. (Gooch et al., 2005) proposed a technique that attempts to preserve the sensitivity of the human visual system by comparing each color pixel value with the average of its neighboring region. The algorithm is computationally very expensive and performs poorly on high-resolution images. Rasche et al. (Rasche et al., 2005) introduced a method that computes the distribution of all the image colors after quantizing them into a number of landmark points. Due to the color quantization, some image details can be lost. Grundland and Dodgson (Grundland and Dodgson, 2007) introduced a faster technique that decolorizes images by fusing chrominance and luminance; as presented in the following sections, it inspired our approach. Neumann et al. (Neumann et al., 2007) compute the gradient field with two different formulas, one that takes advantage of the Coloroid (Nemcsics, 1987) color space and a second that presents a generalized technique based on CIELab.

3 SUBSTITUTION SYSTEMS OVERVIEW

We have chosen to build our approach on the well-known vOICe system (Meijer, 1998). Because pixel intensity values are reflected in the amplitude of the frequency components (perceived as sound loudness), we have investigated ways to exploit the available bandwidth optimally in order to enhance the important visual cues of the scene. Difficulties in scene understanding appear when visually salient regions are not accurately represented, which results in misinterpretation of the local and global content of the scene. A general overview of the system is presented in the following (see figure 2). The system translates the acquired frontal images into a time-multiplexed auditory representation. Each image is rendered at a resolution of 64 × 64 pixels in an approximate conversion time of T = 1.05 seconds. The translation is a per-pixel operation that encodes the vertical position into frequency and the horizontal position into time. The pixel intensity gives the oscillation amplitude; therefore white is mapped to a loud tone and black to silence of the associated oscillator.

Figure 2: Overview of the vOICe auditory-substitution system. Stages shown: color-to-gray conversion, image scanning, audio translation, sound spectrum.

First, each element of the image matrix is assigned one of the G gray tones:

$$P = (p_{ij}), \quad p_{ij} \in \{g_1, \ldots, g_G\}, \quad i, j = 1 \ldots N, \; N = 64 \tag{1}$$

where i and j are the row and column indices, both limited to the maximum value N = 64 (the input image has a resolution of 64 × 64 pixels).

Each of the N columns is played for T/N seconds as part of the signal s(t). As already mentioned, the amplitudes of the sinusoidal components of s(t) are proportional to the intensity levels. With $\omega_i = 2\pi f_i$, the sound pattern transformation is expressed mathematically as follows:

www.seeingwithsound.com

VISAPP 2010 - International Conference on Computer Vision Theory and Applications


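$$s(t) = \sum_{i=1}^{N} p_{ij} \cdot \sin(\omega_i t + \theta_i), \quad t \in \left[\, t_k + (j-1) \cdot \frac{T}{N}, \; t_k + j \cdot \frac{T}{N} \,\right], \quad j = 1 \ldots N, \; k = 1, 2, \ldots \tag{2}$$

Here $t_k$ denotes the start time of the k-th image and $\theta_i$ the phase of the i-th oscillator.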
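The sum in equation 2 can be illustrated with a minimal Python sketch. It assumes zero phase for every oscillator, a hypothetical sample rate of 8000 Hz, and gray levels already scaled to [0, 1]; the function names `render_column` and `render_image` are illustrative and are not taken from the vOICe implementation.

```python
import math


def render_column(amplitudes, freqs_hz, t_start, duration, sample_rate=8000):
    """Sum of sinusoids for one image column (zero phase is an assumption)."""
    n_samples = int(round(duration * sample_rate))
    samples = []
    for n in range(n_samples):
        t = t_start + n / sample_rate
        # Each row contributes one oscillator whose amplitude is the gray level.
        value = sum(a * math.sin(2.0 * math.pi * f * t)
                    for a, f in zip(amplitudes, freqs_hz))
        samples.append(value)
    return samples


def render_image(image, freqs_hz, total_time=1.05, sample_rate=8000):
    """image[i][j] is a gray level in [0, 1]; row i selects the frequency,
    column j selects the time slot of length total_time / n_cols."""
    n_cols = len(image[0])
    slot = total_time / n_cols
    signal = []
    for j in range(n_cols):
        column = [row[j] for row in image]
        signal.extend(render_column(column, freqs_hz, j * slot, slot, sample_rate))
    return signal
```

Row i of the image drives the oscillator with frequency `freqs_hz[i]`, so the caller decides which row carries the highest pitch.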

The algorithm distributes the frequencies equidistantly, as expressed in equation 3. In addition to the linear frequency distribution, the approach also allows an exponential distribution of frequencies to render the patterns (see equation 4):

$$f_i = f_l + \frac{i-1}{N-1} \cdot (f_h - f_l), \quad i = 1 \ldots N \tag{3}$$

$$f_i = \left(\frac{f_h}{f_l}\right)^{\frac{i-1}{N-1}} \cdot f_l, \quad i = 1 \ldots N \tag{4}$$

where $f_l$ (default $f_l$ = 500 Hz) and $f_h$ (default $f_h$ = 5 kHz) are the lowest and the highest frequency, respectively.
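The two frequency distributions can be compared with a short Python sketch. The function names are hypothetical; the defaults follow the 500 Hz and 5 kHz limits stated above, and the index i runs from 1 to N as in equations 3 and 4.

```python
def linear_frequencies(n, f_low=500.0, f_high=5000.0):
    """Equidistant frequencies between f_low and f_high (equation 3)."""
    return [f_low + (i - 1) / (n - 1) * (f_high - f_low)
            for i in range(1, n + 1)]


def exponential_frequencies(n, f_low=500.0, f_high=5000.0):
    """Frequencies with a constant ratio between neighbors (equation 4)."""
    return [f_low * (f_high / f_low) ** ((i - 1) / (n - 1))
            for i in range(1, n + 1)]
```

Both lists start at the lowest and end at the highest frequency; the exponential variant keeps a constant ratio between neighboring frequencies, so its intermediate values lie below those of the linear variant.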

Finally, after each image a synchronization click is inserted as a distinct end-of-frame mark; it indicates the end of the played image and the beginning of a new input.

4 SALIENCY PRESERVING DECOLORIZATION

Decolorization, or color-to-gray conversion, can be seen as an information compression operation, since it maps three-dimensional information onto a single dimension. The standard transformation, which employs only the luminance channel, neglects the color information, and as a consequence visually important features are lost in many cases. This is because different isoluminant colors are mapped to the same intensity level. Recently introduced decolorization methods aim for a better conservation of the original scene content after compression. Since the majority of existing methods are computationally expensive, in this work we have chosen to adapt the approach of Grundland and Dodgson (Grundland and Dodgson, 2007), mainly because this technique effectively preserves the chromatic contrast of the original image at a low computational cost. However, our experiments disclosed several important limitations of this technique. Because the technique considers only a single dominant color axis, it may fail to preserve a consistent appearance for images that are characterized by a uniform hue distribution, since a single hue is strongly favored (see figure 3).
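The isoluminance problem mentioned above can be made concrete with a small worked example in Python. It uses the luminance weights that appear in the first row of the YPQ matrix (equation 5); the two colors are chosen here purely for illustration.

```python
def luminance(r, g, b):
    """Luminance-only grayscale value with the weights 0.2989, 0.587, 0.114."""
    return 0.2989 * r + 0.587 * g + 0.114 * b


# Pure red and a darker green chosen so that both have the same luminance.
red = (1.0, 0.0, 0.0)
green = (0.0, 0.2989 / 0.587, 0.0)

gray_red = luminance(*red)
gray_green = luminance(*green)
# gray_red and gray_green are numerically equal (up to rounding), so a
# luminance-only conversion cannot separate these two clearly different colors.
```

In an image-to-sound system that maps gray level to loudness, a red object on such a green background would therefore produce no audible contrast.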

Taking these aspects into consideration, we develop a new technique that addresses several issues: chromatic contrast adaptation based on hue distribution identification, image intensity manipulation, and final constraints that provide a consistent output even when the parameters (e.g. the chromatic contrast λ) are pushed to high values. We introduce several additional parameter constraints that generate more pleasing results and also give better control in comparison with the original technique.

Figure 3: Spectrum representation. Panels: color image, Grundland and Dodgson, our result. As can be seen, the Decolorize (Grundland and Dodgson, 2007) approach fails to preserve a consistent appearance since a single hue is strongly favored. Notice the appearance of the red flowers.

The main steps of the algorithm are described in the following. The presentation focuses mainly on the modifications that have been added to fit this strategy properly into our system.

The transformation of the color image is performed in the linear YPQ color space. The channel Y ∈ [0, 1] is the achromatic luminance channel, and the channels P ∈ [−1, 1] and Q ∈ [−1, 1] are the opponent-color channels: yellow-blue and red-green, respectively.

$$\begin{bmatrix} Y \\ P \\ Q \end{bmatrix} = \begin{bmatrix} 0.2989 & 0.587 & 0.114 \\ 0.5 & 0.5 & -1 \\ 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{5}$$

Besides the luminance channel Y, the chromatic quantities hue (H) and saturation (S) are computed in a straightforward way as follows:

$$H = \tan^{-1}\left(\frac{Q}{P}\right) \tag{6}$$

$$S = \sqrt{Q^2 + P^2} \tag{7}$$
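The color space conversion and the hue and saturation computation can be sketched in Python as follows. The function names are hypothetical, RGB values are assumed to lie in [0, 1], and the four-quadrant `math.atan2` is used in place of a plain arctangent as an assumption, so that P = 0 does not cause a division by zero.

```python
import math


def rgb_to_ypq(r, g, b):
    """Linear RGB -> YPQ transform with the matrix of equation 5."""
    y = 0.2989 * r + 0.587 * g + 0.114 * b   # achromatic luminance
    p = 0.5 * r + 0.5 * g - b                # yellow-blue opponent channel
    q = r - g                                # red-green opponent channel
    return y, p, q


def hue_saturation(p, q):
    """Hue as the angle of (P, Q) in radians, saturation as its length."""
    hue = math.atan2(q, p)   # assumption: four-quadrant form of tan^-1(Q/P)
    sat = math.hypot(p, q)
    return hue, sat
```

For example, pure yellow (1, 1, 0) maps to P = 1 and Q = 0, which gives a hue angle of 0 and a saturation of 1.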

The first step after the image is converted into the YPQ color space is to analyze the distribution of the chromatic contrast of the image features. This is done by computing the color differences between pairs of pixels that are chosen randomly by Gaussian pairing. The main idea of this approach is that nearby pixels are likely to have similar colors, since they may be part of the same feature, while more distant pixels have an increased chance of having different colors. To identify the principal chromatic contrast axis, the method uses predominant component analysis. This is a derivation of the well-known dimensionality reduction technique, principal component analysis (PCA) (Dunteman, 1989). The method optimizes the differences between observations by projection onto the two principal chromatic contrast axes. The purpose of these chromatic axes is to recover, within a single direction, the color contrast magnitude that is not contained in the luminance channel. This search maximizes the covariance between the chromatic contrast and the weighted polarity of the luminance contrast. However, this approach has the drawback that a single chromatic axis cannot distinguish color contrasts that are perpendicular to it. Despite this disadvantage, the image contrast is in general pleasantly enhanced. The main advantage of this approach is that the processing time is linear in the image resolution. The Gaussian pairing sampling technique reduces the time spent computing color differences in comparison with similar techniques (Gooch et al., 2005; Rasche et al., 2005).
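The pairing and axis estimation described above can be approximated with the following Python sketch. It is a deliberately simplified illustration, not the published algorithm: each pair is weighted only by the sign of its luminance difference (an assumption standing in for the weighted polarity), and the names `gaussian_pairs`, `predominant_axis`, `sigma` and `seed` are hypothetical.

```python
import math
import random


def gaussian_pairs(width, height, sigma, rng):
    """Pair every pixel with a partner displaced by a Gaussian offset,
    clamped to the image borders. One pair per pixel keeps the cost linear."""
    pairs = []
    for y in range(height):
        for x in range(width):
            px = min(max(int(round(x + rng.gauss(0.0, sigma))), 0), width - 1)
            py = min(max(int(round(y + rng.gauss(0.0, sigma))), 0), height - 1)
            pairs.append(((x, y), (px, py)))
    return pairs


def predominant_axis(ypq, sigma=2.0, seed=0):
    """ypq[y][x] = (Y, P, Q). Returns a unit direction in the (P, Q) plane,
    or (0.0, 0.0) if no chromatic contrast is found."""
    rng = random.Random(seed)
    height, width = len(ypq), len(ypq[0])
    axis_p = axis_q = 0.0
    for (x, y), (px, py) in gaussian_pairs(width, height, sigma, rng):
        y0, p0, q0 = ypq[y][x]
        y1, p1, q1 = ypq[py][px]
        d_lum = y0 - y1
        # Simplification: only the sign of the luminance contrast is used.
        polarity = (d_lum > 0) - (d_lum < 0)
        axis_p += polarity * (p0 - p1)
        axis_q += polarity * (q0 - q1)
    norm = math.hypot(axis_p, axis_q)
    if norm == 0.0:
        return 0.0, 0.0
    return axis_p / norm, axis_q / norm
```

Because every pixel contributes exactly one pair, the running time of this sketch grows linearly with the number of pixels, which matches the complexity argument made in the text.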

Following the predominant component analysis step, which determines the representative color contrast axis, the next step is to fuse the chromatic information with the luminance channel Y. The predominant chromatic channel C contributes to the luminance with a degree of contrast enhancement λ (default value 0.5). Our experiments on more than 200 images have shown that this parameter should also be correlated with the chromatic distribution (see figure 4). Hue histogram analysis controls the parameter η. The parameter is equal to 1 if the image hue distribution does not contain both red-green and yellow-blue opponent pairs, and equal to 3 if the hue distribution covers the entire range. The fused output Y' is computed as:

$$Y' = (\eta Y + \lambda C)/\eta \tag{8}$$

In order to maintain the luminance polarity, the chromatic axis must be oriented so that the chromatic contrast it generates agrees in polarity with the luminance contrast.
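The fusion step of equation 8 can be sketched in Python as below. The sketch assumes that the chromatic channel C is obtained by projecting (P, Q) onto a unit chromatic axis; the function name `fuse` and its argument layout are hypothetical.

```python
def fuse(y, p, q, axis, lam=0.5, eta=1.0):
    """Fused gray value (eta * Y + lam * C) / eta for one pixel.

    axis is a unit vector (a_p, a_q) in the (P, Q) plane; C is taken here
    as the projection of (P, Q) onto that axis, which is an assumption.
    """
    c = p * axis[0] + q * axis[1]
    return (eta * y + lam * c) / eta
```

With η = 1 the chromatic term is added at full strength λ, while η = 3 scales its contribution down to λ/3; two isoluminant colors with opposite P values therefore receive different gray levels in both cases.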

In many cases an important desired feature is the ability to control the contrast effects. The goal of image decolorization is to obtain a perceptual preservation of the original saliency (see figure 4). During our tests we noticed that for higher values of λ (λ = 1) the salient regions are better preserved, as measured by applying the well-known algorithm of Itti et al. (Itti and E., 1998), which identifies the most salient regions. We consider the saliency to be preserved only if the salient regions detected in the color image remain in almost the same locations in the decolorized image.
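The text does not give a quantitative criterion for "almost the same regions". One hypothetical way to express it, shown here only as an illustration, is the intersection-over-union of two binary saliency masks, one computed on the color image and one on the decolorized image. The saliency detector itself is outside the scope of this sketch, and the name `saliency_overlap` is an assumption.

```python
def saliency_overlap(mask_color, mask_gray):
    """Intersection-over-union of two binary masks given as nested lists.

    Returns 1.0 for identical masks (including two empty masks) and 0.0
    when the salient regions do not overlap at all.
    """
    inter = union = 0
    for row_c, row_g in zip(mask_color, mask_gray):
        for c, g in zip(row_c, row_g):
            inter += 1 if (c and g) else 0
            union += 1 if (c or g) else 0
    return inter / union if union else 1.0
```

A value close to 1 would indicate that the decolorized image keeps the salient regions where they were in the color image; any threshold on this value would have to be chosen empirically.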

Figure 4: Panels: color image, our result, CIE Y, Grundland and Dodgson. The Decolorize (Grundland and Dodgson, 2007) conversion fails to maintain the color salient regions, while our result better preserves the original salient regions.

In addition, we observed that increasing the chromatic contrast parameter λ introduces several undesired artifacts into the output image. Since increasing the chromatic contrast has an impact similar to changing the illumination color (blue areas become darker while the opponent yellow areas become lighter), the impact must be limited to bounds that ensure a decent visibility of details while also maintaining the original image appearance. In the biological system proposed in (Land, 1971), the signal of a simulated neural path travels until it finds an inhibitory signal that is larger than or equal to the sequential product. The principle is that the signal is blocked rather than transformed into a negative value. In relation to our approach, this can be seen as an effective technique for slicing the intensity level. Therefore, in order to reduce the level of undesired artifacts, we constrain the decolorization results to remain in the range [l · Min(R, G, B), l · Max(R, G, B)] (l = 1 is the default value).
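The range constraint stated above amounts to a per-pixel clamp, sketched here in Python. The function name `clamp_gray` is hypothetical, and RGB values are assumed to lie in [0, 1].

```python
def clamp_gray(gray, r, g, b, l=1.0):
    """Keep a fused gray value inside [l * min(R, G, B), l * max(R, G, B)]."""
    lower = l * min(r, g, b)
    upper = l * max(r, g, b)
    return min(max(gray, lower), upper)
```

A fused value that overshoots because of a large λ is cut back to the brightest channel of the original pixel, and a value that undershoots is raised to its darkest channel, so the value is blocked at the bound rather than pushed beyond it.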

5 DISCUSSIONS AND CONCLUSIONS

This paper introduces a novel decolorization method intended to help auditory-vision substitution systems translate grayscale images efficiently into sound patterns. A challenging problem in grayscale conversion is how to interpret the scene cues that allow the scene content to be represented more accurately. The selection of a good information reduction method is fundamental for the effectiveness of image understanding and attention guidance. In comparison with other approaches (Cronly-Dillon and Persaud, 1999), our model takes advantage of the color contrast. Regardless of scene complexity, if the target object is not rendered distinctively, the participants risk locating it inaccurately. Compared with existing approaches, our translation model is able to improve the user's perception of the chromatic contrast of the image content. In poorly illuminated scenes, many decolorization methods fail to convert images accurately while increasing the contrast. Our improved decolorization method has shown promising results against standard and recent algorithms. For images with isoluminant areas, the system is able to translate the visual cues with a higher recognition rate. Even if, for the moment, visual substitution systems are far from being comparable with visual feedback, due to the limitations imposed by the input sensory channel, these systems can be designed to be suitable for basic specific tasks. At present all the available systems require a costly training period in order to obtain reliably interpreted results. In future work we aim to perform extensive tests for more complex tasks such as object localization and mobility assistance.

Figure 5: Panels: color image, our result, standard grayscale, Smith et al. 2008. The contrast enhancement approach of (Smith et al., 2008), as well as the standard grayscale conversion, risks not preserving the original color salient regions.

REFERENCES

Auvray M., Hanneton S., L. C. (2005). There is something out there: distal attribution in sensory substitution, twenty years later. Journal of Integrative Neuroscience, 4:505–521.

Capelle C., Trullemans C., A. C. V. (1998). A real-time experimental prototype for enhancement of vision rehabilitation using auditory substitution. IEEE Transactions on Biomedical Engineering, (45):1279–1293.

Cronly-Dillon, J. and Persaud, G. R. (1999). The perception of visual images encoded in musical form: a study in cross-modality information. Biological Sciences, pages 2427–2433.

Dunteman, G. (1989). Principal components analysis. Sage, Thousand Oaks, CA.

Fish, R. (1976). An audio display for the blind. IEEE Transactions on Biomedical Engineering, 23(2):144–154.

Gooch, A. A., Olsen, S. C., Tumblin, J., and Gooch, B. (2005). Color2gray: salience-preserving color removal. ACM Trans. Graph., 24(3):634–639.

Grundland, M. and Dodgson, N. A. (2007). Decolorize: Fast, contrast enhancing, color to grayscale conversion. Pattern Recogn., 40(11):2891–2896.

Hill, E., R. J. H. M. H. M. H. J. H. R. (1993). How persons with visual impairments explore novel spaces: Strategies of good and poor performers. Journal of Visual Impairment and Blindness, pages 295–301.

Itti, L., K. C. and E., N. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Patt. Anal. and Machine Intell. (PAMI).

Land, Edwin H., M. J. J. (1971). Journal of the Optical Society of America, 61(1).

Meijer, P. (1992). An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2):112–121.

Meijer, P. (1998). Cross-modal sensory streams. In Conference Abstracts and Applications, ACM SIGGRAPH.

Nemcsics, A. (1987). Color space of the coloroid color system. In Color Research and Applications.

Neumann, L., Cadik, M., and Nemcsics, A. (2007). An efficient perception-based adaptive color to gray transformation. In Computational Aesthetics.

Pun, T., Roth, P., Bologna, G., Moustakas, K., and Tzovaras, D. (2007). Image and video processing for visually handicapped people. Journal Image Video Process., 5:1–12.

Rasche, K., Geist, R., and Westall, J. (2005). Re-coloring images for gamuts of lower dimension. Comput. Graph. Forum, 24(3):423–432.

Smith, K., Landes, P.-E., Thollot, J., and Myszkowski, K. (2008). Apparent greyscale: A simple and fast conversion to perceptually accurate images and video. In EUROGRAPHICS.

y Rita, B. (1967). Sensory plasticity. Applications to a vision substitution system. Acta Neurologica Scandinavica, 43(4):417–426.
