ColEnViSon: Color Enhanced Visual Soniﬁer

A Polyphonic Audio Texture and Salient Scene Analysis

Codruta Ancuti, Cosmin Ancuti and Philippe Bekaert

Hasselt University -tUL-IBBT, Expertise Centre for Digital Media

Wetenschapspark 2, Diepenbeek, B3590, Belgium

Keywords:

Blind Navigation,Visual Saliency, Color Transformation.

Abstract:

In this work we introduce a color based image-audio system that enhances the perception of the visually

impaired users. Traditional sound-vision substitution systems mainly translate gray scale images into corre-

sponding audio frequencies. However, these algorithms deprive the user from the color information, an critical

factor in object recognition and also for attracting visual attention. We propose an algorithm that translates

the scene into sound based on some classical computer vision algorithms. The most salient visual regions

are extracted by a hybrid approach that blends the computed salient map with the segmented image. The

selected image region is simpliﬁed based on a reference color map dictionary. The centroid of the color space

are translated into audio by different musical instruments. We chose to encode the audio ﬁle by polyphonic

music composition reasoning that humans are capable to distinguish more than one instrument in the same

time but also to reduce the playing duration. Testing the prototype demonstrate that non-proﬁcient blindfold

participants can easily interpret sequence of colored patterns and also to distinguish by example the quantity

of a speciﬁc color contained by a given image.

1 INTRODUCTION

For sighted people, vision is the most efﬁcientsensory

modality to process the spatial information and thus it

dominates the other faculties of perception. Visual

impaired people compensate their deﬁciency by the

other senses. Several visual substitution systems have

been proposed in the literature. Among of them, au-

dio to sound systems manifest increasingly attention

in the last period.

Traditional sound-vision substitution systems

mainly translate the gray scale images into corre-

sponding audio frequencies. One important imple-

mentation constraint is that the common sighted per-

sons acquire and also understand almost instanta-

neously what they see. Ideally, the necessary render-

ing time of the substitution system per frame needs

to be relative short. For this reason, the informa-

tion should be displayed as distinctive as possible in a

large audio volume but also in concise intervals.

However, color improves scenes recognition by

playing a bounding role in memorial representation

(Clifford, 2004; Rossion, 2004) and also by enhanc-

ing surface segmentation and edge detection (Fine, ).

We introduce a novel vision-sound substitution sys-

tem: ColEnViSon (Color Enhanced Visual Soniﬁer)

that codes accurately and distinctively by musical in-

struments the color information of videos and images.

For rendering an image into sound our approach con-

tains several stages. Given an image we identify the

most salient regions by computing and combining dif-

ferent salient feature maps. Afterwards, the regions

of focus are reﬁned based on the image segmenta-

tion. Inspired by our visual system, we transform im-

ages by two resolution levels. The salient regions (fo-

cused) are rendered more accurately while the rest of

the image is interpreted polyphonically. The system

translates into audio a given image by converting the

values of the color pixels into sound variations of mu-

sical instruments. To facilitate the process of learning

we chose to use only ten musical instruments. There-

fore, the color space is simpliﬁed based on a prede-

ﬁned color dictionary. After image color simpliﬁca-

tion step, the initial information of luminance is pre-

served. Intensity variations are played proportional to

the musical instrument scale. Besides of the general

proposed scheme, that we believe that is more easy

and pleasing to learn comparing with the previous

work, an additional improvement is that our system

reduces the rendering time by adopting a polyphonic

approach.

566

Ancuti C., Ancuti C. and Bekaert P. (2009).

ColEnViSon: Color Enhanced Visual Soniﬁer - A Polyphonic Audio Texture and Salient Scene Analysis.

In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, pages 566-572

DOI: 10.5220/0001805105660572

 SciTePress

Finally, we compared our system with the well known

vOICe (Meijer, 1992) approach in the context of

identifying simple color patterns by users that expe-

rienced for short periods both systems. Our experi-

ments reveal very promising results for non-proﬁcient

blindfold participants that were able to easily interpret

sequences of colored patterns and also to distinguish

by example the quantity of a speciﬁc color contained

in an image.

2 RELATED WORK

Non-invasive methods for vision-substitution have

been investigated since the ’60s (Bach-y Rita P.,

1969) but un-succeeding on large scale to replace the

common accessories used by visually impaired peo-

ple. New recently developed portable and wearable

devices available on large scale better encouragecom-

plex vision substitution systems development.

Many investigations in the neural rehabilitation

domain suggest that sensory-deprived person’s abil-

ities involve connections with their damaged cerebral

areas due to the cerebral plasticity . Cerebral reorga-

nization studies involve developing auditory substitu-

tion systems (Mitchell T. V., 2007) and tactile sys-

tems (Bach-y Rita P., 2003; R. Valazquez and Main-

greaud, 2005). Non-invasive systems, the tactile-

vision substitution, propose to translate frontal im-

ages into tactile information.

In the vOICe (Meijer, 1992; Meijer, 1998) ap-

proach, the gray scale images are scanned from left to

right and then are translated into sound based on the

following rule: the pitch elevation is given by the po-

sition in the visual pattern, and the loudness is propor-

tional with the brightness, therefore white is played

loudly and black silently. Each column of a 64x64

image is rendered in about 10 ms and is represented

by a superposition of sinusoidal waves with ampli-

tudes depending on the luminance pixels.

The PSVA (Prosthesis for Substitution of Vision

by Audition) (Capelle C. and C., 1998; Arno P. and

C., 1999) is based on a raw model of the primary vi-

sual system with two resolution levels, one that cor-

responds to artiﬁcial central retina and one that corre-

sponds to simulated peripheral retina.

The way of rendering images to sound is similar

with the vOICe, but this approach is more musical us-

ing distinct sinusoidal for each pixel of the column.

Total image size is 124 pixels, 64 pixels (8x8) located

in the fovea area with higher resolution and the rest of

60 pixels are located in the peripheral artiﬁcial retina

with low resolution.

Our approach blends the features of the PSVA

with the well known vOICe system by playing more

musical the color images. Comparing the rendering

times, the vOICe takes one second to translate 4000

pixels, PSVA takes 18 seconds for 124 pixels and our

prototype renders 4096 pixels in only 4 seconds.

TheVibe (Auvray M., 2005) approach is another

implementation that converts images into sound pat-

terns. The basic components of the sound are sinu-

soidal produced by virtual placed sources. The sound

amplitude is dependent by the luminosity mean of the

pixels which corresponds to the receptive ﬁeld. The

frequency and the inter-aural disparity are determined

by the center of the coordinates of the receptive ﬁeld

pixels. The user hears a sum of all sounds produced

by all sources of the image.

The model of Cronly et al. (Cronly-Dillon J.,

1999) reduces the image information by some image

processing steps in order to include only the black

pixels. The pixels in a column deﬁne a chord, and the

horizontal lines are played sequentially as a melody.

The system is able to decompose complex images

and to obtain basic patterns (squares, circles, poly-

gons). Although the practical results demonstrate that

this model can obtain satisfactory mental images it re-

quires strong concentration of the tested persons.

Our approach has some similarities with the

vOICe system in the way of interpreting the intensity

values of the image pixels. Comparing with this ap-

proach, our system has a higher rendering resolution

by treating each pixel independently. Additionally,

our system is able to translate accurately the color

information of the images. Our system is also re-

lated with PSVA interpreting individual image pixels

in a similar way. However our approach reduces sig-

niﬁcantly the playing time by adopting a polyphonic

technique and also a shorter timing of the playing du-

ration.

3 COLOR IMAGES

INTERPRETATION BY MUSIC

On the designing process of the vision substitution

system there are some initial assumptions on tailor-

ing the vision sensory over the sound sensory. These

constrains can increase the quality and the quantity

of the transferred information of the vision sensory.

Sound segregation capacity has some similar corre-

spondences with the scene analysis (Wilson and Keil,

1999). Experiments on auditory segregation (Breg-

man, 1990) showed that an alternate sequences of

high and low frequencies tones played at different

rates inﬂuence the segregation sensation. When the

stream is played at slower rate, the listener is able to

follow the entire sequence of tones. At higher rates,

the sequence splits into two streams, one high and

ColEnViSon: Color Enhanced Visual Sonifier - A Polyphonic Audio Texture and Salient Scene Analysis

567

Figure 1: Overview of the system. First, we simplify the

color images by computing for each image pixel a cor-

responding value from a color dictionary. Based on the

color dictionary the basic color values are determined. The

brightness is blended with the basic colors of the image.

The image is scanned horizontally and the regions of 2x2

pixels are played in 4 separated musical layers.

one low pitch, being difﬁcult to follow the entire se-

quences of tones. Auditory stream segregation related

with the sound frequencies seems to follow the char-

acteristics of apparent motion in human vision sys-

tems (Strybel T. Z., 1998).

3.1 Color Perception

Our system aims to facilitate the user to perform the

scene interpretation by itself. Considering both au-

ditory and vision features we intend to interpret the

cues content as distinctive as possible. We try to fo-

cus on the image color features in order to have more

expressive low level image components of particular

patterns. The system accomplish this task by identify-

ing the gamut color for each pixel and translating the

value into corresponding distinctive instrument tones.

The variation of the color tone is direct proportional

reﬂected on the variation of the instrument scale.

The color translation involves identifying in ad-

vance the color name. Brent and Kay (Brent and Kay,

1991) claim that all the mature human languages con-

tain eleven colors names: red, purple, pink, yellow,

brown, orange, green, blue, white, black and gray. In

our approach we reduce this color set to ten values by

replacing the purple and pink with violet. The main

motivation is because their relative lighter and darker

appearance can be confused with the perception of the

red color in some cases. Brown is also a color that can

be misclassiﬁed by orange or yellow, but we prefer

to keep it because the color appears frequently in the

natural scenes. Also violet is more perceptually dis-

tinctive than pink. Color perception is subjective due

to the many inﬂuences (e.g. light color, scene context,

material properties). Naming and identifying a color

is a learned skill, once we learned the basics, the sen-

sitivity to shades is individual. In gray scale images,

most of the visual attention that comes from the color

is lost. Also the shape identiﬁcation is more distinc-

tive in color based transformation as the blending with

the background is reduced.

Similar approaches for color selection like

Berreta’s MettaPalette (Beretta, 1990) uses CIELAB

space to select harmonious color palettes.

In our preliminary tests we have tried the color

identiﬁcation by simply applying hue constrains.

CIELch derived from CIELab (Fairchild, 2005) has

been chose because is more perceptually uniform than

HSV color space. After some experiments we came

to the conclusion that a more advanced mathemati-

cal model than simply using the hue as decision fac-

tor (Belpaeme, 2002) is needed. We classify our

color space by using the Universal Color Language

and Dictionary of Names (Kelly and Judd., 1976).

Each color sample is visually labeled with a more ap-

propriate corresponding value. To validate our selec-

tion different context images are transformed until a

satisfactory result was obtained.

Figure 2: The left side table presents the mapping between

the set of basic colors and the musical instruments. In the

right side are presented the rendering times for different

tempos and number of layers.

3.2 Music Encoding

Colors and instruments are used by centuries to evoke

emotions and powerful moods. Instrument’s timbre

or the sound color is a unique quality. The same note

played by different instruments has the same pitch but

distinctive timbre. With few practice, users can rec-

ognize instruments from an orchestra. This happens

because the attack at the beginning of the note is es-

sentially to identify the timbre, therefore it is relative

easy to identify particular instruments even for play-

ing short tempos.

Our algorithm scans every image column from left

to right and compiles the musical pattern by map-

ping the pixel color intensity into musical instruments

notes. We transform lighter intensities of the same

color into notes on higher scales of the same musical

instrument. The selection of the musical instruments

is relative subjective. In Figure 2 are shown the map-

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

568

ping between the color and the musical instruments

that we have chosen.

In our experiments we played music patterns with

different durations (see Figure 2) and we observed

that by using polyphonic approach the quality of the

resulted sound is unessential compromised. We also

observed that patterns with the same color soundmore

distinctively and the repeatable small patterns sound

more pleasing avoiding the cacophony.

The scanning of the images is performed horizon-

tally. Our system is processing simultaneously a 2x2

pixels region by playing in 4 separated layers (chan-

nels). The luminosity information is rendered into

sound by vary the musical instruments among 10 oc-

taves.

For a 64x64 image as can be observed in the Fig-

ure 2 it is required an audition time of 17 seconds for

one layer and 4 seconds for four layers (channels).

4 IMAGE SIMPLIFICATION

Common modern digital image acquisition and dis-

play devices such as scanners and video digitizers

work with true-color images. These images con-

tain more than 16 millions of possible color varia-

tions and therefore are relative intractable to be stored

and transferred. Another inconvenienceappears when

true-color images are visualized on graphical devices

using CLUT (Color Lookup Table) that have a re-

duced number of concurrent displayable colors (e.g.

mobile phones).To overcome these drawbacks, some

kind of color quantization is required in order to re-

duce the amount of the information.

There has been much research on the color quan-

tization (Puzicha J., 2000; Gibson, 2001; Shyi-

Chyi Cheng, 2001). The process of color quantiza-

tion has the main disadvantage that implicates visual

artifacts. In general the color quantization is done by

separating the color space of the original images into

disjoint cells according to some criteria. The spatial

color quantization (Puzicha J., 2000) is one of the

well known methods. However due to the fact that it

fails to reduce a large number of colors and also due

to its expensive computation, that recommends to be

used mostly for ofﬂine processing, we adopt a simpler

strategy.

We opted to built our color image simpliﬁcation

on the early work of Herbert (Heckbert, 1982). We

adapt this simple algorithm by using a reference color

dictionary. As mention so far, we use a small set of

instruments to render a 64x64 size color image. Ba-

sically, our algorithm tries to emulate the human per-

ception when translating the color image information

and afterwards to transform the values of the image

pixels into pleasing and distinctive musical sounds.

An important aspect is that the initial illumination

conditions are transferred in the ﬁnal image, too (see

Figure 3).

Figure 3: From left to right: Initial images, limitation of the

hue based approach, dictionary based simpliﬁcation of the

initial images, basic colors, ﬁnal rendered images (used in

our approach) after the intensity was blended with the basic

colors.

Additionally, our system includes a function of nam-

ing the color that can be very useful also for color

visual impaired users. However, the general expected

result is that a blind person to be able to name easily

the color by hearing the translated sound. To address

this problem a color classiﬁcation scheme is needed

by our approach. We use the standard NBS/ISCC

(Kelly and Judd., 1976) dictionary of colors. This

dictionary deﬁnes 267 centroids in the color space, a

number enough to be easily learned but large enough

to make the distinctions needed for many applica-

tions.

First, each pixel in the original image is mapped

to its best corresponding value of the color-map. The

nearest-neighbor search method is applied for ﬁnding

the most representative color. This exhaustive search

(Heckbert, 1982) decides the matched color by mini-

mizing a metric distance (e.g. Euclidian). For small

saturation values we reduce the inﬂuence of the illu-

mination value by a tempering parameter. To speed

up the execution time we initially create a sorted color

vector that contains the entire dictionary, the separa-

tion information and the vector of pre-computed val-

ues. Finally, each pixel of the image is linked to the

corresponding component of the dictionary that has

been previously assigned visually to one of the 10 col-

ors of the basic set. Each instrument plays a variation

of ten octaves that emulates the color light intensity.

Some results obtained in our experiments are shown

in the Figure 3.

Our process does not perform any dithering in

order to not introduce additional illusionary col-

ors. Comparing with the classical color quantization

methods our approach does not impose a limit for the

number of the matching colors. The number of ba-

sic colors is equivalent with the number of instru-

ments and for more experienced users this number

can be increased by introducing several intermediate

ColEnViSon: Color Enhanced Visual Sonifier - A Polyphonic Audio Texture and Salient Scene Analysis

569

colors/instruments (e.g. Yellowish Green, Yellowish

Red).

100

150

200

250

300

100

150

200

250

300

100

150

200

250

300

Figure 4: Salient maps. From left to right: initial im-

age, color salient map, intensity salient map and orientation

salient map.

5 IMAGE ATTENTION ANALYSIS

Digital images contains many details relative easy

to interpret by the human visual system. The naive

solution to translate directly into the sound the entire

image, may create confusions and misunderstandings

due to the important amount of the information

contained in the images. We propose a method

that emphasis only the important regions of the

images. What image regions are more important is

a relative topic that involved important research in

the recent years by the cognitive and computer vision

communities.

Human visual system is attracted differently

by the objects or parts of a given scene. Even if

for humans ﬁnding an particular object in a scene

is relative trivial for machines, with all the recent

progress, this still represents a difﬁcult task. When

examining an image, only certain objects are seen

as important. This way of perception of our visual

system requires good understanding of the image

semantics. An effective solution is to identify salient

region of a given image.

The main heuristic approach of visual saliency

originates from the early studies of Neisser (Neisser,

1964). His model consists in two main stages. Firstly

,in the pre-attentive stage, the local feature points of

the images are extracted. These ﬁltered locations are

mainly characterized by irregularities of the image

contrast. Secondly, in the attentive stage, the feature

points are related and grouped by their properties.

In this paper we extract salient regions by an

hybrid approach that blends the computed salient

map with the global information of the images. First,

we identify and classify the most salient regions of

a given image. For this task we derive our model

from the approach of Itti et all (Itti and E., 1998).

Then, the images are simpliﬁed by a segmentation in

order that users to be able to comprehend the entire

meaningful region that was selected.

Saliency Model. We use a bottom-up attention sys-

tem based on the Itti et al. (Itti and E., 1998) ap-

proach. The system was extended recently (Walther

and Kochb, 2006) demonstrating good results for

modeling attention to proto objects. Our choice is

motivated by the biological plausible steps that are

embraced by this model. Three feature maps are ex-

tracted and blend in a ﬁnal salient map (see Figure 4).

An image pyramid is built by convolving the image

with a Gaussian kernel and decimated by a factor of

two. For each pyramid level an averaging of the nor-

malized color channels values is performed in order

to obtained the intensity feature map. The color map

is computed from the RG and BY color opponencies.

This operation is relative similar with how human’s

retinal ganglion cells are processing the information.

Local orientation map is obtained by convolving

with oriented Gabor ﬁlters correspondingto four main

directions (0

◦

, 45

◦

, 90

◦

, 135

◦

). The Gabor ﬁlters are

associated with the functions of the neurons in the pri-

mary visual cortex.

Finally, a winner-take-all procedure is employed

for identifying the most salient image region (see the

right side of the Figure 5).

Figure 5: The left side picture displays the segmentation

result obtained after the mean shift method was employed.

In the right side of the ﬁgure is shown the original image

with the emphasized edges and also the most salient regions

(red, blue, yellow) extracted by our approach.

Reﬁned Salient Region. Following the observation

that the extracted salient regions represent only a part

of the object that is focused, images need also to be

partitioned in meaningful regions. In general the ex-

tracted salient regions are not related with an entire

object contained by the image.

Image segmentation is a well studied topic. Se-

lecting the optimal segmentation may be a difﬁcult

task. In general the searching space of possible pixel

groups is very large being essential to use a sub-

optimal search to make the problem more tractable.

We opted for the well known mean shift method that

provides a clean and robust formulation.

The technique was introduced by Comaniciu et al.

(Comaniciu and Meer, ; Comaniciu and Meer, 2002)

and has become probably the most widely-used tech-

nique in computer vision even if several recent studies

(Felzenszwalb and Huttenlocher, 2004; Wang et al.,

2004; Liu et al., 2008) attempt to improve it. Mean

shift segmentation is mathematically related with the

bilateral ﬁltering (Tomasi and Manduchi, ) and con-

sists in two main steps. First, in the ﬁltering stage,

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

570

the image information is smoothed but conserving

the boundaries between regions. Second, the ﬁltered

points are clustered by a single linkage clustering.

In our experiments the two parameters values corre-

sponding to the radius of the kernel for the spatial and

color features are set to 7 and 15, respectively.

After segmentation (see Figure 5), the selected

squared image region needs to include all the im-

age segments that overlaps the winning salient region.

Due to the fact that our system renders images of

64x64 size, larger images or larger salient regions are

downsampled in order to be compatible with the sys-

tem requirements.

6 DISCUSSION AND RESULTS

As was presented, our system renders color images of

64x64 size into sound patterns that are played by a

set of musical instruments. After the salient regions

are extracted, the color images are simpliﬁed by com-

pressing the color space using a predeﬁned color dic-

tionary. The image is scanned horizontally and the

algorithm is processing regions of 2x2 pixels playing

4 separated music layers. The brightness information

is rendered into sound by varying the octaves of the

musical instruments.

In this early stage of our prototype we propose

a simple validation procedure comparing the results

of our system with the well known vOICe (Meijer,

1992) system. Validation of a vision-sound system is

not a trivial task and in general needs long training

periods in order that users to accustom to the system

functionalities and reactions. Experience and learning

are two main characteristics that allow human beings

to assimilate and understand the information from the

environment.

The test consists in verifying the ability of the vol-

unteers to recognize and to associate basic image fea-

tures of a given database representing the audio trans-

lation of several color images. Our database contains

color images of 24 different national ﬂags (see Fig-

ure 6).Each participant had a training period of 30

minutes to get used with the responses of the systems

when images from the database are processed individ-

ually. Afterwards, a number of ten images have been

randomly selected. For each selected ﬂag the volun-

teer was asked to listen the translation into sound and

then to answer to several questions. The questionnaire

contains the following questions:

Q1: Does the ﬂag contain vertical and/or horizontal

color stripes?

Q2: How many distinct colors can be recognized?

Q3: Does the ﬂag contain a national coat of arms?

Q4: Given a reference color pattern and its audio

Figure 6: 64x64 images of different national ﬂags analyzed

in our experiments.

translation, does this color appear in the selected ﬂag

image?

Four volunteers have been asked to ﬁll the question-

naire. One of the volunteer was a researcher but from

a non-IT area. Two were undergraduate students and

one was aged 50-60 years.

The Figure 7 presents the averaged results ob-

tained after every volunteer answered to every ques-

tion. For the ﬁrst and the third questions the results of

the both systems are relative similar because the color

does not play an important role. On the other hand, as

we expected, the results for the second and the fourth

questions our system performs clearly better than the

vOICe system. Even if without long training periods

of the users the conclusions are relative subjective,

the obtained results are encouraging and we believe

that our system has a high potential to transfer opti-

mally the color information into pleasing and distinc-

tive sound frequencies.

1 2 3 4

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Question number

Answer correctness (%)

Our system

vOICe

Figure 7: Comparative results between our system and

vOICe obtained after volunteers answered the question-

naire.

7 CONCLUSIONS

In this paper we introduced a system that assists visu-

ally impaired people to detect cues in the color images

by sound translation. The system has a bottom-up ap-

proach that identiﬁes and focuses salient regions that

are interpreted into polyphonic sounds generated by

musical instruments.

Comparing with existing approaches, our system

emphasis the user perception over the general image

content. The basic elements are played separately in

distinctive patterns and the color associated with the

musical instruments produces more pleasing and dis-

tinctive sounds than previous approaches. An addi-

tional important feature of our system is that the play-

ColEnViSon: Color Enhanced Visual Sonifier - A Polyphonic Audio Texture and Salient Scene Analysis

571

ing time is signiﬁcantly reduced by using polyphonic

approach while still preserving the image details.

Future work will aim at optimizing the code for ad-

ditional speed up for mobile devices (e.g. PDA). Ad-

ditional work is required to determine the optimal se-

lection of the musical instruments.

REFERENCES

Arno P., Capelle C., W. D. M. C. A. M. and C., V. (1999).

Auditory coding of visual patterns for the blind. In

Perception, volume 28(8), pages 1013–1029.

Auvray M., Hanneton S., L. C. (2005). There is some-

thing out there: distal attribution in sensory substitu-

tion, twenty years later. Journal of Integrative Neuro-

science, 4:505–521.

Bach-y Rita P., Collins C., S. F. W. B. S. L. (1969). Visual

substitution by tactile image projection. In Nature,

volume 221, pages 963–964.

Bach-y Rita P., S. W. K. (2003). Sensory substitution and

the human-machine interface. Trends in Cognitive Sci-

ences, 7(12):541–546.

Belpaeme, T. (2002). Factors inﬂuencing the origins of

color categories. PhD Thesis, Artiﬁcial Intelligence

Lab, Vrije Universiteit Brussel.

Beretta, G. (1990). Color palette selection tools. The Soci-

ety for Imaging Science and Technology.

Bregman, A. (1990). Auditory scene analysis. MIT Press,

Cambridge, MA.

Brent, B. and Kay, P. (1991). Basic color terms: their uni-

versality and evolution. Berkeley: University of Cali-

fornia Press.

Capelle C., Trullemans C., A. P. and C., V. (1998).

A real-time experimental prototype for enhance-

ment of vision rehabilitation using auditory substitu-

tion. IEEE Transactions on Biomedical Engineering,

45(10):1279–1293.

Clifford, C.W.G., H. A. P. J. (2004). Rapid global form

binding with loss of associated colors. Journal of Vi-

sion, 4:1090–1101.

Comaniciu, D. and Meer, P. Robust analysis of feature

spaces: color image segmentation. In In Proc of Com-

puter Vision and Pattern Recognition (CVPR ’97).

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust

approach toward feature space analysis. IEEE Trans.

Pattern Anal. Mach. Intell., 24(5):603–619.

Cronly-Dillon J., Persaud K., G. R. (1999). The perception

of visual images encoded in musical form: a study

in cross-modality information. Biological Sciences,

pages 2427–2433.

Fairchild, M. D. (2005). Color appearance models, 2nd ed.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efﬁcient

graph-based image segmentation. Int. J. Comput. Vi-

sion, 59(2):167–181.

Fine, I., M. D. B. G. Surface segmentation based on the lu-

minance and color statistics of natural scenes. Journal

of the Optical Society of America (2003).

Gibson, S. Harvey, R. (2001). Morphological color quanti-

zation. Proc IEEE Conf. on Comp. Vision and Pattern

Recog.

Heckbert, P. S. (1982). Color image quantization for frame

buffer display. ACM SIGGRAPH.

Itti, L., K. C. and E., N. (1998). A model of saliency-based

visual attention for rapid scene analysis. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence

(PAMI), 20(11):1254–1259.

Kelly, K. L. and Judd., D. B. (1976). Color: Universal lan-

guage and dictionary of names. National Bureau of

Standards, Spec. Publ. 440.

Liu, T., Zhou, H., Lin, F., Pang, Y., and Wu, J. (2008). Im-

proving image segmentation by gradient vector ﬂow

and mean shift. Pattern Recogn. Lett., 29(1):90–95.

Meijer, P. (1992). An experimental system for auditory im-

age representations. IEEE Transactions on Biomedi-

cal Engineering, 39(2):112–121.

Meijer, P. (1998). Cross-modal sensory streams. In Confer-

ence Abstracts and Applications, ACM SIGGRAPH.

Mitchell T. V., M. M. T. (2007). How vision matters for

individuals with hearing loss. In Informa Healthcare,

volume 46(9), pages 1499–2027.

Neisser, U. (1964). Visual search. Scientiﬁc American,

210(6):94–102.

Puzicha J., Held M., K. J. B. J. F. D. (2000). On spatial

quantization of color images. IEEE Transactions on

Image Processing, 9(4):666–682.

R. Valazquez, E. E. Pissaloux, J. C. G. and Maingreaud,

F. (2005). Walking using touch:design and prelimi-

nary prototype of a noninvasive eta for the visually

impaired. In in Proceedings of the the Medicine and

Biology 27th Annual Conference.

Rossion, B., P. G. (2004). Revisiting snodgrass and van-

derwarts object pictorial set: The role of surface detail

in basic-level object recognition. Perception, 33:217–

236.

Shyi-Chyi Cheng, C.-K. Y. (2001). A fast and novel tech-

nique for color quantization using reduction of color

space dimensionality. Pattern Recognition Letters,

22(8):845–856.

Strybel T. Z., M. M. L. (1998). Auditory apparent motion

between sine waves differing in frequency. Percep-

tion, 27(4):483–495.

Tomasi, C. and Manduchi, R. Bilateral ﬁltering for gray and

color images. In In IEEE ICCV ’98.

Walther, D. and Kochb, C. (2006). Modeling attention

to salient proto-objects. Neural Networks, 19:1395–

1407.

Wang, J., Thiesson, B., Xu, Y., and Cohen, M. F. (2004).

Image and video segmentation by anisotropic kernel

mean shift.

Wilson, R. A. and Keil, F. C. (1999). The mit encyclopedia

of the cognitive sciences. Cerebral Cortex: Top-Down

Processing in Vision.

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

572