The Role of Attention in Understanding Spatial Expressions under the Distractor Condition

Tatsumi Kobayashi, Asuka Terai and Takenobu Tokunaga

Department of Computer Science, Tokyo Institute of Technology

Tokyo Meguro

Ookayama 2-12-1, 152-8552 Japan

Global Edge Institute, Tokyo Institute of Technology

Tokyo Meguro

Ookayama 2-12-1, 152-8552 Japan

Abstract. To develop a computational model of understanding spatial expressions, various factors should be taken into account. We have been exploring the relations between the goodness-of-fit of spatial terms and various geometric factors such as the object's size, the distance between objects and the observer's viewpoint. Although the dual-object relation between the located and reference objects can be handled with relatively simple models, introducing a distractor object requires a model that considers further factors, such as attention to the objects. Based on our experiment using Japanese topological and projective terms, this paper proposes a computational model to estimate the goodness-of-fit of spatial terms which incorporates an attention model for a distractor object. The proposed model was evaluated by using our experimental data.

1 Introduction

Elucidating the human cognitive mechanism for understanding spatial expressions is important not only for cognitive science and linguistics but also for engineering, where broad applications are expected in fields such as human-robot interaction. There have

been numerous attempts to tackle this problem by proposing computational models and

by conducting psychological experiments. Most of them, however, estimate goodness-

of-ﬁt functions of spatial terms in limited combinations of static visual conﬁgurations

and language expressions. Analyzing the nature of each spatial term at a perceptual level

with limited conditions would be a good starting point. In fact, effective methods using

a spatial template to represent the range of spatial terms have been established [1, 2],

and several computational models have also been proposed [3–6]. However, to realise

applications for complex real world problems, the study of spatial cognition still needs

more progress.

Kobayashi T., Terai A. and Tokunaga T. (2008).
The Role of Attention in Understanding Spatial Expressions under the Distractor Condition.
In Proceedings of the 5th International Workshop on Natural Language Processing and Cognitive Science, pages 74-83
DOI: 10.5220/0001738200740083
SciTePress

To tackle realistic spatial cognition problems, many issues should be solved. For instance, in a dialogue involving spatial relations, differences of visual information and knowledge between dialogue participants must be considered. The dialogue history should be taken into account as well. The computational model would also have to cover the diversity and complexity of geometric factors in the environment. When considering the functional factors between objects [7], the dialogue topic, participants' intentions and plans, common sense and domain knowledge would be necessary. Obviously there are many situations which we would be unable to resolve solely by compiling individual computational models of spatial terms. If we were to tackle all issues at the same time without an appropriate research strategy, the goal to build a realistic computational model would be unachievable. In order to get a step further toward computational models of spatial terms which are applicable to the real world, we focus on exploring the following problems [8,9].

Problem 1. Although the computational models in the past research include several

parameters for ﬁtting real data, criteria to decide the appropriate values for those pa-

rameters are not always clear. Through our experiments, we found some of the clues

that can be used to adjust the parameters.

Problem 2. Most of the past computational models dealt with a simple conﬁguration

consisting of two objects: a located object and a reference object. Several studies have

also pointed out that a distractor object changes the goodness-of-ﬁt of projective terms

(e.g. left) and topological terms (e.g. near) [10–12]. We found that the effect of a dis-

tractor object depends on certain geometric factors in the visual presentation.

Problem 3. Kelleher and Kruijff [13] pointed out the difference in cognitive load between topological terms and projective terms. That is, the cognitive load of understanding topological terms is less than that of projective terms, since projective terms require setting an appropriate reference frame for their interpretation. Thus, they claimed that topological terms are preferable to projective terms. However, our intuition says that such a simple solution cannot always be accepted in reality; that is, depending on the combination of geometric factors, left may be dominant in some situations, and near may be dominant in others. Regarding this cognitive load issue, we analysed the goodness-of-fit of spatial terms under variation of geometric factors.

This paper proposes an extension of an existing computational model estimating the

goodness-of-ﬁt of spatial terms. The proposed model is based on our ﬁndings from the

experiments which were conducted to explore the nature of spatial terms corresponding

to specific attention patterns in the visual scene. In particular, it focuses on modeling the distractor object's effect for the Japanese topological terms tikai (near) and tôi (far from) and the projective term hidari (left). In the following sections, we first

explain our previous experiments, and then point out the importance of attention in the

computational model of spatial terms. Subsequently, incorporating the attention factor,

we propose a new model. Then we give a general discussion before concluding the

paper and looking at the future work.

2 Finding a Bridge to the World

We conducted experiments to investigate the effect of a distractor object on the goodness-of-fit of spatial terms relating two objects: a reference object and a located object [8].

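[Figure 1: the LO (red), DO (blue) and RO (green) arranged on a line marked at positions 0, 6, 12, 18, 24, 30, 36 and 42, with the absolute position of the DO, the absolute position of the LO and the relative position of the DO indicated.]

Fig. 1. An arrangement of objects in the experiment (LO at one of the l positions, DO at one of the d positions, and a large LO).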

2.1 Experiments with Japanese Spatial Terms

In the experiments, subjects were sequentially presented 3-D CG pictures together with

sentences describing the relationship between the objects in the picture. The spatial

terms used for the experiment were two Japanese topological terms: tikai (near) and tôi (far from) and one Japanese projective term: hidari (left). One of the following Japanese

sentences was displayed above each picture.

“Akai bôru ha midori no bôru {no tika-ku ni / kara tô-ku ni / no hidari ni} arimasu”

In English, they mean “The red ball is {near / far from / to the left of } the green

ball.” As shown in Fig. 1, the picture shows three objects: the located object (LO), the

distractor object (DO) and the reference object (RO). They are arranged on the same

line with the RO being ﬁxed at the origin. Both the RO and DO are medium-size balls

(diameter 4), and the LO is one of three sizes: small, medium and large. The colours

of the LO, DO and RO are red, blue and green respectively. We have three conditions of the LO position ($l_1$, $l_2$ and $l_3$), three conditions of the LO size (diameters: 2, 4 and 8), five conditions of the DO position (positions $d_1$ to $d_4$ and the no-distractor case) and three conditions of spatial terms (tikai (near), tôi (far from) and hidari (left)). This makes a total of 135 stimuli (= 3 × 3 × 5 × 3), which were presented randomly to the

subjects on a computer display. Subjects were asked to provide a rating on how well the

sentence described the relationship between the LO and RO by selecting one of nine

buttons from 1 (not relevant) to 9 (most relevant).
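The factorial design described above can be sketched in a few lines. This is only an illustration: the position values are assumptions read off the axes of Fig. 2 (LO positions 12, 24, 36; DO positions 6, 18, 30, 42), and the names `build_stimuli`, `LO_POSITIONS` and so on are hypothetical, not taken from the experiment software.

```python
from itertools import product
import random

# Condition values are assumptions read off the axes of Fig. 2,
# not a published stimulus list.
LO_POSITIONS = (12, 24, 36)
LO_DIAMETERS = (2, 4, 8)               # small, medium, large
DO_POSITIONS = (6, 18, 30, 42, None)   # None stands for the no-distractor case
TERMS = ("tikai", "toi", "hidari")     # near, far from, left


def build_stimuli(seed=0):
    """Return every condition combination in a reproducible random order."""
    stimuli = [
        {"lo_position": lo, "lo_diameter": size, "do_position": do, "term": term}
        for lo, size, do, term in product(
            LO_POSITIONS, LO_DIAMETERS, DO_POSITIONS, TERMS
        )
    ]
    random.Random(seed).shuffle(stimuli)
    return stimuli


stimuli = build_stimuli()
print(len(stimuli))  # 135
```

Of the 135 stimuli, 27 (= 3 × 3 × 3) have no distractor; these are the ones that supply the baseline ratings used later.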

The ANOVA results of the experiment are shown in Fig. 2; (a) shows the subjects’

mean ratings of each spatial term without DO; (b) shows the interaction between the

spatial terms and the DO’s absolute position from the RO (p < .001); (c) shows the

interaction between the spatial terms and the DO’s relative position from the LO (p <

.001); (d) shows the interaction between the spatial terms and the LO size (p < .001).

In (b), (c) and (d), the vertical axis represents the mean difference of ratings between

with DO and without DO conditions of the same subject. In addition, the analysis of the mean ratings of the 135 stimuli indicated that hidari (left) was highest-rated at the $l_1$ and $l_2$ positions of the LO, and tôi (far from) at the $l_3$ position of the LO. Tikai (near) was second highest-rated at the position $l_1$ of the LO. Detailed observations are as follows:

[Figure 2: four panels, each with curves for hidari (left), tôi (far from) and tikai (near). (a) Mean rating against LO position (12, 24, 36). (b) Mean difference of rating between w/ DO and w/o DO against DO absolute position (6, 18, 30, 42). (c) Mean difference of rating between w/ DO and w/o DO against DO relative position (-30 to 30). (d) Mean difference of rating between w/ DO and w/o DO against LO size (2, 4, 8).]

Fig. 2. Results of ANOVA on Japanese spatial terms.

1. In the case of tôi (far from), the subjects' rating peaks at the leftmost position and decreases linearly toward the region near the RO. In other words, the subjects' attention is on the region between the RO and the left boundary of the picture.

2. Tikai (near) shows almost the opposite tendency to tôi (far from). Its rating decreases linearly as the LO moves away from the RO. It turns out that the left boundary is used as a kind of reference object in terms of nearness.

3. In the case of hidari (left), the subjects' rating decreases as the LO moves away from the RO; however, the left boundary is not considered as a reference object.

One conclusion is that the computational model must take into account the boundary (i.e. the leftmost position in this case) for tikai (near) and tôi (far from), even though it is not explicitly stated in the linguistic expressions. Regarding the aforementioned Problem 1, this suggests the possibility of utilising the boundary as information to fit the model to the visual scene. In addition, the properties listed below with respect to the DO's effect were found. Here, $F_n$ ($n = 1 \sim 4$) are properties of tôi (far from), $N_n$ ($n = 1 \sim 3$) are properties of tikai (near), and $L_n$ ($n = 1 \sim 3$) are properties of hidari (left).

($F_1$) The closer the DO is to the RO, the better the rating.

($F_2$) When the DO is located between the LO and RO, the rating improves.

($F_3$) When the LO is larger than the DO, the rating improves.

($F_4$) When the DO is located on the far side of the LO from the RO, the rating decreases.

($N_1$), ($L_1$) The closer the DO is to the RO, the lower the rating.

($N_2$), ($L_2$) When the DO is located between the LO and RO, the rating decreases.

($N_3$) When the LO is larger than the DO, the rating decreases.

($L_3$) The size of the LO has little influence on the effect of the DO.

These properties summarise tendencies of the DO's effect for each spatial term, which could provide a partial solution to Problem 2 raised in section 1. At the same time, they suggest circumstances which cannot be handled simply by using a prioritised list of spatial terms based on human cognitive load, as suggested in Problem 3.

2.2 Comparison with the Relative Proximity Model

We confirmed in [9] that our experimental results for tikai (near), described in the previous section, could not be explained by Kelleher's Relative Proximity Model (RPM) [11] for the English spatial term near. We briefly provide the verification result and what we learned from it. The RPM calculates $P_{rel}(L, x)$, the goodness-of-fit (relative proximity value in Kelleher's original paper) of the object $L$ at position $x$, by subtracting the highest absolute proximity value given by the other objects at position $x$ from $P_{abs}(L, x)$, the absolute proximity value of the object $L$, as shown in equation (3).

$$P_{abs}(L, x) = (1 - dist_{norm}(L, x))\, S(L) \tag{1}$$

$$S(L) = S_{vis}(L) + S_{disc}(L) \tag{2}$$

$$P_{rel}(L, x) = P_{abs}(L, x) - \max_{\forall L' \neq L} P_{abs}(L', x) \tag{3}$$

$P_{abs}(L, x)$ is adjusted by the salience parameter consisting of visual salience $S_{vis}(L)$ and discourse salience $S_{disc}(L)$, and $dist_{norm}(L, x)$ is the normalised distance to the position $x$ from the object $L$.
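Equations (1) and (3) can be illustrated with a short sketch. It is not Kelleher's implementation: the function names are hypothetical, the salience is passed in as a single number rather than computed from visual and discourse salience, and returning the absolute proximity unchanged when there is no competing object is an assumption made here for convenience.

```python
def absolute_proximity(dist_norm, salience=1.0):
    """Equation (1): (1 - normalised distance) scaled by the landmark's salience."""
    return (1.0 - dist_norm) * salience


def relative_proximity(target, others):
    """Equation (3): the target's absolute proximity minus the strongest competitor's.

    `target` and each element of `others` are (dist_norm, salience) pairs
    evaluated at the same position x. With no competitors, 0 is subtracted
    (an assumption of this sketch).
    """
    p_target = absolute_proximity(*target)
    strongest = max((absolute_proximity(*o) for o in others), default=0.0)
    return p_target - strongest


# A landmark at normalised distance 0.2 competing with one at 0.5, equal salience.
print(relative_proximity((0.2, 1.0), [(0.5, 1.0)]))
```

With equal salience the result depends only on the two distances, which is the situation examined in Table 1, where the RO and DO have the same size.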

Table 1 shows the comparison between the subjects' rating in our experiment and the results computed by the RPM in the case that the DO position is 18 and the LO size is small (diameter = 2). The RO's absolute proximity (a) is the subjects' mean rating without the DO in our experiment, and the RO's relative proximity (d) is the subjects' mean rating with the DO. The DO's absolute proximity (b) is calculated by linear interpolation assuming that the DO position 18 has rating 9 and both ends of the picture have rating 1. The linearity of the goodness-of-fit for near was confirmed from the data as shown in Fig. 2 (a). In the experiment, since the RO and DO were the same size, the salience parameter S(L) was set to 1. Based on these conditions and the assumption that the values in column (a) minus the values in column (b) equal the values of the RPM's equation (3), we calculated the relative proximity of the RO (column (c)) at positions 12, 24 and 36. For comparison, we also calculated the values of column (a) minus column (d) by considering the subjects' ratings as the DO's effect (column (e)).

Table 1. Comparison between our experiment and the RPM (DO position=18, LO size=small).

LO's       (a) LO abs prox   (b) DO abs prox   (c) (a)-(b)   (d) LO rel prox   (e) DO abs prox
position   w/o DO (Exp)      w/o DO (RPM)      (RPM)         w/ DO (Exp)       (a)-(d) (Exp)
12         6.929             7.0               -0.071        6.286             0.643
24         3.857             7.667             -3.810        3.429             0.428
36         2.214             5.0               -2.786        2.0               0.214
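The arithmetic behind Table 1 can be checked with a small sketch. One caveat is important: the text does not state where the two ends of the picture lie. The values -6 and 54 used below are assumptions, chosen only because they reproduce column (b); the function names are likewise hypothetical.

```python
def interpolate_rating(x, peak_x=18.0, left_end=-6.0, right_end=54.0,
                       peak_rating=9.0, end_rating=1.0):
    """Piecewise-linear rating: peak_rating at peak_x, end_rating at both ends.

    left_end and right_end are assumed values, not given in the text.
    """
    end = left_end if x <= peak_x else right_end
    fraction = (x - end) / (peak_x - end)
    return end_rating + (peak_rating - end_rating) * fraction


# Experimental mean ratings copied from Table 1.
EXP_WITHOUT_DO = {12: 6.929, 24: 3.857, 36: 2.214}   # column (a)
EXP_WITH_DO = {12: 6.286, 24: 3.429, 36: 2.0}        # column (d)


def table_rows():
    """Return (position, b, c, e) with c = (a) - (b) and e = (a) - (d)."""
    rows = []
    for pos in sorted(EXP_WITHOUT_DO):
        a = EXP_WITHOUT_DO[pos]
        b = interpolate_rating(pos)
        rows.append((pos, b, a - b, a - EXP_WITH_DO[pos]))
    return rows


for pos, b, c, e in table_rows():
    print(f"{pos:>2}  (b)={b:.3f}  (c)={c:.3f}  (e)={e:.3f}")
```

Under these assumed end points the interpolated values agree with column (b) to three decimals, and columns (c) and (e) follow by subtraction.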

It is obvious that the values in column (c) computed by the RPM are quite different from the experimental result (column (d)). The DO's absolute proximity based on the RPM (column (b)) is at least ten times larger than the DO's effect in the experiment (column (e)). We think that assuming the same salience parameters for both the RO and DO causes the same result as that for Kelleher et al. [11]. Since the LO's relative proximity is relatively high as shown in column (d), the DO's salience should be extremely low according to equation (1). We presume the problem here to be the use of the DO's size (= 1) directly for the DO's salience parameter $S_{vis}(L)$, meaning equation (2) should be reconsidered. In addition, our experimental results suggest the need to consider attention on specific parts of space when building a computational model of spatial terms that accounts for the DO.

3 Attention-based Computational Model with a Distractor Object

Based on our experiment, we introduce a computational model estimating the goodness-of-fit of spatial terms with a distractor object, and evaluate it with our experimental data.

3.1 A Computational Model

We propose a model $r_{TOTAL}$ representing the spatial term's goodness-of-fit by the sum of the dual-object relation model $r_{LO}$ and the DO's effect $r_{DO}$. Here, $x_{LO}$ is the distance between the RO and LO, and $x_{DO}$ is the distance between the RO and DO. We normalise the distance between the RO and the end point (the boundary) of the scene to 1.

$$r_{TOTAL} = r_{LO} + r_{DO} \tag{4}$$

$$r_{LO} = p\,x_{LO} + \theta_{LO}\,s_{LO} + C_{LO} \tag{5}$$

= θ

+ f

) + C

(6)

$r_{LO}$ consists of the LO's position effect $p\,x_{LO}$, the LO's size effect $\theta_{LO}\,s_{LO}$ and the constant $C_{LO}$. $\theta_{LO}$ is the LO's salience parameter which adjusts the ratio of the LO size to the RO size, $s_{LO}$. $r_{DO}$ consists of an attention distribution $f_1$ in the vicinity of the DO, a monotonic attention distribution $f_2$ over the area from the RO to the farthest point, an asymmetric attention distribution $f_3$ of both sides of the DO, the DO size's effect $\theta_{DO}\,s_{DO}$ and a constant $C_{DO}$. $\theta_{DO}$ is the DO's salience parameter. We assume that $s_{DO}$ is represented by the ratio of the LO size to the DO size because the DO's effect was affected by the LO size in our experiment. $\theta_{DO}\,s_{DO}$ reflects the tendency shown in Fig. 2 (d).

= e

−x

)

}

(7)

$$f_2 = 1 - x_{DO} \tag{8}$$

$$f_3 = \frac{1}{1 + e^{\beta(x_{DO} - x_{LO})}} \tag{9}$$

Here, $f_1$ is an effect of an interaction between the LO and DO. The closer the DO is to the LO, the larger the effect; the further the DO is from the LO, the more the effect gradually decreases. $f_2$ is an effect determined by the DO's distance from the RO. That is, it includes the effect of the DO absolute position as shown in Fig. 2 (b), where $\theta_{DO}$ defines the slope of the curve of each spatial term. On the other hand, $f_3$ is an asymmetric effect of the DO, depending on whether the LO is on the RO side of the DO or on the opposite side. In particular, for tôi (far from), with $\theta_{DO}$ and $C_{DO}$, $f_3$ could provide a negative effect when the LO is between the RO and DO, but a positive effect when the LO is between the DO and the farthest point. Using these fundamental attention elements, equation (6) represents an effect of the DO relative position as shown in Fig. 2 (c).
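The monotonic and asymmetric components can be illustrated numerically. The sketch below assumes that equation (8) is 1 minus the DO's normalised distance from the RO and that equation (9) is a logistic function of the difference between the DO and LO distances, with the LO beyond the DO giving values near 1; this argument order is a reading of the description above, not a confirmed form. The function names are hypothetical and β = 70 is the constrained value from Table 2.

```python
import math


def f2(x_do):
    """Monotonic component: decreases linearly with the DO's normalised distance."""
    return 1.0 - x_do


def f3(x_lo, x_do, beta=70.0):
    """Asymmetric component, assumed to be a logistic step centred on the DO.

    Returns values near 0 when the LO lies between the RO and the DO and
    values near 1 when the LO lies beyond the DO (assumed orientation).
    """
    return 1.0 / (1.0 + math.exp(beta * (x_do - x_lo)))


# DO at normalised distance 0.4; LO on either side of it and at the same point.
for x_lo in (0.2, 0.4, 0.6):
    print(x_lo, f3(x_lo, 0.4))
```

With β as large as 70 the logistic term is close to a step function, so in practice it only distinguishes which side of the DO the LO is on.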

Table 2. Model parameter estimation and model evaluation.

                          hidari    tikai     tôi
Parameters                (left)    (near)    (far from)
r_LO:   p                 -0.232    -1.24      1.226
        θ_LO               0.027     0.131    -0.193
        C_LO               0.966     0.966     0.039
r_DO:   α                  0.02      0.015     0.003
        β                 70.0      70.0      70.0
        γ                 -2.0      -0.05     -0.05
        θ_DO              -0.015    -0.3       0.23
        C_DO              -0.02     -0.02      0.05
Correlation (r_LO)         0.993     0.987     0.986
Correlation (r_DO)         0.772     0.420     0.931

3.2 Simulation and Discussion

We performed a nonlinear regression analysis on our experimental data to estimate the parameters of $r_{LO}$ and $r_{DO}$. Table 2 summarises the resultant parameters. $r_{LO}$'s parameters are estimated from the subjects' mean ratings without the DO. $r_{DO}$'s parameters are estimated from the difference between the ratings with and without the DO of each subject. For $\beta$ and $C_{DO}$, the values shown in Table 2 were given as constraints.
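As an illustration of how the dual-object part of the model behaves with the Table 2 estimates, the following sketch evaluates the linear form of equation (5). Several things are assumptions here: the second and third rows of the first parameter block are taken to be the size coefficient and the constant, the size ratio is the LO diameter divided by the RO diameter (4), and the output stays on the model's own scale because the mapping to the 1 to 9 rating scale is not specified in the text. All names are hypothetical.

```python
# Parameter values copied from Table 2; the row interpretation is an assumption.
R_LO_PARAMS = {
    "hidari": {"p": -0.232, "theta": 0.027, "c": 0.966},
    "tikai": {"p": -1.24, "theta": 0.131, "c": 0.966},
    "toi": {"p": 1.226, "theta": -0.193, "c": 0.039},
}
RO_DIAMETER = 4.0


def r_lo(term, x_lo, lo_diameter):
    """Dual-object goodness-of-fit: position effect + size effect + constant.

    x_lo is the LO's distance from the RO, normalised so that the scene
    boundary is at 1.
    """
    params = R_LO_PARAMS[term]
    s_lo = lo_diameter / RO_DIAMETER
    return params["p"] * x_lo + params["theta"] * s_lo + params["c"]


for term in R_LO_PARAMS:
    near, far = r_lo(term, 0.25, 4), r_lo(term, 0.75, 4)
    print(f"{term:>6}: x=0.25 -> {near:.3f}, x=0.75 -> {far:.3f}")
```

The signs of the slope reproduce the qualitative pattern reported earlier: the values for tikai and hidari fall as the LO moves away from the RO, while the value for tôi rises.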

The correlation factor of $r_{LO}$ exceeds 0.98 for all spatial terms, which verifies good precision. In contrast, $r_{DO}$ fits very well for tôi (far from), but not for hidari (left) and tikai (near). For hidari (left), the model does not fit the data ($r_{DO}$ is largely negative) when the LO size is large and both the LO and DO are close to the RO. The correlation factor of the case with LO's size 1 and 2 increases to 0.510, which suggests room for further improvement of the model.

$\theta_{DO}$ is negative for tikai (near) and hidari (left), which works to degrade the goodness-of-fit of the LO. On the other hand, $\theta_{DO}$ of tôi (far from) is positive, which increases the goodness-of-fit of the LO. The absolute values of both $\theta_{LO}$ and $\theta_{DO}$ for hidari (left) are relatively smaller than the others. As the experiment revealed, the effect of the size of the LO is almost constant for hidari (left), suggesting it would be a specific characteristic of the projective terms.

4 Related Work

In previous studies of computational models of spatial cognition, the AVS (Attention

Vector Sum) model [5] introduced an attention vector on the Spatial Template for the

conﬁguration of two objects. This paper proposed to estimate LO’s goodness-of-ﬁt by

superimposing several different attention factors. The proposed computational model is

the sum of the LO’s goodness-of-ﬁt and the DO’s effect, making it similar to the RPM

for English term near.

In past studies, the salience of objects has been considered to come from their attributes

such as the size and colour. In the modeling of attention to the DO, however, those

object attributes are part of geometric factors affecting the spatial term’s goodness-of-

ﬁt. If we generalise the source of salience to consider the salience of objects affected by

the degree of attention to the objects, the salience factor varies depending on the DO’s

position as well for instance. In addition, assuming that attention to an object might

be affected by its linguistic referring expression, the salience factor must be redeﬁned

based on overall properties of objects involving multiple factors: the object size and

position, linguistic expressions, etc.

Carlson-Radvansky and Logan proposed a framework of basic steps for the spatial

cognition process [14]. They focused on the process of recognizing the simple two-

object relation. In the case involving a distractor object, we need to take into account

other factors of visual scenes, such as the speciﬁc attention model of each spatial term.

Subsequently, the other factors need to be considered in order to calculate the effect of

the DO against the LO. This paper contributes to revealing these other aspects of the spatial

cognition process.

5 Conclusions and Future Work

This paper proposed a computational model of the goodness-of-ﬁt of spatial terms.

The model incorporates attention to a distractor object, particularly, the effects of their

geometric factors. The proposed model was evaluated by using the experimental data

to conﬁrm its effectiveness.

The following is the agenda for future work.

– We need to extend the model to deal with the wider scope of geometric factors. For

instance, a situation involving multiple distractors and a situation where objects are

not aligned on a single line should be handled by the model.

– We need to conﬁrm if the model is robust against the change of the distractor size

and the reference object size.

– Modeling the change of viewpoint is another issue to be tackled. We analysed this

problem in our previous work [8] using the other two Japanese projective terms:

mae (in front of) and ushiro (back), but we have not incorporated the ﬁndings into

the model yet.

– Another challenge is using attention modeling to account for conventional usages

of spatial terms. Herskovits [15] analysed some conventional expressions of spatial

terms in association with the object functions and contexts. Some of these cases

could be explained within the scope of geometric factors. For instance, we say

“the cat is under the table.” instead of “the cat is in the table.”. The preference of

under over in could be explained by an attention model which captures the relations

among objects based on geometric factors (the shape of the table in this case) to

which human attention is directed. In this instance, the table top is more salient to

attract human attention, thus relation under could be preferred for describing the

relation between the cat and the table (top).

References

1. Hayward, W.G., Tarr, M.J.: Spatial language and spatial representation. Cognition 55 (1995)

39–84

2. Logan, G.D., Sadler, D.D.: A computational analysis of the apprehension of spatial relations.

In Bloom, P., Peterson, M.A., Nadel, L., Garrett, M., eds.: Language and Space. The MIT

Press (1996) 493–529

3. Gapp, K.P.: Basic meanings of spatial relations: Computation and evaluation in 3D space. In:

Proceedings of AAAI-94. (1994) 1411–1417

4. Kelleher, J., Kruijff, G.J., Costello, F.J.: Proximity in context: An empirically grounded com-

putational model of proximity for processing topological spatial expressions. In: Proceedings

of the 21st International Conference on Computational Linguistics and 44th Annual Meeting

of the Association for Computational Linguistics. (2006) 745–752

5. Regier, T., Carlson, L.A.: Grounding spatial language in perception: An empirical and com-

putational investigation. Journal of Experimental Psychology: General 130 (2001) 273–298

6. Tokunaga, T., Koyama, T., Saito, S., Nakajima, M.: Classification of Japanese spatial nouns.

In: Proceedings of 4th International Conference on Language Resources and Evaluation

(LREC 2004). (2004) 1829–1832

7. Coventry, K.R., Garrod, S.C.: Saying, Seeing, and Acting: The Psychological Semantics of

Spatial Prepositions. Psychology Press (2004)

8. Kobayashi, T., Terai, A., Tokunaga, T.: The effect of geometric factors on spatial term selection.

Cognitive Studies 15 (2008) 144–160 (in Japanese).

9. Kobayashi, T., Terai, A., Tokunaga, T.: On the effect of geometric factors on spatial term selection.

In: Proceedings of 14th Annual Meeting of Association of Natural Language Processing

(Japan). (2008) 689–692 (in Japanese).

10. Carlson, L.A., Logan, G.D.: Using spatial terms to select an object. Memory & Cognition

29 (2001) 883–892

11. Kelleher, J., van Genabith, J.: A computational model of the referential semantics of pro-

jective prepositions. In Saint-Dizier, P., ed.: Computational Linguistics: Dimensions of the

Syntax and Semantics of Prepositions. Kluwer Academic Press (2005) 211–228

12. Kojima, T., Kusumi, T.: The effect of the extra object on the linguistic apprehension of spatial

relationship between two objects. Spatial Cognition and Computation (2006) 145–160

13. Kelleher, J., Kruijff, G.J.: Incremental generation of spatial referring expressions in situated

dialogue. In: Proceedings of the 21st International Conference on Computational Linguistics

and 44th Annual Meeting of the Association for Computational Linguistics. (2006) 1041–

1048

14. Carlson-Radvansky, L.A., Logan, G.D.: The inﬂuence of reference frame selection on spatial

template construction. Journal of Memory and Language 37 (1997) 411–437

15. Herskovits, A.: On the spatial uses of prepositions. In: Proceedings of 18th Annual Meeting

of ACL. (1980) 1–5