
physical context in which it is involved, it is possible to apply a corresponding visual model. A very general starting point for building a visual model is to assume the existence of a ground plane and a gravity vector, which allows us to define the upward and downward directions. If we project the camera axis onto the ground plane, we can also define forward, backward, left, and right. We can further define altitude as the distance to the ground plane. A single model may also contain several horizontal planes; for example, if there is a table on the ground, other objects may rest on the table. Most objects (more precisely, non-flying objects) are either supported by a horizontal plane or accelerating under gravity. Supported objects have an almost constant altitude, and their vertical orientation is usually almost constant and sometimes predetermined.
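The directions defined by this model can be computed directly from a gravity estimate and the camera axis. The following Python sketch is a minimal illustration, assuming the gravity vector comes from an IMU and the camera axis from the camera's extrinsic calibration; the function names and vector layout are ours, chosen for the example.

```python
import numpy as np

def ground_frame(gravity, camera_axis):
    # "Up" is the direction opposite to gravity.
    up = -gravity / np.linalg.norm(gravity)
    # Project the camera axis onto the ground plane to obtain "forward"
    # (undefined if the camera points straight up or down).
    forward = camera_axis - np.dot(camera_axis, up) * up
    forward /= np.linalg.norm(forward)
    # "Right" completes the right-handed frame.
    right = np.cross(forward, up)
    return up, forward, right

def altitude(point, ground_point, up):
    # Altitude: signed distance from a point to the ground plane.
    return np.dot(point - ground_point, up)

up, fwd, right = ground_frame(np.array([0.0, 0.0, -9.8]),
                              np.array([0.7, 0.0, -0.1]))
```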
(iii) Temporal context: The kinematic models of the object's and the observer's movements define their relative positions at different time steps. Thus, if an object is detected at a given position at time step k, then it should appear at a predictable position at time step k+1.
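As a minimal sketch of such a kinematic model, the constant-velocity predictor below propagates a detected object's image position one step ahead; the state layout and time step are assumptions made for the example, and in practice the prediction would restrict the search region at step k+1.

```python
import numpy as np

# State at step k: [x, y, vx, vy] in image coordinates.
def predict(state, dt=1.0):
    F = np.array([[1.0, 0.0,  dt, 0.0],
                  [0.0, 1.0, 0.0,  dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    return F @ state

state_k = np.array([120.0, 80.0, 3.0, -1.0])
state_k1 = predict(state_k)   # where the object should appear at step k+1
```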
(iv) Objects’ configuration context: Physical objects are normally seen in specific spatial configurations or groups. For instance, a computer monitor is normally observed near a keyboard and a mouse, and a face, when detected in its normal upright pose, is seen above the shoulders and below the hair.
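Such configuration priors can be exploited, for instance, to re-score detections: the sketch below raises a detection's confidence when an expected companion object is found nearby. The co-occurrence table, distance threshold, and boost factor are illustrative assumptions, not values from any specific system.

```python
COMPANIONS = {"monitor": {"keyboard", "mouse"}}

def rescore(detections, radius=150.0, boost=1.2):
    rescored = []
    for d in detections:
        score = d["score"]
        for other in detections:
            near = abs(other["x"] - d["x"]) + abs(other["y"] - d["y"]) < radius
            if other is not d and near and \
                    other["label"] in COMPANIONS.get(d["label"], set()):
                score = min(1.0, score * boost)  # companion supports detection
        rescored.append({**d, "score": score})
    return rescored

dets = [{"label": "monitor",  "x": 100, "y": 50,  "score": 0.55},
        {"label": "keyboard", "x": 110, "y": 160, "score": 0.80}]
print(rescore(dets))  # the monitor's score rises; the keyboard's is unchanged
```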
(v) Scene context: In some specific cases, the scenes captured in images can be classified into defined types [8], for example “sunset”, “forest”, “office environment”, or “portrait”. This scene context, which can be determined using a holistic measurement of the image [1][2][7] and/or the objects detected in the same image, can contribute to the final detection or recognition of the image’s objects.
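One simple way this contribution can be realized is to weight each detector's output by the object's prior probability under the estimated scene type. The probabilities in the sketch below are illustrative, not values taken from the cited systems.

```python
# P(object | scene): illustrative priors for two scene types.
SCENE_PRIORS = {"office": {"monitor": 0.60, "tree": 0.01},
                "forest": {"monitor": 0.01, "tree": 0.70}}

def contextual_score(detector_score, obj, scene_probs):
    # Marginalize the object prior over the scene classifier's output.
    prior = sum(p * SCENE_PRIORS[scene].get(obj, 0.0)
                for scene, p in scene_probs.items())
    return detector_score * prior

scene_probs = {"office": 0.9, "forest": 0.1}  # holistic scene classification
print(contextual_score(0.5, "monitor", scene_probs))  # 0.5 * 0.541 = 0.2705
```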
(vi) Situation context: A situation is defined by the surroundings in which the observer is immersed (environment and place), as well as by the task being performed. An example of a situation context could be: “playing tennis on a red clay court, on a sunny day, at 3 PM”. The situation context is determined using several consecutive visual perceptions, as well as other sources of perceptual information (e.g. auditory) and high-level information (e.g. the task being carried out, the weather, the time of day).
In [12], the photometric context (the information surrounding the image acquisition process, mainly the intrinsic and extrinsic camera parameters) and the computational context (the internal processing state of the observer) are also defined. However, we believe that these do not correspond to contextual information in the sense we are defining it in this work.
Low-level context is frequently used in computer vision. Thus, most systems performing color or texture perception use low-level context to some degree (see for example [13]). Scene context has also been addressed in some computer vision [10] and image retrieval [4] systems. However, we believe that not enough attention has been given in robot and computer vision to the physical-spatial context, the temporal context, the objects’ configuration context, and the situation context.
Having as our main motivation the development of robust, high-performing robot vision systems that can operate in dynamic environments in real time, in this work we propose a generic vision system for a mobile robot with a mobile camera, which employs all of the defined spatiotemporal contexts. We strongly believe that, as in the case of human vision, contextual information is a key factor for achieving high performance in dynamic environments. Although other systems, as for example [1][3]