Learning Spatial Relations with a Standard Convolutional Neural
Network
Kevin Swingler (https://orcid.org/0000-0002-4517-9433) and Mandy Bath
Computing Science and Mathematics, University of Stirling, Stirling, FK9 4LA, Scotland
Keywords:
Convolutional Neural Networks, Spatial Reasoning, Computer Vision.
Abstract:
This paper shows how a standard convolutional neural network (CNN) without recurrent connections is able to
learn general spatial relationships between different objects in an image. A dataset was constructed by placing
objects from the Fashion-MNIST dataset onto a larger canvas in various relational locations (for example,
trousers left of a shirt, both above a bag). CNNs were trained to name the objects and their spatial relationship.
Models were trained to perform two different types of task. The first was to name the objects and their
relationships and the second was to answer relational questions such as “Where is the shoe in relation to
the bag?”. The models performed at above 80% accuracy on test data. The models were also capable of
generalising to spatial combinations that had been intentionally excluded from the training data.
1 INTRODUCTION
The ability to understand the relationships between
objects in an image is an important step towards a
complete solution to computer vision. Objects can
have many types of relationship, such as subject-
object or relative location. This paper describes work
aimed at allowing neural networks to learn about the
spatial relationships between pairs of objects in an im-
age and report both image labels and relationship la-
bels, for example “The bag is above the coat”. This
work is part of a project that aims to use computer vi-
sion to build assistive technology for the blind. The
project is called the Artificial Intelligence Sight Loss
Assistant (AISLA, https://www.aisla.org.uk/) and this work contributes to a
module designed to process simple questions about
the locations of objects in a room.
Convolutional Neural Networks (CNNs) have
proven to be very successful in recent years at a num-
ber of signal processing tasks including computer vi-
sion. Early CNNs were designed to classify a whole
image and so were only able to process a single object
at a time. Work on improving these algorithms con-
tinues. Examples include LeNet (LeCun et al., 1998),
AlexNet (Krizhevsky et al., 2012), VGG (Simonyan
and Zisserman, 2014), the Inception models (Szegedy
et al., 2015), (Szegedy et al., 2016), ResNet (He et al.,
2016), and ResNeXt (Xie et al., 2017). At the same
time, a number of large image collections have been
published with data identifying the objects in the im-
ages and their locations. Examples include the COCO
dataset (Lin et al., 2014), the Fashion-MNIST dataset
(Xiao et al., 2017) and the ImageNet database (Deng
et al., 2009).
The same convolutional idea is also applied to the
dual task of locating and labelling more than one ob-
ject in an image. This is collectively known as object
detection and well known object detection models in-
clude R-CNN (Girshick et al., 2014), Faster R-CNN
(Ren et al., 2015), and the series of YOLO models
(Redmon and Farhadi, 2017), (Redmon and Farhadi,
2018). Object detectors generate a list of object la-
bels and associated bounding boxes, locating them in
the image. Given the bounding boxes in an image,
some simple geometric calculations can be used to
test the relative positions of objects to each other in
the flat plane of the image. However, it is an inter-
esting question as to whether a CNN, with its focus
on local features, is capable of learning larger scale
spatial relationships among the objects in an image.
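For illustration, a minimal sketch of such a geometric test (in Python; the function name, the box format and the use of box centres are illustrative assumptions rather than the output format of any particular detector):

def relative_position(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    # with y increasing downwards as is usual for images.
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    horizontal = "left" if ax < bx else "right" if ax > bx else ""
    vertical = "above" if ay < by else "below" if ay > by else ""
    # For example "above left", or just "left" when vertically aligned.
    return (vertical + " " + horizontal).strip()

# relative_position((0, 0, 10, 10), (20, 20, 30, 30)) returns "above left".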
Some work has been carried out that attempts to
describe spatial relationships in images. Automated
image captioning uses a mixture of natural language
processing and computer vision to associate an im-
age with a descriptive sentence such as “A man walk-
ing on a beach with a dog”. Some of the descriptions
contain spatial relationships, “A vase on a table” for example, but as the words are generated statistically,
the spatial meaning is lost. Many of these models mix
CNNs with LSTMs to process the series of words that
make up a caption, see (Wang et al., 2016) for exam-
ple. Hossain et al. (Hossain et al., 2019) provide a
useful review of image captioning techniques.
In 2014, in his talk on what is wrong with con-
volutional neural nets, Geoff Hinton talked about the
limitations of max pooling and how CNNs can recog-
nise the right elements, but in the wrong order. For
example, it might detect two eyes, a nose and a mouth
and classify a face even if those elements are not ar-
ranged as a face. We were interested in whether or not
a CNN could be made to learn such spatial relation-
ships if the target outputs made them explicit. This is
slightly different from the point that Hinton was mak-
ing, but sparked the question all the same. Can a CNN
learn image-wide spatial relationships by simply en-
coding the name of such relationships at the output
layer?
There have been several attempts at explicitly ad-
dressing the challenge of learning spatial relation-
ships among objects in an image using an architec-
ture that adds spatial specific elements to the standard
CNN. Mao et al. (Mao et al., 2014) use a mixture of
recurrent network layers and convolutional layers in a
multimodal approach they call an m-CNN. The model
uses a statistical approach to produce words that form
sentences that describe images. Words are selected
from a probability model based on the image and the
previous words in the sentence.
More recently, Raposo et al. (Raposo et al., 2017)
proposed relation networks (RN) as a way of allow-
ing networks to learn about the relationships between
objects in a scene. The RN models the relationships
between pairs of objects and is used in conjunction
with CNNs and LSTMs for image and language pro-
cessing.
In this paper, we address the question of whether a
simple CNN architecture without recurrent or LSTM
components is able to learn relative spatial relation-
ships in images. The important question is whether
or not a CNN can learn to generalise concepts such
as above or below from example images without an
architecture that is specifically designed to capture
those relationships. This was done by generating im-
ages with a small number of object classes arranged
in a variety of spatial configurations while ensuring
that some combinations did not appear in the training
data. When tested, the model was able to correctly re-
port the relative locations of object combinations that
were absent from the training data.
The motivation for the work is to add specific out-
put nodes to a CNN, each of which refers to a defined concept.
In this case, the concepts describe relative locations,
but in future work they might describe an action or
even an intention. Rather than generating a sentence
(such as the girl is drinking the milk) that requires
further post-processing to extract meaning, we aim
to generate meaningful outputs directly from the im-
age (object=girl, subject=milk, verb=drink, for exam-
ple). We want to be able to use a single architecture (a
standard CNN) and change only the output targets to
be able to train on different meaningful relationships
among objects in an image.
The remainder of this paper is organised as fol-
lows. Section 2 describes the preparation of the train-
ing, validation and test data. Sections 3 and 4 de-
scribe two experiments with CNNs for spatial relation
recognition. Finally, section 5 provides an analysis of
the results and some ideas for further work.
2 DATA PREPARATION
The datasets for the experiments were constructed us-
ing the Fashion-MNIST dataset (Xiao et al., 2017).
Images in this collection are 28 by 28 pixels in size
and they were used to generate larger images of 56 by
56 pixels by pasting two or three of the original im-
ages onto a larger canvas. The original images were
selected at random without replacement and placed on
the larger canvas in randomly chosen non-overlapping
locations. In this way, any two objects in an image
could have a clearly defined spatial relationship from
the set {above, below, left, right, above left, below
left, above right, below right}. The data was auto-
matically labelled using the algorithm that generated
it, making the process of producing large quantities
of data very efficient. Figure 1 shows two example
images. Note that the Fashion-MNIST images are 28
by 28 pixels, so the figures in this paper represent the
quality of those images accurately.
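For illustration, a minimal sketch of this compositing step (using NumPy; the function name and the rejection-sampling placement are illustrative assumptions rather than the exact generator used here):

import numpy as np

def make_composite(img_a, img_b, rng, canvas_size=56, obj_size=28):
    # Paste two 28 x 28 objects onto a blank canvas at random,
    # non-overlapping locations and return the canvas and the offsets.
    canvas = np.zeros((canvas_size, canvas_size), dtype=img_a.dtype)
    while True:
        ya, xa = rng.integers(0, canvas_size - obj_size + 1, size=2)
        yb, xb = rng.integers(0, canvas_size - obj_size + 1, size=2)
        if not (abs(xa - xb) < obj_size and abs(ya - yb) < obj_size):
            break  # the two bounding boxes do not overlap
    canvas[ya:ya + obj_size, xa:xa + obj_size] = img_a
    canvas[yb:yb + obj_size, xb:xb + obj_size] = img_b
    return canvas, (ya, xa), (yb, xb)

The spatial relationship label can then be derived from the two offsets in the same way as from bounding boxes.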
The Fashion-MNIST object classes are: T-
shirt/top, Trouser, Pullover, Dress, Coat, Sandal,
Shirt, Sneaker, Bag, Ankle boot. There are 60,000
training images and 10,000 test images in the dataset.
All of the images are provided in grey scale so they
do not have a colour dimension.
The training/validation/test protocol was as fol-
lows. The original Fashion-MNIST training data were split
into 20% test data and 80% training data before the
composite images were generated. The 80% used for
training were further split into 80% train and 20% val-
idation sets, all before the composite images were cre-
ated. Consequently, the train, validation and test data
share no original images in common. What is more,
on different training runs, certain combinations of ob-
ject class and spatial relationship (shirt above bag, for example) were actively excluded to ensure that the validation and test data contained novel combinations.
Figure 1: Two example training images. The first showing a sneaker below left of a pullover and the other showing a t-shirt below left of a dress. The categories were defined so that objects were either perfectly aligned, meaning that one is to the left of the other, or offset vertically, meaning that one is below left or below right of the other.
In total there were 48,000 training examples,
12,000 validation examples and 10,000 test examples
created. The validation data were used to tune the hy-
perparameters of the CNN and the test data were used
to report the final accuracy values.
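A minimal sketch of this splitting protocol (using NumPy; the index-based splitting and the hard-coded excluded pair are illustrative assumptions):

import numpy as np

EXCLUDED_PAIRS = {frozenset({"Bag", "Trouser"})}  # held out of training

def split_source_items(n_items, rng):
    # Split the original Fashion-MNIST items 80/20 into (train+val)/test,
    # then split the 80% again 80/20 into train/val, all BEFORE any
    # composite image is built, so no source item is shared across sets.
    order = rng.permutation(n_items)
    n_test = n_items // 5
    test_idx, rest = order[:n_test], order[n_test:]
    n_val = len(rest) // 5
    return rest[n_val:], rest[:n_val], test_idx  # train, val, test

def allowed_in_training(label_a, label_b):
    return frozenset({label_a, label_b}) not in EXCLUDED_PAIRS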
Two different experiments were performed, each
on a different variant of the spatial relationship la-
belling task. The first trains a model to produce the
class names of two objects in an image along with the
name of the spatial relationship between them. The
second takes an image with three objects in it along
with the names of two of them and produces the name
of the relationship between the two named objects as
output. The experiments are described in the next two
sections.
3 EXPERIMENT ONE - TWO
OBJECTS
This section describes the first experiment, in which
a CNN is trained to describe an image containing two
objects. The images in the training data each contain
two objects from the fashion MNIST data set and the
target output for each consists of the two labels plus
the name of the spatial relationship between them.
Each image forms the input part of a training point
and the output is a vector of 24 binary values. The first
10 form a one-hot encoding of the name of the first
object in the image. The second 10 use the same rep-
resentation for the second object in the image and the
final four represent the four possible spatial relation-
ships of the first object to the second. They are in the
set {left, above, below left, below right}. There
is no need to encode the relationships to the right of
or below as they can be represented by switching the
order of the object labels. For example, a coat to the
left of a bag would encode coat, bag, left and a coat to
the right of a bag would encode bag, coat, left. Con-
sequently, each output vector contains exactly three
values set to one and the rest at zero.
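For illustration, the target encoding just described can be sketched as follows (using NumPy; the orderings of the class and relationship lists are illustrative assumptions):

import numpy as np

CLASSES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
RELATIONS = ["left", "above", "below left", "below right"]
# Relationships outside the reduced set map to their mirror image
# by swapping the order of the two objects.
MIRROR = {"right": "left", "below": "above",
          "above right": "below left", "above left": "below right"}

def encode_target(first_obj, second_obj, relation):
    # 24 binary values: one-hot first object, one-hot second object,
    # one-of-four relationship. Exactly three entries are set to one.
    if relation in MIRROR:
        first_obj, second_obj = second_obj, first_obj
        relation = MIRROR[relation]
    target = np.zeros(24)
    target[CLASSES.index(first_obj)] = 1.0
    target[10 + CLASSES.index(second_obj)] = 1.0
    target[20 + RELATIONS.index(relation)] = 1.0
    return target

# "Coat right of Bag" and "Bag left of Coat" produce the same target.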
3.1 Network Architecture
A convolutional neural network architecture was used
as follows. The input layer takes grey level images of
56 by 56 pixels, so has a volume of 56×56×1. There
are then three convolutional layers, the size of which
is explored as one of the architectural hyperparame-
ters. Each one of the three layers uses batch normali-
sation, max pooling and dropout, all optimised during
the training and validation process. All layers except
the output layer use ReLU activation functions as this
is well established in the literature as a good choice
for speed and stability.
There is then a fully connected layer that is itself
fully connected to the final, output layer, which uses
logistic activation functions paired with a cross en-
tropy loss function. Softmax activation here would
not be suitable as there are always three nodes with a
target value of one, and softmax forces all the outputs
to sum to one.
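A minimal Keras sketch of an architecture of this shape (the filter counts, kernel sizes and dense layer width are illustrative placeholders, not the tuned values reported below):

from tensorflow import keras
from tensorflow.keras import layers

def build_two_object_model(num_filters=64, dropout=0.3):
    # Three blocks of conv -> batch norm -> max pool -> dropout,
    # one dense layer, and 24 logistic (sigmoid) outputs trained
    # with a cross entropy loss.
    inputs = keras.Input(shape=(56, 56, 1))
    x = inputs
    for kernel in (3, 3, 11):  # larger kernel in the final block
        x = layers.Conv2D(num_filters, kernel, padding="same",
                          activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(dropout)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(24, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model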
During training, the hyperparameters were tuned
using a mixture of manual and hyperband search (Li
et al., 2017). The hyperparameters, their possible
settings and search method are shown in Table 1.
The manual search involved the researcher making a
judgement about how to set the hyperparameter based
on previously built models from the same set of ex-
periments. It is fair to say that the model was not
fully optimised, but the purpose of the work is to test
whether or not a CNN can learn spatial relationships
rather than squeezing the last percentage of accuracy
from the data.
Table 1: The hyperparameter search space for the convo-
lutional neural network. In the Search column, H means
hyperband and M means manual.
Hyperparameter      Options                  Search
Number of filters   32, 64, 128              H
Kernel size         3 × 3, 5 × 5, 11 × 11    M
Optimiser           Adam, SGD                H
Learning rate       0.01, 0.001, 0.0001      M
LR decay            Yes, No                  M
Batch size          1, 32, 64, full          M
Dropout             0.2, 0.3, 0.4, 0.5       M
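For illustration, the hyperband part of the search could be run as in the following sketch (this assumes the KerasTuner library, which is not necessarily the implementation used here; only the options marked H in Table 1 are searched and the model body is heavily simplified):

import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    filters = hp.Choice("num_filters", [32, 64, 128])    # H in Table 1
    optimiser = hp.Choice("optimiser", ["adam", "sgd"])  # H in Table 1
    inputs = keras.Input(shape=(56, 56, 1))
    x = layers.Conv2D(filters, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(24, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=optimiser, loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.Hyperband(build_model, objective="val_accuracy",
                     max_epochs=30, directory="tuning",
                     project_name="spatial_relations")
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))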
During training, early stopping was used if the validation accuracy stopped improving. The first two convo-
lutional layers used small kernels, as is common in
the literature, but for the third layer we experimented
with larger kernels to allow the network to capture
broader spatial relationships. Peng et al. (Peng et al.,
2017) describe the advantages of larger kernels when
performing object localisation. Learning rates started
at the values stated in Table 1, but learning rate decay
was also found to be advantageous.
3.2 Hyperparameter Search Results
The result of the hyperparameter search was a model with the architecture shown in Figure 2. The
model was trained for 162 epochs before early stop-
ping. The optimal hyperparameters were found to be
a learning rate of 0.001 with learning rate decay on
each epoch. The optimal batch size was found to be
32 and the dropout rate was 0.3. ReLU activation
functions were used throughout except at the output
layer, where logistic functions were used. Adam was
found to be more effective than stochastic gradient de-
scent (SGD) as the optimiser.
It was found that using smaller kernels at the final
convolutional layer adversely affected the accuracy of
the model, suggesting that the larger kernel is needed
to capture positional relationships. This is not a sur-
prising finding as these relationships naturally occur
over large distances. Being at later layers, the larger
kernels are able to operate on smaller input volumes
than the full input image while still spanning the full
scope of the original image. This is because the archi-
tecture of the network shrinks the input size at each
layer with max pooling so a single cell in layer three
covers the information in 4 pixels of the original im-
age.
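As a rough worked example (the exact pooling configuration is given only in Figure 2, so these numbers are illustrative): if each of the first two convolutional blocks is followed by 2 × 2 max pooling, the 56 × 56 input is reduced to a 14 × 14 feature map at the third layer, where each cell corresponds to roughly 4 pixels of the original image in each direction. An 11 × 11 kernel at that layer therefore spans roughly 44 of the 56 pixels of the original canvas in each direction, which is wide enough to respond to both objects at once.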
3.3 Results
After 162 training epochs, early stopping caused the
process to terminate. The training accuracy was 0.93
and the test accuracy was 0.83. The specific combi-
nation of the objects bag and trousers was excluded
from the training data but included in the test data.
The test data images containing a bag and a pair of
trousers were then used to test the model’s ability to
generalise to combinations of object and location that
were not in the training data. The model was able to
correctly give the object labels and relative locations
for these images. An example image is shown in Fig-
ure 3.
Figure 2: The architecture of the relationship detector CNN.
Figure 3: An example output from the model showing the correct response to an image containing a pair of trousers to the left of a bag. This was a combination of objects that was explicitly excluded from the training data.
4 EXPERIMENT TWO - THREE OBJECTS
Figure 4: An example input and output encoding. In this example, the relationship “bag left of ankle boot” is represented.
In the first experiment described above, the task for the CNN was to name the classes of two objects in an image and label the spatial relationship between
them. In the second experiment, the task is to take an
image containing three objects, along with the labels
for two of them as an input and generate the nature of
the relationship between them as the output. This is
equivalent to answering a question like “Where is the
coat in relation to the bag?”. It presupposes that you
know the identity of two of the objects in the image.
However, the presence of a third object in the image
means that this is not simply a question of identify-
ing where the objects are. The network must identify
which are the two objects of interest and then gener-
ate the relationship between them. Figure 4 shows an
example input image.
The data representation for this task consists of the
image as a 56×56 array of grey level pixel values plus
two vectors of ten, each with a one-hot encoding of
the object class to include in the question. We do not
attempt to process whole sentences, such as “What is
to the left of the bag?”, but encode the question ex-
plicitly as two words and an image. The output rep-
resents the 8 possible relationships of the second ob-
ject to the first from the set {above, below, left, right,
below left, below right, above left, above right}. As
with the inputs, we are not trying to generate full sen-
tences, just the answer to the encoded question. Of
course, it is trivial, given the two input object labels
and the name of their relationship, to generate a mean-
ingful sentence from the output.
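For illustration, the question and answer encodings can be sketched as follows (using NumPy; the orderings of the class and relationship lists are illustrative assumptions):

import numpy as np

CLASSES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
RELATIONS = ["above", "below", "left", "right",
             "below left", "below right", "above left", "above right"]

def encode_question(obj_a, obj_b):
    # The "question" is just two one-hot class vectors; no sentence
    # parsing is involved.
    q_a, q_b = np.zeros(10), np.zeros(10)
    q_a[CLASSES.index(obj_a)] = 1.0
    q_b[CLASSES.index(obj_b)] = 1.0
    return q_a, q_b

def encode_answer(relation):
    # One-of-eight target for the softmax output layer.
    y = np.zeros(8)
    y[RELATIONS.index(relation)] = 1.0
    return y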
Training images were generated by placing three
random images from the Fashion-MNIST training
data in non-overlapping locations on a 56 × 56 pixel
blank canvas. Two of the objects were selected at ran-
dom and their spatial relationship calculated and en-
coded as the target output. Note that each training im-
age represents only one of the possible relationships
it contains. In Figure 4, for example, the coat object is not represented in the output.
Figure 5: The architecture of the three object CNN.
The architecture for the model has three channels
that merge in the first fully connected layer. One uses
the standard CNN components of convolutions, max
pooling, ReLU, and batch normalisation. The other
two channels simply feed the one-hot encoded rep-
resentations of the question words into the first fully
connected layer. This layer merges the three channels and feeds a final fully connected layer, followed by a softmax output layer that one-hot encodes the target relationship label. The architecture is
shown in Figure 5.
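A minimal Keras sketch of a three-channel model of this shape (the layer sizes are illustrative placeholders rather than the exact configuration of Figure 5):

from tensorflow import keras
from tensorflow.keras import layers

def build_question_model(dropout=0.3):
    # Image channel: standard conv / batch norm / max pool / dropout blocks.
    image_in = keras.Input(shape=(56, 56, 1), name="image")
    x = image_in
    for kernel in (3, 3, 11):
        x = layers.Conv2D(64, kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(dropout)(x)
    x = layers.Flatten()(x)
    # Question channels: the two one-hot class vectors are fed straight in.
    word_a = keras.Input(shape=(10,), name="object_a")
    word_b = keras.Input(shape=(10,), name="object_b")
    merged = layers.Concatenate()([x, word_a, word_b])
    h = layers.Dense(128, activation="relu")(merged)
    outputs = layers.Dense(8, activation="softmax")(h)
    model = keras.Model([image_in, word_a, word_b], outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model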
4.1 Results
After a hyperparameter search following the protocol
described in section 3.1, a network with the architec-
ture shown in Figure 5 was trained until early stop-
ping at 100 epochs.
Figure 6: The model correctly labelling a bag above the trousers. The training data contained no examples of this combination.
The learning rate was initially 0.001 and learning rate decay was used to decrease it further
after each epoch. Convolutional layers had ReLU ac-
tivation functions and the output was softmax. The
dropout rate was 0.3 and Adam was used for the opti-
miser. Categorical cross entropy was used for the cost
function.
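For illustration, the per-epoch learning rate decay and early stopping could be implemented with standard Keras callbacks as in the following sketch (the decay factor and patience are illustrative assumptions, as they are not stated above):

from tensorflow import keras

def decayed_lr(epoch, lr):
    # Multiply the current learning rate by a fixed factor every epoch.
    return lr * 0.95

callbacks = [
    keras.callbacks.LearningRateScheduler(decayed_lr),
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                  restore_best_weights=True),
]
# model.fit([images, words_a, words_b], answers, epochs=200,
#           validation_data=val_data, callbacks=callbacks)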
The final model achieved an accuracy of 94% on
the training data and 86% on the unseen test data. As
before, images containing a bag and a pair of trousers
were missing from the training data, but were explic-
itly included for testing. Figure 6 shows an example
of the model correctly generalising to generate an an-
swer to the question “Where is the bag in relation to
the trousers?”.
5 CONCLUSION
This short piece of work shows that a standard convo-
lutional neural network is capable of learning image-
wide concepts about the relative locations of objects
in an image. The architecture of the CNN uses a com-
bination of max pooling layers that shrink the input
dimensions at each step and a large kernel at the final
layer. This allows broad scale relationships across the
original image to be captured in smaller feature maps.
We speculate that the large kernel at the final convolu-
tional layer of the network is responsible for learning
the relative locations. There is evidence for this in the
fact that reducing the size of the final kernels reduces
the ability of the network to correctly label relative
locations of objects in an image.
The second of the two experiments shows that a
very simple form of visual question answering
can be implemented with a standard CNN by encod-
ing the question at the inputs and the correct answer
at the outputs. The input encodes both the image and
the question, which specifies which elements of the
image should be used to generate the answer. So the
question “Here is a picture that contains a bag, a coat
and a shoe, but where is the shoe in relation to the
bag?” can be answered with a one-of-eight encoding
of relative locations.
There is plenty of scope to extend this work.
Larger images with more objects could be introduced
and the images could be placed on a noisy background
rather than a plain white one. Real images could
also be used. For example, the well known COCO
dataset (Lin et al., 2014) has object labels and bound-
ing boxes in its annotations. The relative locations of
objects could be roughly inferred from the bounding
box coordinates. Other relationship types could also
be introduced, for example “on” as in “the lamp is on
the desk” or “in front” or “behind”.
The primary application that motivates this work
is the development of a personal assistant technology
for people with sight loss. Fixed cameras in a home
could be used to answer location based questions such
as “where did I leave my radio?” or to warn a user
about a potential obstacle or danger. Other appli-
cations that require a specific representation of rela-
tive locations include self-driving cars and warehouse
picking robots.
The difference between this approach, where spe-
cific relationships are one-hot encoded as specific
nodes, and the sentence generating approaches de-
scribed in section 1 is that the relationship is made
explicit in the representation. With the three labels:
“Coat”, “Left”, “Bag”, one can easily generate a cap-
tion: “The coat is to the left of the bag” but one can
also answer other questions such as “What is to the
left of the bag?” or “Where is the coat?” without the
need to use natural language understanding to decode
an automatically generated caption. The facts of the
image are made explicit.
REFERENCES
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
IEEE.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H.
(2019). A comprehensive survey of deep learning for
image captioning. ACM Computing Surveys (CSUR),
51(6):1–36.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and
Talwalkar, A. (2017). Hyperband: A novel bandit-
based approach to hyperparameter optimization. The
Journal of Machine Learning Research, 18(1):6765–
6816.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
Mao, J., Xu, W., Yang, Y., Wang, J., and Yuille, A. L.
(2014). Explain images with multimodal recurrent
neural networks. arXiv preprint arXiv:1410.1090.
Peng, C., Zhang, X., Yu, G., Luo, G., and Sun, J. (2017).
Large kernel matters–improve semantic segmentation
by global convolutional network. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 4353–4361.
Raposo, D., Santoro, A., Barrett, D., Pascanu, R., Lillicrap,
T., and Battaglia, P. (2017). Discovering objects and
their relations from entangled scene representations.
arXiv preprint arXiv:1702.05068.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster,
stronger. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7263–
7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-
CNN: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wo-
jna, Z. (2016). Rethinking the inception architecture
for computer vision. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2818–2826.
Wang, C., Yang, H., Bartz, C., and Meinel, C. (2016). Im-
age captioning with deep bidirectional LSTMs. In
Proceedings of the 24th ACM international confer-
ence on Multimedia, pages 988–997.
Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-
MNIST: a novel image dataset for benchmark-
ing machine learning algorithms. arXiv preprint
arXiv:1708.07747.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017).
Aggregated residual transformations for deep neural
networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1492–
1500.