Indexed Operations for Non-rectangular Lattices Applied to

Convolutional Neural Networks

Mikael Jacquemont

1,2

, Luca Antiga

, Thomas Vuillaume

, Giorgia Silvestri

, Alexandre Benoit

Patrick Lambert

and Gilles Maurin

Laboratoire d’Annecy de Physique des Particules, CNRS, Univ. Savoie Mont-Blanc, Annecy, France

LISTIC, Univ. Savoie Mont-Blanc, Annecy, France

Orobix, Bergamo, Italy

Keywords:

Deep learning, Kernel, Convolution, Image Analysis.

Abstract:

The present paper introduces convolution and pooling operators for indexed images. These operators can be

used on images that do not provide Cartesian grids of pixels, as long as a list of neighbor’s indices can be

provided for each pixel. They are foreseen being useful for convolutional neural networks (CNN) applied to

special sensors, especially in science, without requiring image pre-processing. The present work explains the

method and its implementation in the Pytorch framework and shows an application of the indexed kernels to

the classiﬁcation task of images with hexagonal lattices using CNN. The obtained results show that the method

gives the same performances as the standard convolution kernels. Indexed convolution thus makes deep neural

network frameworks more general and capable of addressing unconventional image lattices.

The current implementation, as well as code to reproduce the experiments described in this paper are made

available as open-source resources on the repository www.github.com/IndexedConv.

1 INTRODUCTION

Traditional convolutional kernels have been devel-

oped for rectangular and regular pixel grids as found

in traditional images. However, some imaging sen-

sors present different shapes and do not have regu-

larly spaced nor rectangular pixel lattices. This is

particularly the case in science experiments where

sensors use various technologies and must answer

speciﬁc technological needs. Examples (displayed

in ﬁgure 1) of such sensors in physics include the

XENON1T experiment (Scovell, 2013), the KM3NeT

experiment (Katz, 2012) or Imaging Atmospheric

Cherenkov Telescope (IACT) cameras such as the

ones of H.E.S.S. (Bolmont et al., 2014) or the

Cherenkov Telescope Array (CTA) (NectarCam, Gli-

censtein et al. 2013; LSTCam, Ambrosi et al. 2013;

and FlashCam, Pühlhofer et al. 2012).

A traditional approach to overcome this and use

traditional convolution neural network framework out

of the box is to over-sample the image into a Carte-

sian grid. For regular lattices, such as hexagonal ones,

it is also possible to apply geometrical transforma-

tion to the images to shift them into Cartesian grids.

In that case, masked convolutions can be used to re-

Figure 1: Example of physics experiments presenting non-

Cartesian sensor grids. On the left, the XENON1T photo-

multiplier tube layout. Credit: XENON Collaboration. On

the right, an image from the H.E.S.S. camera. Credit: The

H.E.S.S. collaboration.

spect the original layout of the images, like in Hex-

aConv (Hoogeboom et al., 2018). In this paper, the

authors present group convolutions for square pixels

and hexagonal pixel images. A group convolution

consists in applying several transformations (e.g. ro-

tation) to the convolution kernel to beneﬁt from the

axis of symmetry of the images. In the hexagonal grid

case they use masked convolutions applied to hexago-

nal pixel images represented in the Cartesian grid (via

shifting).

362

Jacquemont, M., Antiga, L., Vuillaume, T., Silvestri, G., Benoit, A., Lambert, P. and Maurin, G.

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks.

DOI: 10.5220/0007364303620371

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 362-371

ISBN: 978-989-758-354-4

However, such approaches may have several

drawbacks:

• oversampling or geometric transformation may

introduce distortions that can potentially result in

lower accuracy or unexpected results;

• oversampling or geometric transformation impose

additional processing, often performed at the CPU

level which slows inference in production;

• geometric transformation with masked convolu-

tion adds unnecessary computations as the mask

has to be applied to the convolution kernel at each

iteration;

• oversampling or geometric transformations can

change the image shape and size.

In order to prevent these issues and be able to work

on unaltered data, we present here a way to apply

convolution and pooling operators to any grid, given

that each pixel neighbor is known and provided. This

solution, denoted indexed operations in the follow-

ing, driven by scientiﬁc applications, is applied to an

hexagonal kernel since this is one of the most com-

mon lattice besides the Cartesian one. However, in-

dexed convolution and pooling are very general solu-

tions, easily applicable to other domains with irregu-

lar grids.

At ﬁrst, a reminder of how convolution and pool-

ing work and are usually implemented is done. Then

we present our custom indexed kernels for convolu-

tion and pooling. This solution is then applied to stan-

dard datasets, namely CIFAR-10 and AID to validate

the approach and test performances. Finally, we dis-

cuss the results obtained as well as potential applica-

tions to real scientiﬁc use cases.

2 CONVOLUTION

2.1 Background

Convolution is a linear operation performed on data

over which neighborhood relationships between ele-

ments can be deﬁned. The output of a convolution

operation is computed as a weighted sum (i.e. a dot

product) over input neighborhoods, where the weights

are reused over the whole input. The set of weights is

referred to as the convolution kernel. Any input data

can be vectorized and then a general deﬁnition of con-

volution can be deﬁned as:

∑

k=1

(1)

where K is the number of elements in the kernel, w

is the value of the k-th weight in the kernel, and N

is the index of the k-th neighbor of the j-th neighbor-

hood.

This general formulation of discrete convolution

can be then made more speciﬁc for data over which

neighborhood relationships are inherent in the struc-

ture of the input, such as 1D (temporal), 2D (image)

and 3D (volumetric) data. For instance, in the case of

classical images with square pixels, we deﬁne convo-

lution as:

i j

∑

k=−W

∑

h=−H

(i−k)( j−h)

(2)

where the convolution kernel is a square matrix of

size (2W + 1, 2H +1) and neighborhoods are implic-

itly deﬁned through corresponding relative locations

from the center pixel. Analogous expressions can be

deﬁned in N dimensions.

Since the kernel is constant over the input, i.e. its

values do not depend on the location over the input,

convolution is a linear operation. In addition, it has

the advantage of accounting for locality and transla-

tion invariance, i.e. output values solely depend on

input values in local neighborhoods, irrespective of

where in the input those values occur.

Convolution cannot be performed when part of the

neighborhood cannot be deﬁned, such as at the bor-

der of an image. In this case, either the correspond-

ing value in the output is skipped, or neighborhoods

are extended beyond the reach of the input, which is

referred to as padding. Input values in the padded re-

gion can be set to zero, or reproduce the same values

as the closest neighbors in the input.

It is worth noting that the convolution can be com-

puted over a subset of the input elements. On reg-

ular lattices this results in choosing one every n ele-

ments in each direction, an amount generally referred

as stride. The larger the stride, the smaller the size of

the output.

The location of neighbors in convolution kernels

does not need to be adjacent, as it is in the image for-

mulation above. Following the ﬁrst expression, neigh-

borhoods can be deﬁned arbitrarily, in terms of shape

and location of the neighbors. In case of regular lat-

tices the amount of separation between the elements

of a convolution kernel in each direction is referred to

as dilation or atrous convolution (Holschneider et al.,

1990). The larger the dilation, the further away from

the center the kernel reaches out in the neighborhood.

In case of inputs with multiple channels such as an

RGB images, or multiple features in intermediate lay-

ers in a neural network, all input channels contribute

to the output and convolution is simply obtained as

the sum of dot products over all the individual chan-

nels to produce output values. Equation 3 shows the

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

363

2D image convolution case with C input channels.

i j

∑

c=1

∑

k=−W

∑

h=−H

ckh

c(i−k)( j−h)

(3)

Therefore, the size of kernels along the channel di-

rection determines the number of input features that

the convolution operation expects, while the number

of individual kernels employed in a neural network

layer determines the number of features in the output.

2.2 Implementation

In neural network applications convolutions are per-

formed over small spatial neighborhoods (e.g. 3 × 3,

5 × 5 for 2D images). Given the small size of the el-

ements in the dot product, the most computationally

efﬁcient strategy for computing the convolution is not

an explicitly nested loop as described on equation 2,

but a vectorized dot product over all neighborhoods.

Then, as most deep learning frameworks intensively

do, one can make use of the highly optimized ma-

trix multiplication operators available in linear alge-

bra libraries (van de Geijn and Goto, 2011). Let us

consider the im2col operation that transforms any in-

put (1D, 2D, 3D and so on) into a 2D matrix where

each column reports the values of the neighbors to

consider for each of the input samples (respectively,

time stamp, pixel, voxel and so on) as illustrated in the

example given in ﬁgure 2. Given this layout, convolu-

tion consists in applying the dot product of each col-

umn with the corresponding ﬂattened, columnar ar-

rangement of weights of the convolution kernel. Per-

forming the dot product operation over all neighbor-

hoods amounts to a matrix multiplication between the

column weights and the column image.

Figure 2: Example of pixel neigborghood arrangements for

a 3 × 3 kernel.

In multiple channel case (see equation 3 for C

input channels), all input channels contribute to the

output. At the column matrix level, this translates

into stacking individual columns from all channels

along a single column, and similarly for the kernel

weights. Conversely, in order to account for multi-

channel output, multiple column matrices are consid-

ered, or, equivalently, the column matrix and the cor-

responding kernel weights have an extra dimension

along C

out

In this setting, striding consists in applying the

im2col operation on a subset of the input, while di-

lation consists in building columns according to the

dilation factor, using non-immediate neighbors. Last,

padding can be achieved by adding zeros (or the

padded values of choice) along the columns of the

column matrix.

Owing to the opportunity for vectorization and

cache friendliness of the general matrix multiply op-

erations (GEMM), the resulting gains in efﬁciency

outweigh the additional memory consumption due to

duplication of values in the column image, since ev-

ery value in the input image will be replicated in as

many locations as the neighbors it participates to (see

ﬁgure 2).

The im2col operation is easily reversible. This

will be considered for deep neural networks training

steps where the backward gradient propagation is ap-

plied in order to optimize the network parameters.

3 INDEXED KERNELS

Given the general interpretation of convolution and its

implementation as given in the previous sections, the

extension of convolution from rectangular lattices to

more general arrangements is now straightforward.

Given an input vector of data and a matrix of

indices describing every neighborhood relationships

among the elements of the input vectors, a column

matrix is constructed by picking elements from the

input vector according to each neighborhood in the

matrix of indices. Analogously to the case of rect-

angular lattices, neighborhoods from different input

channels are concatenated along individual columns,

as are kernel weights. At this point, convolution can

be computed as a matrix multiplication.

We will now show how the above procedure can

be performed in a vectorized fashion by resorting to

advanced indexing. Modern multidimensional array

frameworks, such as NumPy, TensorFlow and Py-

Torch, implement advanced indexing, which consists

in indexing multidimensional arrays with other multi-

dimensional arrays of integer values. The integer ar-

rays provide the shape of the output and the indices

at which the output values must be picked out of the

input array.

In our case, we can use the matrix of indices de-

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

364

scribing neighborhoods in order to index into the in-

put tensor, producing the column matrix in one pass,

both on CPU and GPU devices. Since the indexing

operation is differentiable with respect to the input

(but not with respect to the indices), a deep learn-

ing framework equipped with automatic differenti-

ation capabilities (like PyTorch or TensorFlow) can

provide the backward pass automatically as needed.

We will now present a PyTorch implementation of

such indexed convolution in a hypothetical case.

We consider in the following example an input

tensor with shape B, C

, W

, where B is the batch size

equal to 1, C

is the number of channels equal to 2,

or features, and W

is the width equal to 5, i.e. the

number of elements per channel,

input = torch.ones(1, 2, 5)

and a speciﬁcation of neighbors as an indices tensor

with shape K, W

out

, where K is the size of the con-

volution kernel equal to 3 and W

out

equal to 4 is the

number of elements per channel in the output

indices = torch.tensor([[ 0, 0, 3, 4],

[ 1, 2, 4, 0],

[ 2, 3, 0, 1]])

where values, arbitrarily chosen in this example, rep-

resent the indices of 4 neighborhoods of size 3 (i.e.

neighborhoods are laid out along columns). The num-

ber of columns corresponds to the number of neigh-

borhoods, i.e. dot products, that will be computed

during the matrix multiply, hence they correspond to

the size of the output per channel.

The weight tensor describing the convolution ker-

nels has shape [C

out

, C

, K], where C

out

equal to 3 is

the number of channels, or features, in the output. The

bias is a column vector of size C

out

weight = torch.ones(3, 2, 3)

bias = torch.zeros(3)

At this point we can proceed to use advanced in-

dexing to build the column matrix according to in-

dices.

col = input[..., indices]

Here we are indexing a B, C

, W

tensor with a

K, W

out

tensor, but the indexing operation has to pre-

serve batch and input channels dimensions. To this

end, we employ the ellipsis notation ..., which pre-

scribes indexing to be replicated over all dimensions

except the last. This operation produces a tensor

shaped B, C

, K, W

out

, i.e. 1, 2, 3, 4.

As noted above, the column matrix needs values

from neighborhoods for all input channels concate-

nated along individual columns. This is achieved by

reshaping the col tensor so that C

and K dimensions

are concatenated:

B = in pu t . sh ap e [ 0]

W_ ou t = i ndi ce s . sh ap e [1 ]

col = col . v ie w (B , -1 , W_ out )

The columns in the col tensor are now a concate-

nation of 3 values (the size of the kernel) per input

channel, resulting in a B, K · C

, W

out

. Note that the

col tensor is still organized in batches.

At this point, weights must be arranged so that

weights from different channels are concatenated

along columns as well:

C_ ou t = w eig ht . s hap e [0]

we ig ht _co l = we igh t . vie w ( C_o ut , -1)

which leads from a C

out

, C

, K to a C

out

, K ·C

ten-

sor.

Multiplying the weight_col and col matrices will

now perform the vectorized dot product correspond-

ing to the convolution:

out = to rch . mat mu l ( we i gh t_ co l , col )

Note that we are multiplying a C

out

, K ·C

tensor

by a B, K ·C

, W

out

tensor, to obtain a B, C

out

, W

out

ten-

sor. In this case, the B dimension has been automati-

cally broadcast, without extra allocations.

In case bias is used in the convolution, it must be

added to each element of the output, i.e. a constant is

summed to all values per output channel. In this case,

bias is a tensor of shape C

out

, so we can perform the

operation by again relying on broadcasting on the ﬁrst

B and last W

out

dimension:

out += bias . u n squ e e z e (1)

Padding can be handled by prescribing a place-

holder value, e.g. −1, in the matrix of indices. The

following instruction shows an example of such a

strategy:

indices = torch.tensor([[-1, 0, 3, 4],

[ 1, 2, 4, 0],

[ 2, 3, 0, 1]])

The location can be used to set the corresponding

input to the zero padded value, though multiplication

of the input by a binary mask. Once the mask has been

computed, the placeholder can safely be replaced with

a valid index so that advanced indexing succeeds.

indices = indices.clone()

padded = indices == -1

indices[padded] = 0

mask = torch.tensor([1.0, 0.0])

mask = mask[..., padded.long()]

col = input[..., indices] * mask

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

365

4 POOLING

4.1 Pooling Operation

In deep neural networks, convolutions are often asso-

ciated with pooling layers. They allow feature maps

down-sampling thus reducing the number of network

parameters and so the time of the computation. In ad-

dition, pooling improves feature detection robustness

by achieving spatial invariance (Scherer et al., 2010).

The pooling operation can be deﬁned as:

= f (I

) (4)

where O

is the output pixel i, f a function, I

the

neighborhood of the input pixel i of a given input fea-

ture map I. The pooling function f provided on eq.

4 is applied on I

using a sliding window. f can be

of various forms, for example an average, a Softmax,

a convolution or a max. The use of a stride greater

than 2 on the sliding window translation enables to

sub-sample the data. With convolutional networks, a

max-pooling layer with stride 2 and width 3 is typi-

cally considered moving to a 2 times coarser feature

maps scale after having applied some standard con-

volution layers. This proved to reduce network over-

ﬁt while improving task accuracy (Krizhevsky et al.,

2012).

4.2 Indexed Pooling

Following the same procedure as for convolution de-

scribed in section 3, we can use the matrix of indices

to produce the column matrix of the input and apply,

in one pass, the pooling function to each column.

For instance, a PyTorch implementation of the in-

dexed pooling, in the same hypothetical case as pre-

sented in section 3, with max as the pooling function

is:

col = input[..., indices]

out = torch.max(col, 2)

5 APPLICATION EXAMPLE: THE

HEXAGONAL CASE

The indexed convolution and pooling can be applied

to any pixel organization, as soon as one provides

the list of the neighbors of each pixel. Although

the method is generic, we ﬁrst developed it to be

able to apply Deep Learning technic to the hexag-

onal grid images of the Cherenkov Telescope Ar-

ray (from NectarCam, Glicenstein et al. 2013; LST-

Cam, Ambrosi et al. 2013; and FlashCam, Pühlhofer

et al. 2012). Even if hexagonal data processing is

not usual for general public applications, several other

speciﬁc sensors make use of hexagonal sampling. The

Lytro light ﬁeld camera (Cho et al., 2013) is a con-

sumer electronic device example. Several Physics ex-

periments also make use of hexagonal grid sensors,

such as the H.E.S.S. camera (Bolmont et al., 2014)

or the XENON1T detector (Scovell, 2013). Hexag-

onal lattice is also used for medical sensors, such as

DEPFET (Neeser et al., 2000) or retina implant sys-

tem (Schwarz et al., 1999).

Moreover, hexagonal lattice is a well-known and

studied grid (Sato et al., 2002; Shima et al., 2010;

Asharindavida et al., 2012; Hoogeboom et al., 2018)

and offers advantages compared to square lattice

(Middleton et al., 2001) such as higher sampling den-

sity and a better representation of curves. In addi-

tion, some more beneﬁts have been shown by (Sousa,

2014; He and Jia, 2005; Asharindavida et al., 2012)

such as equidistant neighborhood, clearly deﬁned

connectivity, smaller quantization error.

However, processing hexagonal lattice images

with standard deep learning frameworks requires spe-

ciﬁc data manipulation and computations that need to

be optimized on CPUs as well as GPUs. This sec-

tion proposes a method to efﬁciently handle hexago-

nal data without any preprocessing as a demonstration

of the use of indexed convolutions. We ﬁrst describe

how to build the index matrix for hexagonal lattice

images needed by the indexed convolution.

For easy comparison, we want to validate our

methods on datasets with well-known use cases (e.g.

a classiﬁcation task) and performances. To our

knowledge, there is no reference hexagonal image

dataset for deep learning. So, following Hexa-

Conv paper (Hoogeboom et al., 2018) we constructed

two datasets with hexagonal images based on well-

established square pixel image datasets dedicated to

classiﬁcation tasks: CIFAR-10 and AID. This enables

our method to be compared with classical square pix-

els processing in a standardized way.

5.1 Indexing the Hexagonal Lattice and

the Neighbors’ Matrix

As described in section 3, in addition to the image

itself, one needs to feed the indexed convolution (or

pooling) with the list of the considered neighbors for

each pixel of interest, the matrix of indices. In the

case of images with a hexagonal grid, provided a

given pixel addressing system, a simple method to re-

trieve these neighbors is proposed.

Several addressing systems exist to handle images

with such lattice, among others: offset (Sousa, 2014),

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

366

ASA (Rummelt, 2010), HIP (Middleton et al., 2001),

axial - also named orthogonal or 2-axis obliques

(Asharindavida et al., 2012; Sousa, 2014). The latter

is complete, unique, convertible to and from Carte-

sian lattice and efﬁcient (He and Jia, 2005). It offers a

straightforward conversion from hexagonal to Carte-

sian grid, stretching the converted image, as shown in

ﬁgure 3, but preserving the true neighborhood of the

pixels.

Figure 3: Hexagonal to Cartesian grid conversion with the

axial addressing system.

Our method relies on the axial addressing system

to build an index matrix of hexagonal grid images.

Assuming that a hexagonal image is stored as a vector

and that we have the indices of the pixels of the vec-

tor images represented in the hexagonal grid, one can

convert it to an index matrix thanks to the axial ad-

dressing system. Then, building the list of neighbors,

the matrix of indices, consists in applying the desired

kernel represented in the axial addressing system to

the index matrix for each pixel of interest.

Figure 4: Building the matrix of indices for an image with

a hexagonal grid. The image is stored as a vector, and the

indices of the vector are represented in the hexagonal lattice.

Thanks to the axial addressing system, this representation

is converted to a rectangular matrix, the index matrix. The

neighbors of each pixel of interest (in red) are retrieved by

applying the desired kernel (here the nearest neighbors in

the hexagonal lattice, in blue) to the index matrix.

An example is proposed on ﬁg. 4, with the ker-

nel of the nearest neighbors in the hexagonal lattice.

Regarding the implementation, one has to deﬁne in

advance the kernel to use as a mask to be applied to

the index matrix, for the latter example:

kernel = [[1, 1, 0],

[1, 1, 1],

[0, 1, 1]]

5.2 Experiment on CIFAR-10

The indexed convolution method, in the special case

of hexagonal grid images, has been validated on the

CIFAR-10 dataset. For this experiment and the one

on the AID dataset (see Sec. 5.3), we compare our

results with the two baseline networks of HexaConv

paper (Hoogeboom et al., 2018). These networks do

not include group convolutions and are trained respec-

tively on square and hexagonal grid image versions

of CIFAR-10. The network trained on the hexagonal

grid CIFAR-10 consists of masked convolutions. To

allow a fair comparison, we use the same experimen-

tal conditions, except for the Deep Learning frame-

work and the square to hexagonal grid image trans-

formation of the datasets.

The CIFAR-10 dataset is composed of 60000 tiny

color images of size 32x32 with square pixels. Each

image is associated with the class of its foreground

object. This is one of the reference databases for im-

age classiﬁcation tasks in the machine learning com-

munity. By converting this square pixel database into

its hexagonal pixel counterpart, this enables to com-

pare hexagonal and square pixel processing in differ-

ent case studies for image classiﬁcation. This way,

the same network with:

• standard convolutions (square kernels),

• indexed convolutions (square kernels),

• indexed convolutions (hexagonal kernels),

has been trained and tested, respectively on the

dataset for the square kernels and its hexagonal ver-

sion for the hexagonal kernels. For reproducibility,

the experiment has been repeated 10 times with dif-

ferent weights initialization, but using the same ran-

dom seeds (i.e. same weights initialization values) for

all three implementations of the network.

5.2.1 Building a Hexagonal CIFAR-10 Dataset

The ﬁrst step is to transform the dataset in a hexago-

nal one. Compared to a rectangular grid, an hexagonal

grid has one line out of two shifted of half a pixel (see

ﬁgure 5). Square pixels (orange grid) cannot be re-

arranged directly in a hexagonal grid (blue grid). For

these shifted lines, pixels have to be interpolated from

the integer position pixels of the rectangular grid. The

interpolation chosen here is the average of the two

consecutive horizontal pixels. A fancier method could

have been to take into account all the six square pix-

els contributing to the hexagonal one, in proportion

to their involved surface. In that case, the both pix-

els retained for our interpolation method would cover

90.4% of the surface of the interpolated hexagonal

pixel.

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

367

Fig. 6 shows a conversion example, one can ob-

serve that the interpolation method is rough as one can

see on the back legs of the horse so that hexagonal

processing experiments suffer from some input image

distortion. However, our preliminary experiments did

not show strong classiﬁcation accuracy difference be-

tween such conversion and a higher quality one.

Then the images are stored as vectors and the in-

dex matrix based on the axial addressing system is

built. Before feeding the network, the images are

standardized and whitened using a PCA, following

Hoogeboom et al. 2018.

Figure 5: Resampling of rectangular grid (orange) images

to hexagonal grid one (blue). One line of two in the hexag-

onal lattice is shifted by half a pixel compared to the corre-

sponding line in the square lattice. The interpolated hexag-

onal pixel (with a green background) is the average of the

two corresponding square pixels (with red dashed borders).

Figure 6: Example of an image from CIFAR-10 dataset re-

sampled to hexagonal grid.

5.2.2 Network Model

Table 1: Number of features for all three hexagonal and

square networks used on CIFAR-10.

conv1 stage 1 stage 2 stage 3

Hexagonal

kernels

17 17 35 69

Square

kernels

15 15 31 61

The network used for this experiment is described

in section 5.1 of (Hoogeboom et al., 2018) and re-

lies on a ResNet architecture (He et al., 2015). As

shown in ﬁgure 7, it consists of a convolution, 3 stages

Figure 7: ResNet model used for the experiment on CIFAR-

10.

with 4 residual blocks each, a pooling layer and a

ﬁnal convolution. The down-sampling between two

stages is achieved by a convolution of kernel size

1x1 and stride 2. After the last stage, feature maps

are squeezed to a single pixel (1x1 feature maps) by

the use of an average pooling over the whole feature

maps. Then a ﬁnal 1x1 convolution (equivalent to a

fully connected layer) is applied to obtain the class

scores. Three networks have been implemented in Py-

Torch, one with built-in convolutions (square kernels)

and two with indexed convolutions (one with square

kernels and one with hexagonal kernels). Rectangular

grid image versions have convolution kernels of size

3x3 (9 pixels) while the one for hexagonal grid im-

ages has hexagonal convolution kernels of the nearest

neighbors (7 pixels). The number of features per layer

is set differently, as shown in table 1, depending on

the network so that the total number of parameters of

all three networks are close, ensuring the comparison

to be fair. These networks have been trained with the

stochastic gradient descent as optimizer with a mo-

mentum of 0.9, a weight decay of 0.001 and with a

learning rate of 0.05 decayed by 0.1 at epoch 50, 100

and 150 for a total of 300 epochs.

5.2.3 Results

As shown in table 2, all three networks with hexag-

onal indexed convolutions, square indexed convolu-

tions and square standard convolutions exhibit simi-

lar performances on the CIFAR-10 dataset. The dif-

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

368

ference between the hexagonal kernel and the square

kernel with standard convolution on the one hand and

between both square kernel is not signiﬁcant, accord-

ing to the Student T test. For the same number of

parameters, the hexagonal kernel model gives slightly

better accuracy than the square kernel one in the con-

text of indexed convolution, even if the images have

been roughly interpolated for hexagonal image pro-

cessing. However, to satisfy this equivalence in the

number of parameters, since hexagonal convolutions

involve fewer neighbors than the squared counterpart,

some more neurons are added all along the network

architecture. This leads to a larger number of data

representations that are combined to achieve the task.

One can then say that Hexagonal convolution pro-

vides richer features for the same price in the param-

eters count. This may also compensate for the im-

age distortions introduced when converting input im-

ages to hexagonal sampling. Such distortions actually

sat Hexagonal processing in an unfavourable initial

state but the hexagonal processing compensated and

slightly outperformed the standard approach.

Hoogeboom et al. (2018) carried out a similar ex-

periment and observed the same accuracy difference

between hexagonal and square convolutions process-

ing despite a shift in the absolute accuracy values

(88.75 for hexagonal images, 88.5 for square ones)

that can be explained by different image interpolation

methods, different weights initialization and the use

of different frameworks.

Table 2: Accuracy results for all three hexagonal and square

networks on CIFAR-10. i. c. stands for indexed convolu-

tions.

Hexagonal

kernels (i.c.)

Square

kernels (i.c.)

Square

kernels

88.51 ± 0.21 88.27 ± 0.23 88.39 ± 0.48

5.3 Experiment on AID

Similar to the experiment on CIFAR-10, the indexed

convolution has been validated on Aerial Images

Dataset (AID) (Xia et al., 2016). The AID dataset

consists of 10000 RGB images of size 600x600

within 30 classes of aerial scene type. Similar to

section 5.2, the same network with standard convo-

lutions (square kernels) and then with indexed convo-

lutions (square kernels and hexagonal kernels) have

been trained and tested, respectively on the dataset

for the square kernels and its hexagonal version for

the hexagonal kernels. The experiment has also been

repeated ten times, but with the same network ini-

tialization and different random split between train-

ing set and validating set, following Hoogeboom et al.

(2018).

5.3.1 Building a Hexagonal AID Dataset

After resizing the images to 64x64 pixels, the dataset

is transformed to a hexagonal one, as shown ﬁg. 8, in

the same way as in section 5.2.1. Then the images are

standardized.

Figure 8: Example of an image from AID dataset resized

to 64x64 pixels and resampled to hexagonal grid.

5.3.2 Network

Table 3: Number of features for all three hexagonal and

square networks used on AID.

conv1 stage 1 stage 2 stage 3

Hexagonal

kernels

42 42 83 166

Square

kernels

37 37 74 146

The network considered in this experiment is still

a ResNet architecture but adapted to this speciﬁc

dataset. One follows the setup proposed in section 5.2

of Hoogeboom et al. 2018. Three networks have been

implemented and trained in the same way described

in section 5.2.2, with the number of features per layer

described in table 3.

5.3.3 Results

As shown in table 4, all three networks with hexago-

nal convolutions and square convolutions do not ex-

hibit a signiﬁcant difference in performances on the

AID dataset. Again, no accuracy loss is observed in

the hexagonal processing case study despite the rough

image re-sampling.

However, unlike on the CIFAR-10 experiment, we

don’t observe a better accuracy of the model with

hexagonal kernels, as emphasized in (Hoogeboom

et al., 2018).

Table 4: Accuracy results for all three hexagonal and square

networks on AID. i. c. stands for indexed convolutions.

Hexagonal

kernels (i.c.)

Square

kernels (i.c.)

Square

kernels

79.81 ± 0.73 79.88 ± 0.82 79.85 ± 0.50

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

369

6 COMMENTS/DISCUSSION

This paper introduces indexed convolution and pool-

ing operators for images presenting pixels arranged

in non-Cartesian lattices. These operators have been

validated on standard images as well as on the special

case of hexagonal lattice images, exhibiting similar

performances as standard convolutions and therefore

showing that the indexed convolution works as ex-

pected. However, the indexed method is much more

general and can be applied to any grid of data, en-

abling unconventional image representation to be ad-

dressed without any pre-processing. This differs from

other approaches such as image re-sampling com-

bined with masked convolutions (Hoogeboom et al.,

2018) or oversampling to square lattice (Holch et al.,

2017) that actually require additional pre-processing.

Moreover, both methods increase the size of the trans-

formed image (adding useless pixels of padding value

for the resampled image to be rectangular and / or

multiplying the number of existing pixels) and are re-

stricted to regular grids. On the other hand, they make

use of out the box operators already available in cur-

rent deep learning frameworks.

The approach proposed in this paper is not lim-

ited to hexagonal lattice and only needs the index

matrices to be built prior the training and inference

processes, one for each convolution of different in-

put size. No additional pre-processing of the image

is then required to apply convolution and pooling ker-

nels. However, the current implementation in Python

shows a decrease in computing performances com-

pared to the convolution method implemented in Py-

torch. We have observed an increase of RAM usage of

factors varying between 1 and 3 and training times of

factors varying between 4 and 8 on GPU (depending

on the GPU model), of factor 1.5 on CPU (but slightly

faster than masked convolutions on CPU) depending

on the network used.

These drawbacks are actually related to the use of

un-optimized codes and work is carried out to ﬁx this

by the use of optimized CUDA and C++ implementa-

tions.

As a future work, we will use the indexed oper-

ations for the analysis of hexagonal grid images of

CTA. We also plan to experiment with arbitrary ker-

nels, which are another beneﬁt of the indexed opera-

tions, for the convolution (e.g. retina like kernel with

more density in the center, see the example in the

github repository).

ACKNOWLEDGEMENTS

This project has received funding from the European

Union’s Horizon 2020 research and innovation pro-

gram under grant agreement No 653477.

This work has been done thanks to the facilities of-

fered by the Université Savoie Mont Blanc MUST

computing center.

REFERENCES

Ambrosi, G., Awane, Y., Baba, H., et al. (2013). The

cherenkov telescope array large size telescope. arXiv

preprint arXiv:1307.4565.

Asharindavida, F., Hundewale, N., and Aljahdali, S. (2012).

Study on hexagonal grid in image processing. 45:282–

288.

Bolmont, J., Corona, P., Gauron, P., et al. (2014). The

camera of the ﬁfth h.e.s.s. telescope. part i: System

description. Nuclear Instruments and Methods in

Physics Research Section A: Accelerators, Spectrom-

eters, Detectors and Associated Equipment, 761:46 –

57.

Cho, D., Lee, M., Kim, S., and Tai, Y.-W. (2013). Modeling

the calibration pipeline of the lytro camera for high

quality light-ﬁeld image reconstruction. In Proceed-

ings of the IEEE International Conference on Com-

puter Vision, pages 3280–3287.

Glicenstein, J., Barcelo, M., Barrio, J., et al. (2013). The

NectarCAM camera project. ArXiv e-prints.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep

residual learning for image recognition. CoRR,

abs/1512.03385.

He, X. and Jia, W. (2005). Hexagonal structure for intelli-

gent vision. In Information and Communication Tech-

nologies, 2005. ICICT 2005. First International Con-

ference on, pages 52–64. IEEE.

Holch, T. L., Shilon, I., Büchele, M., et al. (2017). Probing

convolutional neural networks for event reconstruc-

tion in gamma-ray astronomy with cherenkov tele-

scopes. arXiv preprint arXiv:1711.06298.

Holschneider, M., Kronland-Martinet, R., Morlet, J., and

Tchamitchian, P. (1990). A real-time algorithm for

signal analysis with the help of the wavelet transform.

In Combes, J.-M., Grossmann, A., and Tchamitchian,

P., editors, Wavelets, pages 286–297, Berlin, Heidel-

berg. Springer Berlin Heidelberg.

Hoogeboom, E., Peters, J. W., Cohen, T. S., and Welling,

M. (2018). Hexaconv. In International Conference on

Learning Representations.

Katz, U. (2012). A neutrino telescope deep in the Mediter-

ranean Sea. CERN Cour., 52N6:31–33.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105.

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

370

Middleton, L., Sivaswamy, J., and Coghill, G. (2001). The

fft in a hexagonalimage processing framework. New

Zealand Conference on Image and Vision Computing

(IVCNZ 2001), (January):231–236.

Neeser, W., Bocker, M., Buchholz, P., et al. (2000). The

depfet pixel bioscope. IEEE Transactions on Nuclear

Science, 47(3):1246–1250.

Pühlhofer, G., Bauer, C., Biland, A., et al. (2012). Flash-

Cam: A fully digital camera for CTA telescopes. In

Aharonian, F. A., Hofmann, W., and Rieger, F. M., ed-

itors, American Institute of Physics Conference Series,

volume 1505 of American Institute of Physics Confer-

ence Series, pages 777–780.

Rummelt, N. I. (2010). Array set addressing: enabling efﬁ-

cient hexagonally sampled image processing. Univer-

sity of Florida.

Sato, H., Matsuoka, H., Onozawa, A., and Kitazawa, H.

(2002). Hexagonal image representation for 3-d pho-

torealistic reconstruction. In Pattern Recognition,

2002. Proceedings. 16th International Conference on,

volume 2, pages 672–676. IEEE.

Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation

of pooling operations in convolutional architectures

for object recognition. In Artiﬁcial Neural Networks–

ICANN 2010, pages 92–101. Springer.

Schwarz, M., Hauschild, R., Hosticka, B. J., et al. (1999).

Single-chip cmos image sensors for a retina implant

system. IEEE Transactions on Circuits and Systems

II: Analog and Digital Signal Processing, 46(7):870–

877.

Scovell, P. (2013). The Xenon 100 Detector. In Cline, D.,

editor, Springer Proceedings in Physics, volume 148

of Springer Proceedings in Physics, page 87.

Shima, T., Sugimoto, S., and Okutomi, M. (2010). Com-

parison of image alignment on hexagonal and square

lattices. In Image Processing (ICIP), 2010 17th IEEE

International Conference on, pages 141–144. IEEE.

Sousa, N. A. (2014). Hexagonal grid image processing al-

gorithms in neural vision and prediction. pages 1–19.

van de Geijn, R. and Goto, K. (2011). BLAS (Basic Linear

Algebra Subprograms), pages 157–164. Springer US,

Boston, MA.

Xia, G., Hu, J., Hu, F., et al. (2016). AID: A benchmark

dataset for performance evaluation of aerial scene

classiﬁcation. CoRR, abs/1608.05167.

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

371