FisheyeSuperPoint: Keypoint Detection and Description Network for

Fisheye Images

Anna Konrad

1,2

, Ciar

an Eising

, Ganesh Sistu

, John McDonald

, Rudi Villing

and Senthil Yogamani

Hamilton Institute, Maynooth University, Ireland

Department of Electronic Engineering, Maynooth University, Ireland

Department of Electronic & Computer Engineering, University of Limerick, Ireland

Valeo Vision Systems, Galway, Ireland

Department of Computer Science, Maynooth University, Ireland

Keywords:

Keypoints, Interest Points, Feature Detection, Feature Description, Fisheye Images, Deep Learning.

Abstract:

Keypoint detection and description is a commonly used building block in computer vision systems particularly

for robotics and autonomous driving. However, the majority of techniques to date have focused on standard

cameras with little consideration given to ﬁsheye cameras which are commonly used in urban driving and

automated parking. In this paper, we propose a novel training and evaluation pipeline for ﬁsheye images.

We make use of SuperPoint as our baseline which is a self-supervised keypoint detector and descriptor that

has achieved state-of-the-art results on homography estimation. We introduce a ﬁsheye adaptation pipeline

to enable training on undistorted ﬁsheye images. We evaluate the performance on the HPatches benchmark,

and, by introducing a ﬁsheye based evaluation method for detection repeatability and descriptor matching

correctness, on the Oxford RobotCar dataset.

1 INTRODUCTION

Keypoint detection and description is a fundamental

step in computer vision for image registration (Ma

et al., 2021). It has a wide range of applications in-

cluding 3D reconstruction, object tracking, video sta-

bilization and SLAM. Approaches designed to be in-

variant to changes in scale, illumination, perspective,

etc. have been extensively studied in the computer vi-

sion literature. However, despite their prevalence in

automotive and robotic systems, few approaches have

explicitly considered ﬁsheye images that pose an ad-

ditional challenge of spatially variant distortion. Thus

a patch in the centre of an image looks different com-

pared to a region in the periphery of the image where

the radial distortion is much higher. Fisheye cameras

are a fundamental sensor in autonomous driving nec-

essary to cover the near ﬁeld around the vehicle (Yo-

gamani et al., 2019). Four ﬁsheye cameras on each

side of the vehicle can cover the entire 360

◦

ﬁeld of

view.

A common approach to use ﬁsheye images for

tasks in computer vision is to ﬁrst rectify the image

and then apply algorithms suited for standard images

(Lo et al., 2018), (Esparza et al., 2014). A principal

drawback of such methods is that the ﬁeld of view

is reduced and resampling artifacts can be introduced

into the periphery of the image. An alternative ap-

proach, as demonstrated in (Ravi Kumar et al., 2020),

is to work directly in the ﬁsheye image space and

thereby avoid such issues. Recently, there has been

progress with such approaches for various visual per-

ception tasks such as dense matching (H

ane et al.,

2014), object detection (Rashed et al., 2021), depth

estimation (Kumar et al., 2018), re-localisation (Tri-

pathi and Yogamani, 2021), soiling detection (Uric

et al., 2019) and people detection (Duan et al., 2020).

Feature detectors and descriptors can describe cor-

ners (also called interest points or keypoints), edges or

morphological region features. Traditionally, feature

detection and description has been done with hand-

crafted algorithms (Ma et al., 2021). Some of the

most well-known algorithms include Harris (Harris

and Stephens, 1998), FAST (Rosten and Drummond,

2006) and SIFT (Lowe, 2004). Thorough reviews

of traditional and modern techniques have been con-

ducted in various surveys (Ma et al., 2021), (Mikola-

jczyk and Schmid, 2005), (Li et al., 2015).

340

Konrad, A., Eising, C., Sistu, G., McDonald, J., Villing, R. and Yogamani, S.

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images.

DOI: 10.5220/0010795400003124

In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages

340-347

ISBN: 978-989-758-555-5; ISSN: 2184-4321

Recently, several CNN based feature correspon-

dence techniques have been explored which outper-

form classical features. For example, a universal cor-

respondence network in (Sarlin et al., 2016) demon-

strates state-of-the-art results on various datasets by

making use of a spatial transformer to normalise for

afﬁne transformations. This is an example of feature

correspondence learning independent of the applica-

tion in which it is used. It is an open problem to learn

feature correspondence which is optimal for the later

stages in the perception pipeline e.g. bundle adjust-

ment. For instance, end-to-end learning of feature

matching could possibly learn diversity and distribu-

tion rather than focusing solely on measures such as

distinctiveness and repeatability. This is particularly

useful for training on ﬁsheye images which have a do-

main gap compared to regular images. In addition,

learning based detectors permit the encoder backend

to be merged into a multi-task model. Such multi-task

models share the same encoder, which can improve

performance and reduce latency for all tasks (Leang

et al., 2020), (Ravikumar et al., 2021).

SuperPoint is a self-supervised CNN framework

for keypoint detection and description (DeTone et al.,

2018). It consists of one encoder and two different

decoders for the detection and the description output.

It is pretrained as an corner detector via a syntheti-

cally generated dataset containing basic shapes like

rectangles, lines, stars, etc. The resulting corner de-

tector shows inconsistent keypoint detections on the

same scene for varying camera viewpoints. To im-

prove on the detection consistency for varying cam-

era viewpoints, homographic adaptation is used to

generate a superset of keypoints from random homo-

graphic warpings on the MS-COCO training dataset

(Lin et al., 2014). The network is then trained on

those keypoints and the whole process is repeated for

several iterations. Recently, it has achieved state-of-

the-art results in several benchmarks in combination

with the SuperGlue matcher (Sarlin et al., 2020).

We adapt the SuperPoint feature detector and de-

scriptor so it can be trained and evaluated on ﬁsheye

images. The contributions of this paper include:

• Random ﬁsheye warping and unwarping in place

of homographic transforms.

• Implementation of ﬁsheye adaption for self-

supervised training of the SuperPoint network.

• Evaluation of the repeatability of detectors and

comparison of their performance to FisheyeSu-

perPoint on ﬁsheye and standard images.

• Evaluation of the matching correctness of de-

scriptors and comparison of their performance to

FisheyeSuperPoint on ﬁsheye & standard images.

Figure 1: Example imagery of the Oxford RobotCar Dataset

(Maddern et al., 2017), showing different weather and light-

ing conditions.

2 METHOD

We propose FisheyeSuperPoint, a SuperPoint net-

work that has been trained self-supervised on ﬁsheye

images. To enable this, the homographic adaptation

step in the training pipeline has been exchanged with

our ﬁsheye adaptation process to cope with the non-

linear mapping of ﬁsheye images.

2.1 Datasets Used

The data used for training of the FisheyeSuperPoint

network is a subset of the Oxford RobotCar Dataset

(Maddern et al., 2017) (RobotCar). The RobotCar

dataset contains ﬁsheye images, stereo images, LI-

DAR sensor readings and GPS from a vehicle across

various seasons and weather conditions. Some exam-

ples from the ﬁsheye images are shown in Figure 1.

The data is made available in subsets for each drive.

The subsets contain data from three different ﬁsheye

cameras which were mounted on the left, right and

rear of the vehicle. To generate a representative sub-

set of ﬁsheye images, image sequences from each of

seven different weather conditions (sun, clouds, over-

cast, rain, snow, night, dusk) were used with 840k im-

ages in total. To reduce the total number of images

and duplicates in the resulting dataset, we sampled

every 10th frame from the sequences.

The resulting training dataset contained 84k

1024x1024 ﬁsheye images. During training, the im-

ages were downsampled to 256x256 images to match

the resizing of MS-COCO (Lin et al., 2014) that was

used for the training of SuperPoint (DeTone et al.,

2018).

2.2 Fisheye Warping and Unwarping

In order to train FisheyeSuperPoint on ﬁsheye im-

ages, a substitute for the homographic warping in Su-

perPoint is needed. The pinhole camera model is

the standard projection function for much research

in computer vision, as in SuperPoint (DeTone et al.,

2018). The pinhole projection function is given as

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

341

Figure 2: Results of the Fisheye Warping. a) the original image, b) the warped image, c) the unwarped image and d) the

difference between the original and unwarped image. c) is the result of applying the inverse of the ﬁsheye warping to b). The

structural differences in d) are regions which were outside of the warped image b). The remaining differences in d) are due to

the bilinear interpolation used for warping and unwarping the image.

p =



f X



, where X = (X ,Y, Z)

is a point in

the camera coordinate system, and f is the nominal

focal length of the pinhole camera.

Fisheye functions provide a nonlinear mapping

from the camera coordinate system (e.g. Figure 2).

We can deﬁne a mapping from R

to the ﬁsheye im-

age as

π : R

→ I

A true inverse is naturally not possible, as all

depth information is lost in the formation of the im-

age. However, we can deﬁne an unprojection map-

ping from the ﬁsheye image domain to the unit central

projective sphere:

−1

: I

→ S

Unfortunately, the Oxford RobotCar Dataset does

not provide details of the ﬁsheye model they use, nor

its parameters. However, they provide a look-up-table

that can map from a distorted to an undistorted im-

age. We use this look-up-table and ﬁt the fourth order

polynomial model p(θ) described below.

In principle, it does not matter exactly which ﬁsh-

eye mapping function is used, as long as it provides

a reasonably accurate model of the image transforma-

tion. In our case, we use a radial polynomial function

for π, as per (Yogamani et al., 2019):

π(X) =

p(θ)





, d =

p(θ) = a

θ + a

+ . . . + a

θ = arccos



√

+ Z



(1)

where p(θ) is a polynomial of order n, with n = 4

typically sufﬁcient.

In SuperPoint (DeTone et al., 2018), random ho-

mographies are used to simulate multiple camera

viewpoints. In order to train a SuperPoint network

with ﬁsheye images, the homographic warping needs

to be replaced with an equivalent transform that is ap-

plicable to ﬁsheye imagery. Using the ﬁsheye func-

tions (π and π

−1

) described above, we can consider

the following steps to generate a new ﬁsheye warped

image:

1. Each point is projected from the image I

to a unit

sphere S

using π

−1

2. A new virtual camera position is selected by a ran-

dom rotation R and translation t (six degrees of

freedom Euclidean transform)

3. Each point on the unit sphere is reprojected to this

new, virtual camera position, by applying π(RX +

This results in a mapping from I

→ I

, where

represents the new image with the random Eu-

clidean transform applied. We call this mapping ﬁsh-

eye warping F and ﬁsheye unwarping F

−1

F (I

) = π(Rπ

−1

) + t) (2)

In practice, to avoid sparsity in the new image,

each pixel on the warped image is inverse transformed

to the corresponding sub-pixel location on the original

image and sampled using bilinear interpolation. Ad-

ditionally, as the ﬁsheye unprojection function π

−1

computationally costly due to the requirement for a

polynomial root solver, a set of 2000 look-up-tables

is pre-computed for the image warpings. For each

look-up-table, we sample the rotation and translation

along each axis uniformly at random from the inter-

val U

∈ [−30

◦

, 30

◦

] and U

∈ [−0.3, 0.3] relative to

the unit sphere S

, respectively. In order to invert

this warping, as required for unwarping the detected

point responses, the inverse of the steps 1 - 3 above

are followed. An example of the resulting images is

shown in Figure 2. The original image a) is warped to

b) and unwarped to c) using the pre-computed look-

up-tables. The difference d) between the original a)

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

342

Figure 3: Self-supervised ﬁsheye Keypoint Detection and Description training framework adapted from SuperPoint frame-

work (DeTone et al., 2018). Random ﬁsheye warpings are applied to a single image. Point responses are received from the

base detector when applying it to the warped images. The point responses are then unwarped and accumulated to get an

aggregated keypoint superset of the original image.

and unwarped image c) shows no structural differ-

ences apart from the regions which were outside of

the warped image b) and therefore remain black in the

unwarped image. The other differences are due to the

bilinear interpolation used for warping and unwarping

the image.

2.3 Fisheye Adaptation

The ﬁsheye warping is incorporated into the Super-

Point training pipeline as shown in Figure 3. The aim

of the adaptation process is to provide consistent key-

point responses on the same scene under varying cam-

era viewpoints. A single ﬁsheye image I

and a base

keypoint detector f

are input to the ﬁsheye adapta-

tion process. Within that process, a random ﬁsheye

warping F is sampled and applied to the image with

bilinear interpolation resulting in a warped ﬁsheye

image I

. The base detector is used to detect point

responses x

, which are subsequently unwarped back

to the original image space by applying the inverse

ﬁsheye warping F

−1

The adaptation process is repeated for N

= 100

random ﬁsheye warpings and the resulting point re-

sponses are accumulated. The point responses from

different ﬁsheye warpings are expected to differ ini-

tially. By accumulating the point responses into a su-

perset of detected keypoints, a new ground-truth for

keypoints in the original image is generated. Subse-

quently, the base detector is trained on this superset

to generate a superior, more consistent keypoint de-

tector. The full process can be repeated iteratively in

order to enhance performance and consistency of the

resulting model. The descriptor decoder is learned in

a semi-dense manner in the last iteration of the model

enhancement, as described in (DeTone et al., 2018).

3 RESULTS

FisheyeSuperPoint is developed on top of a train-

able tensorﬂow implementation (Pautrat and Sarlin,

2021) based on SuperPoint (DeTone et al., 2018). To

train FisheyeSuperPoint, we use a magic-point net-

work (Pautrat and Sarlin, 2021) trained on MS-COCO

as a base detector, where we apply ﬁsheye adaptation

to the RobotCar dataset and train FisheyeSuperPoint

with 600,000 iterations. To train SuperPoint, we use

the same pretrained network, where we apply homo-

graphic adaptation to the MS-COCO dataset and train

SuperPoint with 600,000 iterations.

The training including two ﬁsheye/homographic

adaptation iterations is executed on two Nvidia GTX

1080Ti GPUs and takes approximately one week to

complete. The duration of the training with ﬁsh-

eye adaptation is similar to the training with homo-

graphic adaptation in SuperPoint. While the ﬁsheye

adaptation is a more complex procedure, the use of

pre-calculated look-up-tables for the ﬁsheye warping

means it is computationally efﬁcient. Note that the

same testing data is used for all models in the experi-

ments.

3.1 Benchmark Setup

HPatches: The performance of FisheyeSuperPoint

and SuperPoint in comparison to traditional corner

detection techniques is evaluated by the repeatabil-

ity of detections and the homography estimation cor-

rectness on the HPatches benchmark (Balntas et al.,

2017). It contains multiple images of planar objects

from varying camera viewpoints or with different il-

luminations. As the ground truth homography is pro-

vided for each corresponding image pair, detections

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

343

on an image pair can be warped and their consis-

tency can be compared. The detection repeatability

of FisheyeSuperPoint and SuperPoint is compared to

the detection algorithms FAST (Rosten and Drum-

mond, 2006), Harris (Harris and Stephens, 1998) and

Shi (Shi and Tomasi, 1994). The homography estima-

tion correctness of FisheyeSuperPoint and SuperPoint

is compared to SIFT (Lowe, 2004) and ORB (Rublee

et al., 2011). The evaluation methodology is the same

as described in (DeTone et al., 2018), where the ho-

mography is estimated with OpenCV based on nearest

neighbour matching on the descriptors.

The estimated homography is compared to the

ground truth homography of HPatches by transform-

ing four corner points c

of the image with both ho-

mographies, resulting in c

and ˆc

with j = 4. As

per (DeTone et al., 2018), the homography estimation

correctness is then calculated based on the distance

between the corner points by

∑

j=1

||c

− ˆc

|| ≤ ε

and averaged over all n = 295 test images.

Fisheye Oxford RobotCar: In order to assess the

performance of the trained networks on ﬁsheye im-

ages, we evaluate keypoint detection repeatability and

matching correctness using a ﬁsheye test set. Unfortu-

nately, there currently is no ﬁsheye image equivalent

to the HPatches benchmark available that contains

precise ground truth viewpoint relations. Therefore,

to evaluate performance on ﬁsheye images, we gener-

ate an artiﬁcial dataset based on a test set of RobotCar

(Maddern et al., 2017). The test set was generated

from approximately 51k images across ﬁve different

weather conditions and image sequences that had not

been used in the training set for FisheyeSuperPoint.

To increase diversity in the test set, one out of every

172 frames was sampled resulting in a base test set of

300 images.

From the base test set, 300 illumination change

test images are created by applying gamma correc-

tion, with gamma for each image drawn randomly

from a uniform distribution γ ∈ [0.1, 2]. Indepen-

dently, each image of the base test set is warped to

create a viewpoint change test image using a ﬁsheye

warping F from a set of 300 random ﬁsheye warpings

not previously used for training.

By applying the same keypoint detector to corre-

sponding images we generate two different point re-

sponses:

x = f

), x

= f

)

In the case of viewpoint changes, point responses that

lie outside the overlapping region of both images are

ﬁltered by applying boolean masks. The masks are

generated by warping all-ones matrices of image size

with F and F

−1

for x

and x, respectively. We ap-

ply F (x) to warp the point response x into the image

space of x

and calculate the detector repeatability as

described in (DeTone et al., 2018).

In order to evaluate the descriptor matching cor-

rectness M

on RobotCar, nearest neighbour match-

ing is performed on the descriptors, resulting in a

set of matches M. The keypoints x

of the resulting

matches in I

are warped using F (x

) and the eu-

clidean distance d() to their corresponding keypoint,

, is calculated. The inliers G are calculated as

G = {x

, x

∈R

: d(x

, x

) < ε}. The threshold with

ε = 3px is set to the same distance as for the homog-

raphy estimation correctness. The matching correct-

ness M

is calculated as the ratio of inliers to matches

|G|

|M|

, where |G| denotes the number of elements

in the set G. We average the descriptor matching cor-

rectness M

over all n = 300 test images for one given

model.

In addition, we report the root-mean-square error

(RMSE) of the distance in all i = n ×k matches by

RMSE =

+ d

+ ···+ d

(3)

for one given model. The number of keypoint detec-

tions k = 300 and the number of test images n = 300

is kept constant for all experiments.

3.2 Comparison

A comparison of the detection repeatability on

FisheyeSuperPoint, SuperPoint, as well as FAST

(Rosten and Drummond, 2006), Harris (Harris and

Stephens, 1998) and Shi (Shi and Tomasi, 1994) is

shown in Table 1. The default OpenCV implementa-

tion is used for FAST, Harris and Shi. We apply Non-

Maximum Suppression (NMS) on a square mask with

size of 4px, 8px to the keypoint detections. The num-

ber of detected points k = 300 and correct distance

threshold ε = 3 stays constant. Images in HPatches

are resized to 240 ×320 and RobotCar images are re-

sized to 256 ×256.

FisheyeSuperPoint outperforms the other detec-

tors for viewpoint and illumination changes in Robot-

Car when a high NMS is applied. While the tradi-

tional Harris detector outperforms FisheyeSuperPoint

with a NMS = 8 on viewpoint changes, it ranks sec-

ond and is superior to SuperPoint, FAST and Shi.

For the detection repeatability on the HPatches

benchmark, FisheyeSuperPoint outperforms the clas-

sical detectors on the scenes with illumination

changes. For the scenes with viewpoint changes, it

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

344

Table 1: Detector Repeatability on HPatches and RobotCar.

Algorithm Illumination Changes Viewpoint Changes

NMS=4 NMS=8 NMS=4 NMS=8

HPatches

FisheyeSuperPoint 0.664 0.631 0.678 0.626

SuperPoint (DeTone et al., 2018) 0.663 0.622 0.672 0.610

FAST (Rosten and Drummond, 2006) 0.576 0.493 0.598 0.492

Harris (Harris and Stephens, 1998) 0.630 0.590 0.725 0.612

Shi (Shi and Tomasi, 1994) 0.584 0.515 0.613 0.523

RobotCar

FisheyeSuperPoint 0.896 0.876 0.768 0.716

SuperPoint (DeTone et al., 2018) 0.897 0.869 0.754 0.708

FAST (Rosten and Drummond, 2006) 0.837 0.751 0.724 0.569

Harris (Harris and Stephens, 1998) 0.876 0.841 0.827 0.693

Shi (Shi and Tomasi, 1994) 0.831 0.769 0.709 0.604

Table 2: Homography correctness on HPatches, descriptor matching correctness and RMSE on RobotCar.

Algorithm HPatches H

RobotCar M

RobotCar RMSE

FisheyeSuperPoint 0.712 0.862 38.4

SuperPoint (DeTone et al., 2018) 0.668 0.859 36.3

SIFT (Lowe, 2004) 0.766 0.663 120.7

ORB (Rublee et al., 2011) 0.414 0.463 136.9

outperforms the other detectors when a higher NMS

is applied. FisheyeSuperPoint and SuperPoint consis-

tently outperform FAST and Shi. While the repeata-

bility values for all detectors are lower in (DeTone

et al., 2018) the ranking of the detectors is consistent.

The results match the values reported by (Pautrat and

Sarlin, 2021).

The results of the homography and matching cor-

rectness on HPatches and RobotCar are shown in Ta-

ble 2. In addition to FisheyeSuperPoint and Super-

Point, we compare the performance to SIFT (Lowe,

2004) and ORB (Rublee et al., 2011) which are imple-

mented using OpenCV. NMS = 8 and ε = 3 is applied

for all experiments. The number of detected points is

set to k = 1000 for HPatches and k = 300 for Robot-

Car. The HPatches images are resized to 480 ×640

and the RobotCar images to 512 ×512.

Both FisheyeSuperPoint and SuperPoint show a

superior performance compared to SIFT and ORB

when used for descriptor matching on the RobotCar

test data. This is particularly evident when taking

into account the RMSE results as per 3, which in-

dicate a high number of outliers for SIFT and Orb

with RMSE > 100. The matching performance on

RobotCar is also shown in Figure 4, where matches

with a euclidean distance of d ≤ 3px are indicated

with green lines. SIFT outperforms the other algo-

rithms for the homography correctness on HPatches.

FisheyeSuperPoint ranks second and is superior to

SuperPoint and ORB.

4 CONCLUSION

This work describes the new FisheyeSuperPoint key-

point detection and description network which uses

a pipeline to train and evaluate it directly on ﬁsheye

image datasets. To enable the self-supervised training

on ﬁsheye images, ﬁsheye warping is utilised. The

ﬁsheye image is mapped to a new, warped ﬁsheye im-

age through the intermediate step of projection to a

unit sphere, with the camera’s virtual pose being var-

ied in six degrees of freedom. This process is embed-

ded in an existing SuperPoint implementation (Pautrat

and Sarlin, 2021) and trained on the RobotCar dataset

(Maddern et al., 2017).

In order to compare the performance of Fisheye-

SuperPoint to other detectors, we introduce a method

to evaluate keypoint detection repeatability and

matching correctness on ﬁsheye images. Fisheye-

SuperPoint consistently outperforms SuperPoint for

the experiments on standard images (HPatches), es-

pecially in terms of homography correctness. This

might be due to more variations in the RobotCar train-

ing data. Both FisheyeSuperPoint and SuperPoint

perform similarly in our ﬁsheye evaluations. This

was unexpected and hints towards the robustness of

the SuperPoint network, suggesting that it could be

used for keypoint detection and description on ﬁsheye

images directly. Further evaluations on non-artiﬁcial

data for descriptor matching correctness could pro-

vide a better insight into the performance of both net-

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

345

Figure 4: Qualitative results of feature matching on RobotCar images with k = 300 detected points. Nearest neighbour

matches with a distance d ≤ 3px are shown in green. From left to right: FisheyeSuperPoint, SuperPoint (DeTone et al., 2018),

SIFT (Lowe, 2004), ORB (Rublee et al., 2011).

works.

While Harris and SIFT achieve higher repeatabil-

ities and homography correctness than FisheyeSuper-

Point in a few cases, our method comes with several

advantages. One of those advantages is the adapt-

ability of the network structure, which could be en-

hance with alternatives such as deformable convolu-

tional layers (Dai et al., 2017). The adaptive man-

ner of deformable convolutional layers could help in-

crease detection and description performance under

the inﬂuence of the radial distortion in ﬁsheye images.

Another opportunity is to incorporate FisheyeSu-

perPoint into multi-task visual perception networks

like Omnidet (Ravikumar et al., 2021). Multi-task

networks can present advantages in computational

complexity and performance by sharing base layers

of a network, which will be enhanced with Fisheye-

SuperPoint.

ACKNOWLEDGEMENTS

This publication has emanated from research con-

ducted with the ﬁnancial support of Science Founda-

tion Ireland (SFI) under Grant number 18/CRT/6049

and 16/RI/3399. The opinions, ﬁndings and conclu-

sions or recommendations expressed in this material

are those of the author(s) and do not necessarily re-

ﬂect the views of the Science Foundation Ireland.

REFERENCES

Balntas, V., Lenc, K., Vedaldi, A., and Mikolajczyk, K.

(2017). HPatches: A benchmark and evaluation of

handcrafted and learned local descriptors. In Proceed-

ings of CVPR, pages 5173–5182.

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and

Wei, Y. (2017). Deformable convolutional networks.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

346

In Proceedings of the IEEE International Conference

on Computer Vision (ICCV).

DeTone, D., Malisiewicz, T., and Rabinovich, A. (2018).

Superpoint: Self-supervised interest point detection

and description. In Proceedings of the CVPR Work-

shops (CVPR-W), pages 224–236.

Duan, Z., Tezcan, O., Nakamura, H., Ishwar, P., and Kon-

rad, J. (2020). Rapid: rotation-aware people detec-

tion in overhead ﬁsheye images. In Proceedings of

the CVPR Workshops, pages 636–637.

Esparza, J., Helmle, M., and J

ahne, B. (2014). Wide base

stereo with ﬁsheye optics: A robust approach for 3d

reconstruction in driving assistance. In Jiang, X.,

Hornegger, J., and Koch, R., editors, Pattern Recog-

nition, pages 342–353, Cham. Springer International

Publishing.

Harris, C. and Stephens, M. (1998). A combined corner

and edge detector. In Alvey vision conference, pages

147–151.

ane, C., Heng, L., Lee, G. H., Sizov, A., and Pollefeys,

M. (2014). Real-time direct dense matching on ﬁsh-

eye images using plane-sweeping stereo. In 2014 2nd

International Conference on 3D Vision, volume 1.

Kumar, V. R., Milz, S., Witt, C., Simon, M., Amende,

K., Petzold, J., Yogamani, S., and Pech, T. (2018).

Near-ﬁeld depth estimation using monocular ﬁsheye

camera: A semi-supervised learning approach using

sparse lidar data. In Proceedings of the CVPR Work-

shops (CVPR-W), volume 7.

Leang, I., Sistu, G., B

urger, F., Bursuc, A., and Yoga-

mani, S. (2020). Dynamic task weighting methods

for multi-task networks in autonomous driving sys-

tems. In 2020 IEEE 23rd International Conference on

Intelligent Transportation Systems (ITSC), pages 1–8.

IEEE.

Li, Y., Wang, S., Tian, Q., and Ding, X. (2015). A survey

of recent advances in visual feature detection. Neuro-

computing, 149.

Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick,

R. B., Hays, J., Perona, P., Ramanan, D., Doll

ar, P.,

and Zitnick, C. L. (2014). Microsoft COCO: common

objects in context. In Proceedings of the European

Conference on Computer Vision (ECCV), pages 740–

755.

Lo, I., Shih, K., and Chen, H. H. (2018). Image stitching

for dual ﬁsheye cameras. In 25th IEEE International

Conference on Image Processing (ICIP), pages 3164–

3168.

Lowe, D. G. (2004). Distinctive image features from scale-

invariant keypoints. International Journal of Com-

puter Vision, 60:91–110.

Ma, J., Jiang, X., Fan, A., Jiang, J., and Yan, J. (2021).

Image matching from handcrafted to deep features:

A survey. International Journal of Computer Vision,

129:23–79.

Maddern, W., Pascoe, G., Linegar, C., and Newman, P.

(2017). 1 year, 1000 km: The oxford robotcar

dataset. The International Journal of Robotics Re-

search, 36(1):3–15.

Mikolajczyk, K. and Schmid, C. (2005). A perfor-

mance evaluation of local descriptors. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence,

27(10):1615–1630.

Pautrat, R. and Sarlin, P.-E. (2021). Github SuperPoint Im-

plementation.

Rashed, H., Mohamed, E., Sistu, G., Kumar, V. R., Eis-

ing, C., El-Sallab, A., and Yogamani, S. (2021). Gen-

eralized object detection on ﬁsheye cameras for au-

tonomous driving: Dataset, representations and base-

line. In Proceedings of the IEEE/CVF WACV.

Ravi Kumar, V., Yogamani, S., Bach, M., Witt, C.,

Milz, S., and Mader, P. (2020). UnRectDepthNet:

Self-Supervised Monocular Depth Estimation using a

Generic Framework for Handling Common Camera

Distortion Models. In International Conference on In-

telligent Robots and Systems (IROS).

Ravikumar, V., Yogamani, S., Rashed, H., Sistu, G., Witt,

C., Leang, I., Milz, S., and Mader, P. (2021). Om-

nidet: Surround view cameras based multi-task visual

perception network for autonomous driving. IEEE

Robotics and Automation Letters, pages 1–1.

Rosten, E. and Drummond, T. (2006). Machine learning

for high-speed corner detection. In Proceedings of the

European Conference on Computer Vision (ECCV).

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.

(2011). Orb: An efﬁcient alternative to sift or surf. In

Proceedings of the IEEE/CVF International Confer-

ence on Computer Vision (ICCV), pages 2564–2571.

Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich,

A. (2016). Universal correspondence network. In Pro-

ceedings of Conference on Neural Information Pro-

cessing Systems (NIPS).

Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich,

A. (2020). Superglue: Learning feature matching

with graph neural networks. In Proceedings of CVPR,

pages 4938–4947.

Shi, J. and Tomasi, C. (1994). Good features to track. In

Proceedings of CVPR.

Tripathi, N. and Yogamani, S. (2021). Trained trajectory

based automated parking system using Visual SLAM.

In Proceedings of CVPR Workshops.

Uric

ar, M., Ulicny, J., Sistu, G., Rashed, H., Krizek, P.,

Hurych, D., Vobecky, A., and Yogamani, S. (2019).

Desoiling dataset: Restoring soiled areas on automo-

tive ﬁsheye cameras. In Proceedings of the IEEE/CVF

International Conference on Computer Vision Work-

shops.

Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley,

P., O’Dea, D., Uric

ar, M., et al. (2019). Wood-

scape: A multi-task, multi-camera ﬁsheye dataset for

autonomous driving. In Proceedings of the IEEE/CVF

International Conference on Computer Vision (ICCV).

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

347