Detecting Anomalous 3D Point Clouds Using Pre-Trained Feature Extractors
Dario Mantegazza (https://orcid.org/0000-0001-9088-0897) and Alessandro Giusti (https://orcid.org/0000-0003-1240-0768)
Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Lugano, Switzerland
Keywords: Anomaly Detection, 3D Point Clouds, Deep Learning.
Abstract:
In this paper, we explore the status of the research effort on the task of 3D visual anomaly detection; in particular, we investigate whether it is possible to find anomalies in 3D point clouds using off-the-shelf feature extractors, similar to what is already feasible on images, without requiring an ad-hoc one. Our work uses a model composed of two parts: a feature extraction module and an anomaly detection head. The latter is fixed and works on the embeddings from the feature extraction module. Using the MVTec-3D dataset, we contribute a comparison between a 3D point cloud feature extractor, a 2D image feature extractor, a combination of the two, and three baselines. We also compare our work with other models on the dataset's DETECTION-AUROC benchmark. The experimental results show that, while our proposed approach surpasses the baselines and some other approaches, our best-performing model cannot beat purposely developed ones. We conclude that the combination of dataset size and 3D data complexity is the culprit for the lack of off-the-shelf feature extractors able to solve complex 3D vision tasks.
1 INTRODUCTION
Off-the-shelf pre-trained image feature extractors are increasingly being used in academic and industrial research to build deep-learning models for computer vision tasks. Recently, the Vision Transformer (ViT) (Dosovitskiy et al., 2021) paved a new road for researchers to build even more complex and better-performing models for computer vision tasks. A notable example of ViT-based models is CLIP from OpenAI (Radford et al., 2021), a very large and complex computer vision model trained on an enormous corpus of captioned images to solve a wide range of vision tasks. Due to their size and reliance on large and complex datasets, models such as CLIP can only be developed and trained by a limited set of companies and research labs. However, most of these large models are open source, with pre-trained weights available online (e.g., https://github.com/openai/CLIP). By removing the need to train an ad-hoc feature extractor, researchers can focus on solving the task at hand using the extracted feature embeddings. These embeddings are typically much smaller than the images they derive from and, whereas images contain low-level information, the
embeddings encode visual data into high-level semantic features, allowing researchers to train computer vision models with fewer samples or with smaller models (excluding the extractor).
In a similar fashion, we are seeing a wider use, in both academia and industry, of 3D data, acquired through depth cameras, LIDARs, photogrammetry, or Neural Radiance Fields, for solving computer vision tasks. Nonetheless, one 3D task that is still understudied (Frittoli, 2022) is Anomaly Detection on 3D data. 3D Anomaly Detection has potential applications in many fields, such as health care, industrial product inspection, industrial asset maintenance, site surveillance, and robotics; currently, all of these fields rely only on 2D Anomaly Detection.
A few recent works (Horwitz and Hoshen, 2023; Rudolph et al., 2023; Wang et al., 2023; Chu et al., 2023; Masuda et al., 2021; Floris et al., 2022) approached the task of Anomaly Detection on 3D data, with most of them (Horwitz and Hoshen, 2023; Rudolph et al., 2023; Wang et al., 2023; Chu et al., 2023) focusing on Segmentation of Anomalies.
In this paper, we ask ourselves whether, with the current state of the art in deep learning models for 3D point clouds, it is possible to solve the task of anomaly detection on 3D point clouds without training an ad-hoc
feature extractor, similarly to what we achieved in our previous work (Mantegazza et al., 2023). Using pre-trained 3D feature extractors would remove the need to develop and train an ad-hoc 3D feature extractor, a difficult endeavour, lowering the entrance barrier to 3D visual data analysis.
2 RELATED WORK
2.1 Dataset
For this work, we use the MVTec-3D dataset (Bergmann et al., 2022). To the best of our knowledge, this is the only existing open-access dataset for the task of 3D Anomaly Detection and, more specifically, 3D Anomaly Segmentation.
The MVTec-3D dataset is built for studying the task of 3D anomaly segmentation in the context of industrial mass production; the dataset is composed of more than 4000 high-resolution point clouds and RGB images of 10 different objects with 10 different anomalies, captured using an industrial 3D sensor. The dataset is already subdivided into training, validation, and testing sets; all sets contain normal samples, classified into different object categories, but only the testing set contains anomalous samples.
For each anomalous test sample, a precisely annotated ground truth is provided; a selection of samples from the dataset is shown in Figure 1.
Figure 1: Examples of samples from the MVTec-3D dataset.
2.2 Models and Approaches
2.2.1 Image Anomaly Detection
Anomaly Detection is a widely researched topic (Chandola et al., 2009) applied to a variety of fields (Ruff et al., 2021). As in other Machine Learning branches, Anomaly Detection has been extensively researched with images as the information medium. As explained by Ruff et al. in their review (Ruff et al., 2021), when operating on images, the task of anomaly detection requires an understanding of the normal data in order to find high-level semantic anomalies. Deep learning methods have been used in different fields to extract high-level information from complex data, and this also applies to anomaly detection. Deep learning-based anomaly detection has been studied for many applications, from security (Birnbaum et al., 2015) to healthcare (Schlegl et al., 2017), industrial settings (Scime and Beuth, 2018; Haselmann et al., 2018; Christiansen et al., 2016) and, as mentioned in the Introduction, robotics (Khalastchi et al., 2015; Wellhausen et al., 2020).
2.2.2 3D Anomaly Detection
While Image Anomaly Detection has been studied in different settings, 3D Anomaly Detection, with MVTec-3D as its only representative dataset, is focused solely on anomaly detection for industrial products.
One of the earlier studies on 3D anomaly detection after the release of MVTec-3D is by Horwitz and Hoshen (Horwitz and Hoshen, 2023). In their work, the authors study the task of 3D anomaly detection and segmentation (3DAD&S) and compare several non-purposely made models with their proposed approach on the MVTec-3D dataset. Their objective is to better understand whether the 3D data is useful for this task; from their results, it is clear that while 2D approaches still beat the purposely built 3D ones, for the latter the 3D data is essential. They then provide an analysis of the key properties of a successful 3DAD&S representation, leading to their proposed approach called BTF; while their model achieves very good segmentation performance, the authors acknowledge its limitation in image-level accuracy, the same task we set out to analyse in this work.
Rudolph et al. (Rudolph et al., 2023) propose an Asymmetric Student-Teacher network to solve the anomaly segmentation task on both MVTec-3D and the original image-only MVTec dataset (Bergmann et al., 2021). Differently from other student-teacher networks, their approach uses
normalizing flow models for the teacher and conventional feed-forward convolution blocks for the students, creating a discrepancy in the student predictions outside of the normal data on which the student network is trained.
In their paper, Wang et al. (Wang et al., 2023) use the MVTec-3D dataset to propose a multimodal approach to 3D anomaly detection called Multi-3D-Memory (M3DM). With M3DM, the authors combine features extracted from both 3D point clouds and images: first, patches of point clouds are produced using farthest point sampling; then the points in each patch are encoded using Point Transformer (Zhao et al., 2021), and the resulting features are remapped onto a 2D plane with the same size as the RGB image and averaged into patch-wise features; at the same time, patch features are extracted from the RGB image. Given the sets of patch features for both RGB and point cloud, the authors propose two new learnable modules, called Unsupervised Feature Fusion and Decision Layer Fusion, used respectively to learn the interaction between multimodal features and to deal with the possible information loss that happens during the fusion; the latter module uses multimodal memory banks during inference to produce the final anomaly and segmentation predictions.
Chu et al. (Chu et al., 2023), differently from the others, propose a shape-guided approach for integrating the information from RGB images and point clouds. Their approach uses neural implicit functions to represent local areas of the point clouds. Similarly to others, they first split the point cloud into 3D patches; these patches are passed to a PointNet network, and the resulting features are used by the Neural Implicit Function module to extract components that define signed distance functions that implicitly encode normal local representations; these are then combined with ResNet-extracted RGB features and used to define segmentation maps of the anomalies.
2.2.3 3D Feature Extractor
Image Based. A logical approach to 3D feature extraction is to take approaches well-tested on 2D data and adapt them to the additional dimension. In their paper (Zhang et al., 2022b), the authors propose to bridge the gap between a pre-trained CLIP (Radford et al., 2021) vision transformer and the point clouds from ModelNet (Wu et al., 2015) and ScanObjectNN (Uy et al., 2019) using point-projection images of different views of a single object. For each object, several views are generated, and the resulting set of images is used to extract features. In their work, the authors note that a zero-shot approach leads to poor performance, but with an additional trainable component and few-shot training, the performance on the classification task increases.
Dong et al. (Dong et al., 2023) use pre-trained image vision transformers as part of the teacher encoder to then train a point-cloud-only student encoder module. Their approach, called ACT, uses only x, y, z information and achieves the best performance on both point cloud classification and semantic segmentation.
Point Cloud Based. In contrast to the aforementioned approaches, Zhang et al. (Zhang et al., 2022a) propose a point-cloud-only approach that does not use images or image pre-trained models as part of the pipeline. With Point-M2AE, the authors define and train a multi-scale masked autoencoder with the objective of using it as a zero-shot point cloud encoder. The model is composed of an encoder and a decoder with skip connections; the encoder is fed with differently scaled masked point clouds from the same sample. As for images, the masked autoencoder is trained using a proxy reconstruction task. In the original paper, the approach achieves promising results across different tasks; finally, the authors provide both code and pre-trained models.
In this work, we use a pre-trained version of the encoder, the one the original authors pair with an SVM to solve a linear classification task. The major drawback of Point-M2AE is its limited input size: the point clouds used for training the encoder are limited to 1024 points, while the MVTec-3D point clouds contain hundreds of thousands of points per sample. For this reason, part of the code provided by Zhang et al. has been adapted, and a point sampler has been introduced to reduce the MVTec-3D point clouds to the correct size.
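As an illustration, the sketch below shows one way such a point sampler can be implemented; uniform random subsampling is an assumption on our part (not necessarily the sampling strategy we used), and the function and variable names are illustrative.

```python
import numpy as np

def subsample_point_cloud(points: np.ndarray, n_points: int = 1024,
                          seed: int = 0) -> np.ndarray:
    """Reduce an (N, 3) point cloud to exactly n_points points.

    Uniform random subsampling; if the cloud has fewer points than
    requested, indices are drawn with replacement to reach the target size.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    idx = rng.choice(n, size=n_points, replace=n < n_points)
    return points[idx]

# Example: a synthetic 200k-point cloud reduced to the 1024 points
# accepted by the Point-M2AE encoder.
cloud = np.random.rand(200_000, 3).astype(np.float32)
print(subsample_point_cloud(cloud).shape)  # (1024, 3)
```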
3 EXPERIMENTAL SETUP
While the MVTec-3D dataset is built for 3D anomaly segmentation, in this work we limit ourselves to the binary classification task of anomaly detection: for each point cloud, our model predicts whether it contains an anomaly or not. One of the metrics used in the dataset benchmark is the DETECTION-AUROC (in some cases called Image-AUROC); this metric indicates the AUC for detecting anomalies in the samples, without considering the segmentation. We compare our results to those of the DETECTION-AUROC benchmark, available on the dataset page on Papers with Code (https://paperswithcode.com/sota/rgb-3d-anomaly-detection-and-segmentation-on).
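For clarity, the snippet below sketches how a DETECTION-AUROC value can be computed from per-sample anomaly scores with scikit-learn; the labels and scores are made up for illustration and are not taken from our experiments.

```python
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1, 0, 1]                       # 1 = anomalous sample, 0 = normal
anomaly_scores = [0.1, 0.4, 0.8, 0.3, 0.2, 0.9]   # higher = more anomalous

# AUC of ranking anomalous samples above normal ones (~0.89 here)
print(roc_auc_score(labels, anomaly_scores))
```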
In this work, we use a two-part model: a feature extractor and an anomaly detection head; the latter receives as input the feature embedding from the extractor and produces an anomaly score for each embedding. The detection head is a Real-NVP model and, excluding adaptations to different embedding sizes, it is not changed throughout the experiments; thus, the only changing part is the feature extractor.
3.1 Real-NVP
This model has already been used in multiple recent papers (Wellhausen et al., 2020; Mantegazza et al., 2022; Mantegazza et al., 2023) as an anomaly detector operating on latent embeddings; Real-NVP (Dinh et al., 2016) is a kind of Normalizing Flow model, a deep learning model that learns a mapping between different spaces; in our approach, this is the only component that is trained. The extracted features are passed to a Real-NVP model, which learns a mapping between the feature embedding and a multivariate Gaussian distribution with identity covariance matrix, zero mean, and the same number of dimensions as the embedding. We explored the Real-NVP hyper-parameters empirically, ultimately landing on a setup similar to those used in previous work (Mantegazza et al., 2022; Mantegazza et al., 2023); the model is composed of four coupling layers, each with a single hidden layer of the same size as the input vector and odd input masking; this applies to both the translation and scaling modules. For more details on the specific components, please refer to the original paper (Dinh et al., 2016). The only differences between these experiments and previous works are the internal size of the layers and the input size, which are input-dependent. Note that during training and hyper-parameter search, the Real-NVP always converged to a mapping, excluding its influence on the experiment results. All Real-NVPs are trained for 100 epochs with early stopping, randomly initialized weights, a starting learning rate of 0.001, and the Adam (Kingma and Ba, 2014) optimizer.
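To make the setup concrete, the following is a minimal PyTorch sketch of an affine coupling layer and the anomaly score derived from the negative log-likelihood under the standard Gaussian (constant term omitted). It follows the description above (four coupling layers, a hidden layer of the same size as the input, alternating masks), but it is an illustrative re-implementation, not the code used in our experiments.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP coupling layer: the masked half of the input is kept
    unchanged and conditions the scale/translation of the other half."""
    def __init__(self, dim: int, mask: torch.Tensor):
        super().__init__()
        self.register_buffer("mask", mask)
        self.scale = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, dim), nn.Tanh())
        self.translate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))

    def forward(self, x):
        x_fixed = x * self.mask
        s = self.scale(x_fixed) * (1 - self.mask)
        t = self.translate(x_fixed) * (1 - self.mask)
        z = x_fixed + (1 - self.mask) * (x * torch.exp(s) + t)
        return z, s.sum(dim=1)  # transformed sample and log-determinant

class RealNVP(nn.Module):
    def __init__(self, dim: int, n_layers: int = 4):
        super().__init__()
        mask = (torch.arange(dim) % 2).float()  # odd/even mask, flipped per layer
        self.layers = nn.ModuleList(
            AffineCoupling(dim, mask if i % 2 == 0 else 1 - mask)
            for i in range(n_layers))

    def forward(self, x):
        log_det = torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
        return x, log_det

    def anomaly_score(self, x):
        # Lower likelihood under the Gaussian prior -> higher anomaly score.
        z, log_det = self.forward(x)
        log_prob = -0.5 * (z ** 2).sum(dim=1) + log_det
        return -log_prob

model = RealNVP(dim=384)  # e.g. for Point-M2AE embeddings
print(model.anomaly_score(torch.randn(8, 384)).shape)  # torch.Size([8])
```

During training, the mean anomaly score (i.e. the negative log-likelihood) over a batch of normal embeddings would be minimized with Adam.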
3.2 Baselines
We define two baselines, Random and Ones. The first substitutes the feature extractor component with a random signal sampled from a Normal distribution; the second produces an all-ones feature vector as input for the Real-NVP. We use two different baselines to demonstrate
that the Real-NVP component by itself does not determine the results of our experiments.
Handcrafted Baseline. We also define a set of handcrafted features to serve as an additional baseline. These basic features are heuristics chosen to be easy to compute yet informative enough to detect macroscopic anomalies (e.g., a large piece of an object missing); a sketch of their computation is given after the list. The 11 features are the following:
ft1: number of points in the point cloud
ft2 to ft5: number of points in each of 4 quadrants (i.e., the point cloud split into 4 quadrants from a top view)
ft6 and ft7: maximum and minimum z value over the point cloud's points
ft8 and ft9: maximum and minimum x value over the point cloud's points
ft10 and ft11: maximum and minimum y value over the point cloud's points
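A minimal NumPy sketch of how these features can be computed follows; since the exact quadrant convention is not spelled out above, the split around the cloud's x/y mean is an assumption, and the function name is illustrative.

```python
import numpy as np

def handcrafted_features(points: np.ndarray) -> np.ndarray:
    """Compute the 11 handcrafted features from an (N, 3) array of x, y, z."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cx, cy = x.mean(), y.mean()          # split point for the top-view quadrants
    quadrants = [
        np.sum((x >= cx) & (y >= cy)),   # ft2
        np.sum((x < cx) & (y >= cy)),    # ft3
        np.sum((x < cx) & (y < cy)),     # ft4
        np.sum((x >= cx) & (y < cy)),    # ft5
    ]
    return np.array([
        len(points),                     # ft1: number of points
        *quadrants,                      # ft2-ft5: points per quadrant
        z.max(), z.min(),                # ft6, ft7
        x.max(), x.min(),                # ft8, ft9
        y.max(), y.min(),                # ft10, ft11
    ], dtype=np.float32)

print(handcrafted_features(np.random.rand(5000, 3)).shape)  # (11,)
```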
3.3 XYZ Model
The first non-baseline approach we propose uses Point-M2AE as the feature extractor. This XYZ (i.e., point cloud data) encoder uses positional information from the entire point cloud to produce a feature vector. The Point-M2AE encoder takes as input a 1024-point point cloud and produces 384 features; these are passed to the Real-NVP, which maps them to a latent space where the "normality" probability can be extracted.
3.4 RGB Model
Since the MVTec-3D dataset provides RGB images of the sample scans, we took the CLIP+Real-NVP model from our previous work (Mantegazza et al., 2023) on image anomaly detection and used it to identify anomalies in the dataset. Notice that, while the RGB images are more informative for some specific anomalies (see, for instance, the Color anomaly of the foam object), the overall information provided to the Real-NVP is more limited than the total information contained in a point cloud.
This setup uses the Vision Transformer (ViT) module of CLIP (Radford et al., 2021); the ViT takes the RGB image as input and produces a 512-sized feature vector. As for the previous approaches, the vector is then passed to the Real-NVP component.
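As an example, the snippet below sketches how such a 512-dimensional embedding can be obtained with the openai/CLIP package; we assume the ViT-B/32 variant here (which outputs 512 features), and the image path is illustrative.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 512-d image embeddings

image = preprocess(Image.open("sample_rgb.png")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)
print(features.shape)  # torch.Size([1, 512])

# These features would then be passed to the Real-NVP detection head.
```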
3.5 RGB+XYZ Model
Finally, we test an RGB+XYZ model by simply concatenating the 512 RGB-derived features and the 384 XYZ-derived features into a single vector for each sample.
Even in this case, with an 896-sized input vector, the Real-NVP correctly converged and learned a mapping.
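The fusion itself is a plain concatenation, as sketched here with illustrative tensors:

```python
import torch

rgb_features = torch.randn(1, 512)   # from the CLIP ViT
xyz_features = torch.randn(1, 384)   # from the Point-M2AE encoder
fused = torch.cat([rgb_features, xyz_features], dim=1)
print(fused.shape)  # torch.Size([1, 896]), the input size of the Real-NVP
```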
4 RESULTS
We report all the AUC results of the 46 runs (excluding the hyper-parameter searches) in Tables 2 to 6. The results are color-coded: any AUC value equal to or lower than 0.5 is colored red; higher values shift from red through yellow towards green, which is the color of values near or equal to 1.
In each table, we report the performance split by anomaly type and object class, with the addition of the AUC computed over the whole test set as a binary problem (Test), and the averaged AUC (Test Avg), obtained by averaging the per-object-class AUCs.
As detailed in the tables, for all models except Random and Ones, we also consider the AUC of models trained and tested on samples from a single object class; for example, all rows with bagel as the object class represent models that, during training, validation, and testing, only saw bagels. To the best of our knowledge, we are the first to introduce this kind of experiment for this dataset; our motivation is to study the effect of each object class's characteristics (shape and color) on the performance of each specific model (and thus of each feature extractor).
The best-performing model is the RGB, CLIP-based one, followed closely by the RGB+XYZ one. The RGB model, trained on all objects, achieved an AUC of 0.69 on the test set. We acknowledge that this performance is surpassed by other, more complex models benchmarked on the MVTec-3D dataset, but it is nonetheless better than the baselines and the handcrafted features. Moreover, both the RGB and the RGB+XYZ models beat the purposely built approaches proposed in (Bergmann et al., 2022), namely Voxel VM, Voxel AE and Voxel GAN.
5 CONCLUSION
In this work, we compared different off-the-shelf pre-
trained feature extractors combined with a Real-NVP
model to solve the task of 3D anomaly detection on
the MVTec-3D dataset.
Table 1: Comparing our approaches to the MVTec-3D DETECTION-AUROC benchmark.
Method  DETECTION-AUROC
Shape-Guided (Chu et al., 2023)  0.95
M3DM (Wang et al., 2023)  0.95
AST (Rudolph et al., 2023)  0.94
Back To Feat. (Horwitz and Hoshen, 2023)  0.87
RGB (Ours)  0.69
RGB+XYZ (Ours)  0.67
Voxel VM (Bergmann et al., 2022)  0.61
Voxel AE (Bergmann et al., 2022)  0.54
XYZ (Ours)  0.53
Voxel GAN (Bergmann et al., 2022)  0.52
From our experiments, it is clear that all the tested approaches, while better than the baselines, are not sufficient for the anomaly detection task. We attribute the limited performance to the feature extractors: while performing excellently on their original tasks, the available models are still too limited to solve this task; for example, the XYZ Point-M2AE extractor is strongly limited by the number of points it accepts as input and thus loses important local details that might help detect small anomalies. Future work in this direction would be to either apply Point-M2AE to 3D patches of the point cloud or to find an alternative model that accepts more points than Point-M2AE.
In addition, we think that the dataset is too small; this implies that the Real-NVP, while correctly converging, fails to learn a mapping of the normal samples that correctly identifies anomalies.
With our results, we show that the combination of dataset limitations and the additional complexity of dealing with point clouds rather than images leads to a lack of off-the-shelf models for solving complex 3D vision tasks. This is underlined by the striking difference in performance between our approach and the ad-hoc solutions.
We think that research on 3D anomaly detection is just at its beginning; more models and datasets are needed in order to achieve performance and ease of use comparable to those of 2D vision. Finally, while we understand the complexity of collecting and labelling a dataset for 3D Anomaly Detection, we strongly encourage future work in this direction.
REFERENCES
Bergmann, P. et al. (2021). The mvtec anomaly detection
dataset: a comprehensive real-world dataset for unsu-
pervised anomaly detection. Int. Journal of Computer
Vision, 129:1038–1059.
Bergmann, P., Jin, X., Sattlegger, D., and Steger, C. (2022). The mvtec 3d-ad dataset for unsupervised 3d
Table 2: AUC of the Baseline models.
Baseline  Obj seen  Bent  Color  Comb.  Contam.  Crack  Cut  Hole  Open  Thread  Test  Test Avg
random all 0.49 0.53 0.48 0.50 0.50 0.51 0.51 0.55 0.56 0.50 0.51
ones all 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
Table 3: AUC of the XYZ model.
Obj seen  Bent  Color  Comb.  Contam.  Crack  Cut  Hole  Open  Thread  Test  Test Avg
all 0.32 0.96 0.51 0.51 0.58 0.52 0.55 0.61 0.68 0.53 0.58
bagel 0.71 0.67 0.71 0.68 0.69 0.69
cable gland 0.65 0.44 0.56 0.56 0.55 0.55
carrot 0.56 0.53 0.65 0.65 0.59 0.60 0.60
cookie 0.70 0.47 0.58 0.48 0.56 0.56
dowel 0.48 0.44 0.45 0.50 0.47 0.47
foam 0.48 0.65 0.49 0.54 0.54 0.54
peach 0.46 0.49 0.41 0.46 0.46 0.46
potato 0.43 0.47 0.33 0.40 0.41 0.41
rope 0.63 0.66 0.96 0.72 0.75
tire 0.30 0.49 0.50 0.46 0.48 0.44
Table 4: AUC of the RGB model.
Obj seen  Bent  Color  Comb.  Contam.  Crack  Cut  Hole  Open  Thread  Test  Test Avg
all 0.63 0.99 0.78 0.64 0.83 0.59 0.72 0.53 0.61 0.69 0.70
bagel 0.92 0.63 1.00 0.79 0.84 0.83
cable gland 0.65 0.73 0.68 0.62 0.67 0.67
carrot 0.77 0.68 0.71 0.59 0.72 0.69 0.69
cookie 0.62 0.56 0.77 0.52 0.62 0.62
dowel 0.87 0.91 0.77 0.72 0.82 0.82
foam 1.00 0.82 0.70 0.75 0.82 0.82
peach 0.60 0.56 0.66 0.49 0.57 0.58
potato 0.60 0.59 0.43 0.41 0.51 0.51
rope 0.63 0.63 0.87 0.69 0.71
tire 0.48 0.54 0.52 0.59 0.55 0.53
Table 5: AUC of the XYZ+RGB model.
Obj seen  Bent  Color  Comb.  Contam.  Crack  Cut  Hole  Open  Thread  Test  Test Avg
all 0.53 1.00 0.75 0.63 0.81 0.60 0.66 0.62 0.61 0.67 0.69
bagel 0.93 0.67 0.99 0.81 0.85 0.85
cable gland 0.74 0.69 0.73 0.69 0.71 0.72
carrot 0.76 0.72 0.70 0.62 0.73 0.70 0.70
cookie 0.67 0.58 0.84 0.58 0.67 0.67
dowel 0.83 0.89 0.76 0.63 0.78 0.78
foam 1.00 0.84 0.67 0.63 0.78 0.78
peach 0.57 0.59 0.60 0.49 0.56 0.56
potato 0.60 0.55 0.43 0.37 0.49 0.49
rope 0.62 0.67 0.87 0.70 0.72
tire 0.30 0.54 0.53 0.55 0.52 0.48
Table 6: AUC of the Handcrafted features model.
Obj seen  Bent  Color  Comb.  Contam.  Crack  Cut  Hole  Open  Thread  Test  Test Avg
all 0.35 0.78 0.53 0.51 0.58 0.48 0.48 0.30 0.13 0.49 0.46
bagel 0.52 0.44 0.48 0.48 0.48 0.48
cable gland 0.47 0.49 0.64 0.43 0.51 0.51
carrot 0.51 0.42 0.46 0.49 0.35 0.45 0.45
cookie 0.61 0.58 0.67 0.73 0.65 0.65
dowel 0.59 0.52 0.64 0.51 0.57 0.56
foam 0.67 0.60 0.65 0.69 0.65 0.65
peach 0.52 0.60 0.48 0.48 0.52 0.52
potato 0.00 0.00 0.00 0.00 0.00 0.00
rope 0.60 0.59 0.69 0.62 0.62
tire 0.60 0.48 0.58 0.77 0.61 0.61
anomaly detection and localization. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 202–213. INSTICC, SciTePress.
Birnbaum, Z. et al. (2015). Unmanned aerial vehicle secu-
rity using behavioral profiling. In 2015 International
Conference on Unmanned Aircraft Systems (ICUAS),
pages 1310–1319.
Chandola, V., Banerjee, A., and Kumar, V. (2009).
Anomaly detection: A survey. ACM Comput. Surv.,
41(3).
Christiansen, P. et al. (2016). Deepanomaly: Combining
background subtraction and deep learning for detect-
ing obstacles and anomalies in an agricultural field.
Sensors, 16(11):1904.
Chu, Y.-M., Liu, C., Hsieh, T.-I., Chen, H.-T., and Liu, T.-L.
(2023). Shape-guided dual-memory learning for 3D
anomaly detection. In Krause, A., Brunskill, E., Cho,
K., Engelhardt, B., Sabato, S., and Scarlett, J., editors,
Proceedings of the 40th International Conference on
Machine Learning, volume 202 of Proceedings of Ma-
chine Learning Research, pages 6185–6194. PMLR.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density
estimation using real nvp.
Dong, R., Qi, Z., Zhang, L., Zhang, J., Sun, J., Ge, Z., Yi,
L., and Ma, K. (2023). Autoencoders as cross-modal
teachers: Can pretrained 2d image transformers help
3d representation learning? In The Eleventh Interna-
tional Conference on Learning Representations.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Min-
derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2021). An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale.
arXiv:2010.11929 [cs].
Floris, A., Frittoli, L., Carrera, D., and Boracchi, G. (2022).
Composite layers for deep anomaly detection on 3d
point clouds.
Frittoli, L. (2022). Advanced learning methods for anomaly detection in multivariate datastreams and point clouds. PhD thesis, Politecnico di Milano.
Haselmann, M., Gruber, D. P., and Tabatabai, P. (2018).
Anomaly detection using deep learning based im-
age completion. In 2018 17th IEEE International
Conference on Machine Learning and Applications
(ICMLA), pages 1237–1242.
Horwitz, E. and Hoshen, Y. (2023). Back to the Feature:
Classical 3D Features Are (Almost) All You Need for
3D Anomaly Detection. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
2967–2976.
Khalastchi, E., Kalech, M., Kaminka, G. A., and Lin, R.
(2015). Online data-driven anomaly detection in au-
tonomous robots. Knowledge and Information Sys-
tems, 43(3):657–688.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Mantegazza, D., Giusti, A., Gambardella, L. M., and Guzzi,
J. (2022). An outlier exposure approach to im-
prove visual anomaly detection performance for mo-
bile robots. IEEE Robotics and Automation Letters,
pages 1–8.
Mantegazza, D., Xhyra, A., Giusti, A., and Guzzi, J. (2023).
Active Anomaly Detection for Autonomous Robots:
A Benchmark. In Iida, F., Maiolino, P., Abdulali, A.,
and Wang, M., editors, Towards Autonomous Robotic
Systems, Lecture Notes in Computer Science, pages
315–327, Cham. Springer Nature Switzerland.
Masuda, M., Hachiuma, R., Fujii, R., Saito, H., and
Sekikawa, Y. (2021). Toward unsupervised 3d point
cloud anomaly detection using variational autoen-
coder. In 2021 IEEE International Conference on Im-
age Processing (ICIP), pages 3118–3122.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
Transferable Visual Models From Natural Language
Supervision. arXiv:2103.00020 [cs].
Rudolph, M., Wehrbein, T., Rosenhahn, B., and Wandt,
B. (2023). Asymmetric Student-Teacher Networks
for Industrial Anomaly Detection. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), pages 2592–2602.
Ruff, L. et al. (2021). A unifying review of deep and shal-
low anomaly detection. Proceedings of the IEEE,
109(5):756–795.
Schlegl, T. et al. (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Information Processing in Medical Imaging, pages 146–157, Cham.
Scime, L. and Beuth, J. (2018). A multi-scale convolu-
tional neural network for autonomous anomaly detec-
tion and classification in a laser powder bed fusion
additive manufacturing process. Additive Manufac-
turing, 24:273–286.
Uy, M. A., Pham, Q.-H., Hua, B.-S., Nguyen, T., and
Yeung, S.-K. (2019). Revisiting point cloud clas-
sification: A new benchmark dataset and classifica-
tion model on real-world data. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV).
Wang, Y., Peng, J., Zhang, J., Yi, R., Wang, Y., and Wang,
C. (2023). Multimodal Industrial Anomaly Detection
via Hybrid Fusion. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 8032–8041.
Wellhausen, L., Ranftl, R., and Hutter, M. (2020). Safe
robot navigation via multi-modal anomaly detection.
IEEE Robotics and Automation Letters, 5(2):1326–
1333.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,
and Xiao, J. (2015). 3d shapenets: A deep representa-
tion for volumetric shapes. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1912–1920.
Zhang, R., Guo, Z., Gao, P., Fang, R., Zhao, B., Wang, D.,
Qiao, Y., and Li, H. (2022a). Point-M2AE: Multi-
scale Masked Autoencoders for Hierarchical Point
Cloud Pre-training. In Koyejo, S., Mohamed, S.,
Agarwal, A., Belgrave, D., Cho, K., and Oh, A., edi-
tors, Advances in Neural Information Processing Sys-
tems, volume 35, pages 27061–27074. Curran Asso-
ciates, Inc.
Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B.,
Qiao, Y., Gao, P., and Li, H. (2022b). Pointclip: Point
cloud understanding by clip. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 8552–8562.
Zhao, H., Jiang, L., Jia, J., Torr, P., and Koltun, V. (2021).
Point Transformer. arXiv:2012.09164 [cs].