Detecting Anomalous Regions from an Image based on Deep Captioning

Yusuke Hatae¹, Qingpu Yang², Muhammad Fikko Fadjrimiratno², Yuanyuan Li¹, Tetsu Matsukawa¹ (https://orcid.org/0000-0002-8841-6304) and Einoshin Suzuki¹ (https://orcid.org/0000-0001-7743-6177)

¹ISEE, Kyushu University, Fukuoka, 819-0395, Japan
²SLS, Kyushu University, Fukuoka, 819-0395, Japan
liyuanyuan95@yahoo.co.jp, {matsukawa, suzuki}@inf.kyushu-u.ac.jp
Keywords: Anomaly Detection, Anomalous Image Region Detection, Deep Captioning, Word Embedding.
Abstract: In this paper we propose a one-class method for detecting anomalous regions in an image based on deep captioning. Such a method can be installed on an autonomous mobile robot, which reports anomalies from its observations without any human supervision, and would interest a wide range of researchers, practitioners, and users. In addition to the image features used by conventional methods, our method exploits recent advances in deep captioning, which is based on deep neural networks trained on large-scale data of image-caption pairs, enabling anomaly detection at the semantic level. Incremental clustering is adopted so that the robot is able to model its observations as a set of clusters and report substantially new observations as anomalies. Extensive experiments using two real-world datasets demonstrate the superiority of our method over the traditional approach in terms of recall, precision, F measure, and AUC. The experiments also show that our method exhibits an excellent learning curve and low threshold dependency.
1 INTRODUCTION
Anomaly detection refers to the problem of finding
patterns in data that do not conform to expected be-
havior (Chandola et al., 2009). Its applications are
rich in variety and include fraud detection for credit
cards, insurance, or health care, intrusion detection
for cyber-security, fault detection in safety critical
systems, and military surveillance for enemy activi-
ties (Chandola et al., 2009). Recently we have wit-
nessed a large number of works on detecting anoma-
lous regions from an image, which include image di-
agnosis in medicine (Schlegl et al., 2017), construc-
tion of a patrol robot (Lawson et al., 2017; Lawson
et al., 2016; Kato et al., 2012) or a journalist robot
(Matsumoto et al., 2007; Suzuki et al., 2011), anoma-
lous behavior detection in a crowded scene (Mahade-
van et al., 2011), and classification of dangerous sit-
uations including fires, injured persons, and car acci-
dents (Arriaga et al., 2017).
Among the detection methods (Schlegl et al., 2017; Mahadevan et al., 2011; Arriaga et al., 2017), we believe that one-class anomaly detection (Schlegl et al., 2017), in which the training data contain no anomalous example, is the most valuable and challenging, as it requires no human supervision and assumes
the most realistic environment. The method proposed
by Schlegl et al. (Schlegl et al., 2017) conducts
anomaly detection by using a kind of Deep Neural
Network (DNN) called Generative Adversarial Net-
work (GAN) (Goodfellow et al., 2014). In this work,
GAN learns the probabilistic distribution of a huge
number of training images and is then able to gen-
erate a new image based on the input noise. Since
an anomalous image deviates from the probabilistic
distribution of the images in the training data, it is
difficult for GAN to generate a similar one, resulting
in a large reconstruction error. The method (Schlegl
et al., 2017) relies on the reconstruction error in judg-
ing whether a test image is anomalous.
However, the method (Schlegl et al., 2017) can
hardly learn an accurate probabilistic distribution
if it is employed on an autonomous mobile robot,
which captures images at various positions and an-
gles. Moreover, large intra-object variations pose an additional challenge: e.g., two women can look highly dissimilar, though for the purpose of anomaly detection both might better be recognized simply as women.
We therefore propose a method using image region captioning, which generates captions for salient regions in a given image (Johnson et al., 2016). Based on appropriate captions for salient regions, we can expect higher detection accuracy on anomalous regions in images with large intra-object variations captured from different viewpoints. For instance, our approach would be able to associate two regions showing highly dissimilar women with each other if they had the same caption "woman is standing", leading to more accurate detection of anomalous regions. Note that such short texts can be effectively handled with word embedding techniques such as Word2Vec (Mikolov et al., 2013). We implement our approach by combining features of DenseCap (Johnson et al., 2016) and Word2Vec (Mikolov et al., 2013). Since DenseCap trains a DNN on a huge amount of data of image-caption pairs, our method exploits these data through the DNN in its anomaly detection task.
2 TARGET PROBLEM
We solve the target problem of detecting anomalous regions of an input image by judging each salient region detected from the image as either normal or anomalous. Let the target images be $H_1, \ldots, H_n$; by image region captioning we obtain $m(i)$ regions from $H_i$, which are transformed into $m(i)$ region data $(\mathbf{b}_{i1}, \ldots, \mathbf{b}_{im(i)})$. Here $m(i)$ represents the number of regions detected from $H_i$.
Each region data $\mathbf{b}_{it}$ consists of two kinds of vectors, $\mathbf{b}_{it} = (\mathbf{r}_{it}, \mathbf{c}_{it})$, where $\mathbf{r}_{it} = (x^{\max}_{it}, y^{\max}_{it}, x^{\min}_{it}, y^{\min}_{it})$ represents the x and y coordinates of two diagonal vertices of the region rectangle and $\mathbf{c}_{it}$ is the caption that explains the t-th region.
By definition anomalous examples are extremely
rare compared with normal examples and rich in va-
riety. This nature makes it hard to collect anomalous
examples and include them in the training data. It
is therefore common to tackle one-class anomaly de-
tection, in which the training data contain no anoma-
lous example. We also adopt this problem setting and
tackle one-class anomaly detection.
As kinds of anomalies, we assume anomalous objects, anomalous actions, and anomalous positions to be detected from $\mathbf{b}_{it}$ in this paper. Here an anomalous object represents an object which is highly dissimilar to the objects in the training data. Note that a detector has to recognize objects and their similarities to cope with this kind of anomaly. Similarly, an anomalous action represents an action which is highly dissimilar to the actions in the training data. Thus, for example, the action of talking on a cellular phone is recognized as an anomaly if few persons do it in the training data. Finally, an anomalous position represents an object located at a position that is highly unlikely in the training data. For example, a book on the floor is recognized as being at an anomalous position if few books were on the floor in the training data.
3 PROPOSED METHOD
3.1 Overview
Figure 1 shows the processing steps of the proposed method to detect anomalous regions from an input image $H_i$. In the first step of the training phase, the image regions and their captions $(\mathbf{b}_{i1}, \ldots, \mathbf{b}_{im(i)})$ are generated from $H_i$. We used DenseCap (Johnson et al., 2016) for this step.
In the next step of the training phase, each caption is transformed into feature vectors which are appropriate for anomaly detection. We mainly used Word2Vec (Mikolov et al., 2013) for this step. As a better substitute for the x and y coordinates $\mathbf{r}_{it}$ of the two diagonal vertices of the region rectangle, we used the normalized x and y coordinates $\mathbf{r}'_{it} = (x^{\mathrm{center}}_{it}, y^{\mathrm{center}}_{it})$ of the center point:

$$x^{\mathrm{center}}_{it} = \frac{x^{\min}_{it} + x^{\max}_{it}}{2w} \qquad (1)$$

$$y^{\mathrm{center}}_{it} = \frac{y^{\min}_{it} + y^{\max}_{it}}{2h}, \qquad (2)$$

where $w$ and $h$ are the horizontal and vertical sizes of the image, respectively. Note that this substitute is more robust to a change of the distance to the object.
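As a minimal illustration, the following Python sketch (a hypothetical helper, not part of the original implementation) computes the normalized center coordinates of Eqs. (1) and (2) from a DenseCap-style bounding box.

def normalized_center(x_min, y_min, x_max, y_max, w=720, h=404):
    """Normalized center of Eqs. (1) and (2); w and h are the image width and height."""
    x_center = (x_min + x_max) / (2.0 * w)
    y_center = (y_min + y_max) / (2.0 * h)
    return (x_center, y_center)

# Example: a 100 x 100 region at the image center maps to roughly (0.5, 0.5).
print(normalized_center(310, 152, 410, 252))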
From the $m(i)$ image regions detected in image $H_i$, we extract the output vectors $(\mathbf{V}_{i1}, \ldots, \mathbf{V}_{im(i)})$ of the penultimate layer of a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012), each normalized by its L2 norm, as the image features. Then we concatenate the caption features, the image features, and the normalized coordinates into one vector for the next step.
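A minimal sketch of this step, assuming the publicly available Keras VGG-16 model that Section 4.2 reports using and its "fc2" (penultimate) layer; the function name is ours, and cropping/resizing each DenseCap region to 224 x 224 pixels is assumed to happen beforehand.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")  # public model trained on ImageNet
fc2 = Model(inputs=base.input, outputs=base.get_layer("fc2").output)  # penultimate layer

def image_feature(region_rgb):
    """region_rgb: a (224, 224, 3) crop of one detected region."""
    x = preprocess_input(np.expand_dims(region_rgb.astype("float32"), axis=0))
    v = fc2.predict(x, verbose=0)[0]        # 4096-dimensional vector V_it
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalization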
In the last step of the training phase, our method
clusters the feature vectors with the clustering method
BIRCH (Zhang et al., 1997). Here BIRCH is used
to model normal examples through clustering in the
training phase, which allows us to detect anomalies in
the test phase.
In the test phase, the feature vector of a test region
is judged anomalous if its distance to the closest clus-
ter is above R. Otherwise it is judged as normal. In the
subsequent sections, we explain each step in detail.
Figure 1: Processing steps of the proposed method.
3.2 Generating Captions of an Image
Region
As we stated previously, we use DenseCap (Johnson et al., 2016) to generate captions for $m(i)$ regions from image $H_i$. For simplicity we adopt the $K$ regions $(\mathbf{b}_{i1}, \ldots, \mathbf{b}_{iK})$ with the highest confidence scores given by DenseCap, and thus $m(i) = K$.
We used the model of DenseCap which is trained
with Visual Genome (Krishna et al., 2017) and
available to the public. Visual Genome (Krishna
et al., 2017) comprises 94,000 images and 4,100,000
region-grounded captions. Thus we exploit the data
for our anomaly detection via the deep captioning
model of DenseCap.
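DenseCap itself is a Torch/Lua model, so we only sketch the post-processing in Python: given a list of detections with boxes, captions, and confidence scores (the field names below are illustrative assumptions, not DenseCap's actual output format), keeping the K highest-scoring regions could look as follows.

def top_k_regions(detections, k=10):
    """detections: list of dicts with 'box', 'caption', and 'score' entries."""
    ranked = sorted(detections, key=lambda r: r["score"], reverse=True)
    return ranked[:k]  # the K regions b_i1, ..., b_iK with the highest confidence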
3.3 Caption Features based on Word
Embedding
In the next step, we obtain the caption features of each region based on word embedding. First we remove stopwords such as articles and prepositions from the caption $\mathbf{c}_{it}$ in the region data $\mathbf{b}_{it} = (\mathbf{r}_{it}, \mathbf{c}_{it})$, as it is widely known that stopwords carry little meaning in natural language processing. As stopwords we used the list in the nltk library of Python (https://www.nltk.org/index.html). We denote the remaining word sequence by $\mathbf{c}'_{it}$.
Then we obtain the distributed representations of the words using Word2Vec. Word2Vec returns the respective vectors $\mathbf{U}_{it1}, \ldots, \mathbf{U}_{itT}$ of the words $w_1, \ldots, w_T$, where $T$ represents the number of words in $\mathbf{c}'_{it}$. We set the dimension of the distributed representation to 300. We also obtain the normalized coordinates $\mathbf{r}'_{it}$ of the center from the coordinates $\mathbf{r}_{it}$ of the target region with Eqs. (1) and (2). The caption feature vector $F_{\mathrm{cap}}(\mathbf{b}_{it})$ of the target region is the concatenation of the L2-normalized mean $\mathbf{M}_{it}$ of the word distributed representations and $d\mathbf{r}'_{it}$:

$$\mathbf{M}_{it} = \frac{1}{T} \sum_{j=1}^{T} \mathbf{U}_{itj} \qquad (3)$$

$$F_{\mathrm{cap}}(\mathbf{b}_{it}) = \mathbf{M}_{it} \oplus d\mathbf{r}'_{it}, \qquad (4)$$

where $d$ is a hyper-parameter which controls the influence of $w$ and $h$, and $\oplus$ represents the concatenation operator.
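A sketch of the caption feature computation, assuming a gensim-style Word2Vec model with 300-dimensional vectors and the nltk English stopword list; words missing from the embedding vocabulary are simply skipped here, a detail the paper does not specify.

import numpy as np
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def caption_feature(caption, center_xy, w2v, d=2.0, dim=300):
    """F_cap of Eq. (4): L2-normalized mean word vector M_it concatenated with d * r'_it."""
    words = [t for t in caption.lower().split() if t not in STOPWORDS]
    vecs = [w2v[t] for t in words if t in w2v]           # U_it1, ..., U_itT
    m = np.mean(vecs, axis=0) if vecs else np.zeros(dim)  # Eq. (3)
    m = m / (np.linalg.norm(m) + 1e-12)                   # L2 normalization of M_it
    return np.concatenate([m, d * np.asarray(center_xy)])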
3.4 Combination of the Caption
Features and the Image Features
In addition to the caption features, we also generate the image features $F_{\mathrm{im}}(\mathbf{b}_{it})$ and the combined features $F_{\mathrm{comb}}(\mathbf{b}_{it})$, which is a concatenation of the two kinds of features:

$$F_{\mathrm{im}}(\mathbf{b}_{it}) = \mathbf{V}_{it} \oplus d\mathbf{r}'_{it} \qquad (5)$$

$$F_{\mathrm{comb}}(\mathbf{b}_{it}) = \mathbf{M}_{it} \oplus \mathbf{V}_{it} \oplus d\mathbf{r}'_{it}. \qquad (6)$$

Note that the combined features $F_{\mathrm{comb}}(\mathbf{b}_{it})$ correspond to our method, while the other two kinds of features, $F_{\mathrm{cap}}(\mathbf{b}_{it})$ and $F_{\mathrm{im}}(\mathbf{b}_{it})$, serve as baseline methods in the experiments.
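Continuing the sketches above, the three feature variants of Eqs. (4)-(6) are plain concatenations; the variable names follow the notation in the text, and v_it is assumed to be the L2-normalized image feature from Section 3.1.

import numpy as np

def feature_im(v_it, center_xy, d=2.0):
    """F_im of Eq. (5): image feature V_it concatenated with d * r'_it."""
    return np.concatenate([v_it, d * np.asarray(center_xy)])

def feature_comb(m_it, v_it, center_xy, d=2.0):
    """F_comb of Eq. (6): caption mean M_it, image feature V_it, and d * r'_it."""
    return np.concatenate([m_it, v_it, d * np.asarray(center_xy)])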
3.5 Unsupervised Anomaly Detection
based on Clustering
In the last step, we detect anomalies based on clustering of the feature vectors $F(\mathbf{b}_{it})$ of the target image regions obtained in the previous step. As we stated previously, we use BIRCH (Zhang et al., 1997) due to its efficiency.
In BIRCH, the feature vector $F(\mathbf{b}_{it})$ is assigned to the closest leaf node of its CF (Clustering Feature) tree, which abstracts its observation in the form of a height-balanced tree. Let the CF vector of this leaf node be $(N_k, \mathbf{S}_k, SS_k)$. When the radius of the addition of the CF vector of a new example and the CF vector of its closest leaf exceeds a user-specified parameter $\theta$, the new example becomes a new leaf node and its parent node is reconstructed with a standard procedure for a height-balanced tree (Zhang et al., 1997). We use the distance between the CF vector of $F(\mathbf{b}_{it})$ and $(N_k, \mathbf{S}_k, SS_k)$ as the degree of anomaly of $F(\mathbf{b}_{it})$. The target region is detected as anomalous when the distance exceeds a user-specified threshold $R$.
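A minimal sketch of the clustering-based detection, using scikit-learn's Birch as a stand-in for the original BIRCH implementation; scikit-learn exposes subcluster centers rather than raw CF vectors, so the anomaly score below (distance to the nearest subcluster center) is only an approximation of the CF-based distance described above.

import numpy as np
from sklearn.cluster import Birch

def fit_birch(train_features, theta=0.1):
    """Build the CF tree from normal training features; theta is the leaf radius threshold."""
    model = Birch(threshold=theta, n_clusters=None)  # skip the global clustering step
    model.fit(np.asarray(train_features))
    return model

def anomaly_score(model, feature):
    """Distance of a test feature to the closest subcluster center."""
    dists = np.linalg.norm(model.subcluster_centers_ - feature, axis=1)
    return float(dists.min())

def is_anomalous(model, feature, R):
    """Judge the region anomalous when the score exceeds the threshold R."""
    return anomaly_score(model, feature) > R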
4 EXPERIMENTS
4.1 Datasets
We conducted experiments with two kinds of datasets,
each of which contains a sequence of images ex-
tracted from a video clip and additional images
some of which include anomalous regions. The se-
quence of images, which contain no anomalous re-
gion, are used for training and the additional images
are used for testing. We used only indoor images in the experiments as our TurtleBot 2 with Kobuki (https://www.turtlebot.com/) is recommended to operate indoors. The first dataset consists of images taken in a room by our TurtleBot with a Microsoft Kinect for Windows v2. It contains 4768 images as the sequence, sampled every second, and an additional 358 images, 15 of which contain anomalous regions. The 15 images contain anomalous actions such as a person with an umbrella in a room and anomalous positions such as books on the floor, as shown in Fig. 2. Note that our interests are directed toward building human-monitoring robots, which explains our use of in-house data. Tackling larger benchmark data with higher variations would require more accurate image captioning.
The second dataset consists of images taken in a refresh corner with a video recorder. It contains about 16800 images as the sequence and an additional 715 images, 31 of which contain anomalous regions. The 31 images contain anomalous actions such as a person under a table and anomalous positions such as a bag on the floor.
In applying DenseCap, we set K = 10 as the num-
ber of the detected regions for each image. We in-
spected each test image and annotated anomalous re-
gions for evaluating detection methods. In the in-
spection process, an anomalous region was defined as
either an anomalous object, an anomalous action, or
an anomalous position as we explained in Section 2.
The annotation was based on the images only; the captions were thus ignored in the process. Figs. 2 and 3 show examples of images in the first and second datasets, respectively. The left and middle images in Fig. 3 represent examples of an anomalous action of hiding under a table (schools in Japan teach students to take this action during strong shaking in an earthquake) and an anomalous position of a bag on the floor, respectively. As a result, the numbers of normal and anomalous examples in the test data of the first dataset are 3545 and 35, respectively, while they are 7103 and 47 in the second dataset.
4.2 Design of the Experiments
We conducted five kinds of experiments. The first two were for performance evaluation: one with all data and the other for plotting the learning curves. The next two were for investigating the dependencies on parameters: one for the threshold parameter $\theta$ of the radius of a leaf node in building the CF tree and the other for the threshold $R$ of anomaly detection. The last one was an ablation study of the coordinate information $d\mathbf{r}'_{it}$. The run-time of the employed anomaly detection methods was negligible compared to the sampling time of one second. In the first kind of experiments, the detection performance was measured in terms of precision, recall, F measure, and AUC (Area Under the ROC Curve). In the second kind of experiments, a varying proportion of the training data was selected randomly, and for each proportion a detection method was applied 10 times to different data. We report the average performance in AUC.
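The detection performance could be computed as follows with scikit-learn, assuming binary ground-truth labels per region and the anomaly scores from the sketch in Section 3.5 (the function and variable names are ours).

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true, scores, R):
    """y_true: 1 for anomalous regions, 0 for normal ones; scores: anomaly scores."""
    y_pred = [1 if s > R else 0 for s in scores]
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, scores),  # threshold-free, from the raw scores
    }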
As baseline methods, we used $F_{\mathrm{cap}}(\mathbf{b}_{it})$ and $F_{\mathrm{im}}(\mathbf{b}_{it})$. To obtain $\mathbf{V}_{it}$, we used VGG-16 (Simonyan and Zisserman, 2015) via Keras (https://keras.io/ja/applications/). Since our training data are unlabeled, we used a public VGG-16 model trained on ImageNet (Deng et al., 2009). The 4096-dimensional image features were obtained with VGG-16 for each region detected with DenseCap and resized to 224 × 224 pixels. In this sense, the baseline method with $F_{\mathrm{im}}(\mathbf{b}_{it})$ also exploits DenseCap for detecting salient regions, but not the generated captions.
Figure 2: Examples of images in the first dataset. The left and middle images contain anomalous regions (a man with an
umbrella and a book on the floor, respectively) while the right one does not.
Figure 3: Examples of images in the second dataset. The left and middle images contain anomalous regions (a woman under
a table and a bag on the floor, respectively) while the right one does not.
Table 1: Precision, recall, and F measure in the first data.

                  Precision   Recall   F measure
Caption           0.500       0.303    0.377
Image             0.464       0.394    0.426
Image + Caption   0.533       0.485    0.508

Table 2: Precision, recall, and F measure in the second data.

                  Precision   Recall   F measure
Caption           0.0165      0.0638   0.0262
Image             0.0404      0.0851   0.0548
Image + Caption   0.0654      0.1489   0.0909
Figure 4: ROC curve and AUC in the first data.
In most of the experiments, we used $R = 0.7$, $R = 1.0$, and $R = 1.2$ for the caption features $F_{\mathrm{cap}}(\mathbf{b}_{it})$, the image features $F_{\mathrm{im}}(\mathbf{b}_{it})$, and the combined features $F_{\mathrm{comb}}(\mathbf{b}_{it})$, respectively; these values were determined by parameter tuning. We investigate the influence of $R$ in the fourth kind of experiments. Throughout the experiments, the horizontal and vertical sizes of the images were $w = 720$ and $h = 404$: DenseCap shrinks the original images of 1920 × 1080 pixels to approximately half, i.e., 720 × 404 pixels.
Except in the third and fifth kinds of experiments,
for the threshold θ to build the CF tree, we used θ =
0.1 for all features. As for the hyper-parameter d, we
set d = 2 except in the fifth kind of experiments.
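For reference, the default settings used in the experiments can be summarized as a single configuration; the values are taken from this section, while the dictionary itself is only an illustrative convenience.

DEFAULT_CONFIG = {
    "K": 10,                 # regions kept per image (Section 4.1)
    "word2vec_dim": 300,     # dimension of the word embedding
    "d": 2,                  # weight of the normalized center coordinates
    "theta": 0.1,            # BIRCH leaf radius threshold
    "R": {"caption": 0.7, "image": 1.0, "combined": 1.2},  # detection thresholds
    "image_size": (720, 404),  # (w, h) after DenseCap resizing
}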
4.3 Results of the Experiments
4.3.1 Performance Evaluation
Table 1 shows that the combined features outperform
the remaining features in precision, recall, and F mea-
sure. Fig. 4 shows that our method, the combined
features, outperforms the other two kinds of features
in AUC.
The learning curve of our method obtained in the second kind of experiments for the first dataset is shown in Fig. 5. Note that we adopted a log scale for the training data proportion throughout the investigation.
Figure 5: Learning curve of AUC in the first data.
We see that the combined features always outperform the other kinds of features, and that the superiority is larger for smaller samples. Our method is still relatively effective even with 1% of the training data.
Figure 6: ROC curve and AUC in the first data (50%).
Figure 7: ROC curve and AUC in the first data (10%).
Figs. 6-9 show the ROC curves of the three methods for 50%, 10%, 1%, and 0.1% of the data. We see that the tendency of the superiority of our method is consistent with that in Fig. 5, and these figures show more detailed information.
Table 2 again shows that the combined features outperform the remaining two in F measure. Note that the performance is much lower than in the first dataset due to the challenging nature of the second dataset. Figure 10 shows that our combined features outperform the remaining two in AUC.
Figure 8: ROC curve and AUC in the first data (1%).
Figure 9: ROC curve and AUC in the first data (0.1%).
Figure 10: ROC curve and AUC in the second data.
Figure 11: Learning curve of AUC in the second data.
Compared with the results of the first dataset, the
performance deteriorated substantially, indicating the
difficulty of this dataset. The learning curve of our
method obtained by the second kind of experiments
for the second dataset is shown in Fig. 11. Though
AUCs are lower than those for the first dataset, we
again see the superiority of the combined features
over the remaining two.
Figs. 12-15 show the ROC curves of the three methods for 50%, 10%, 1%, and 0.1% of the data. We again see that the tendency of the superiority of our method is consistent with that in Fig. 11.
Figure 12: ROC curve and AUC in the second data (50%).
Figure 13: ROC curve and AUC in the second data (10%).
Figure 14: ROC curve and AUC in the second data (1%).
Figure 15: ROC curve and AUC in the second data (0.1%).
4.3.2 Parameter Dependencies
Figure 16: Dependency on parameter θ for the first dataset.
Figure 17: Dependency on parameter θ for the second
dataset.
Figure 16 shows the results of the third kind of experiments to investigate the dependency on parameter $\theta$ for the first dataset. The horizontal axis is in log scale. Figure 16 shows that the performance is stable in some range but degrades substantially beyond the end of the range. We attribute this to the fact that when the value of $\theta$ exceeds the scale of the features, all samples are contained in the same cluster. Figure 17 shows the results for the second dataset. This figure shows the same tendency as Fig. 16, which justifies our analysis above. The combined features have a much wider range of best values than the baseline methods, which shows that our method is much less affected by the parameter setting. We again see the deterioration of the performance compared with the first dataset, which again indicates the difficulty of this dataset.
Figure 18: Dependency on parameter R for the first dataset.
Figure 19: Dependency on parameter R for the second
dataset.
For the fourth kind of experiments to investigate the dependency on the threshold $R$, we show the results on the first dataset in Fig. 18. The figure shows that the results of the three kinds of features exhibit different peaks although they are all normalized. One possible reason is their different dimensionalities: the image features have a much higher dimensionality than the caption features, and thus the pairwise distances are much larger for the former, which requires a larger threshold value for optimal performance. Another possible reason stems from the fact that the captions of two similar images are identical or similar, which results in smaller pairwise distances, and thus the optimal threshold values are much smaller than
those for the image features. We show the results on
the second dataset in Fig. 19, which show similar ten-
dencies.
Figure 20: Ablation study of the coordinate information (d
= 0) for the first dataset.
Figure 21: Ablation study of the coordinate information (d
= 0) for the second dataset.
4.3.3 Ablation Study and Examples
For the fifth kind of experiments, on the influence of using the coordinate information, we show the results with the first dataset in Fig. 20. We set $d = 0$ to remove this influence and measured the performance by varying $\theta$ as in the third kind of experiments. Compared with Fig. 16, the performance of the caption features deteriorated while those of the other two kinds of features did not. The reason could be attributed to the existence of position information in the latter two, unlike in the caption features. Figure 21 shows the results for the second dataset, which show similar tendencies.
To investigate the difference between the caption features and the image features, we show in Fig. 22 an example (the purple rectangle) of an anomalous region detected by the caption feature method and overlooked by the image feature method (our combined feature method successfully detected this example). For this region, DenseCap generated the caption "a basket on the back of the chair", which helped the caption features by providing semantic information not easily obtained from the image features. Note that even though the caption is wrong in our sense, it was useful in our task of detecting anomalous image regions.

Figure 22: Example of the anomalies detected by the caption feature method and overlooked by the image feature method (the purple rectangle).

Figure 23: Example of the anomalies detected by the image feature method and overlooked by the caption feature method (the red rectangle).

We also show in Fig. 23 an example of an anomalous region (the red rectangle) detected by the method with the image features and overlooked by the method with the caption features (our combined feature method failed to detect this example). For this region, DenseCap generated the caption "a man playing a game", which "fooled" the caption feature method by providing wrong information. Though there is no image in which a person is playing a game in the training data, DenseCap erroneously generated the caption "a man playing a video game" for several regions in the training images. These regions in the training data are the cause of the overlook by the caption feature method.
5 CONCLUSIONS
We proposed an anomalous region detection method
from an image based on deep captioning. Deep cap-
tioning allows us to exploit the domain knowledge in
Visual Genome (Krishna et al., 2017), which consists
of a set of pairs of image regions and their captions, in
our task. By processing the captions with a word em-
bedding method Word2Vec, our anomalous detection
is conducted at the semantic level. Our experiments
show the superiority of our method over the baseline
methods which rely on either image features or the
caption features. Recent experiments further showed
that our method is also effective against unseen ob-
jects in the training data and misclassified objects by
image captioning to some extent.
Our ongoing work includes finalizing an au-
tonomous mobile robot for anomaly detection from
its observation. Such a robot is able to integrate vi-
sual information with verbal information and thus has
a large potential in a variety of tasks. The challenge compared with multi-modal DNNs (Ngiam et al., 2011) is how to exploit a deep captioning model trained on other data, though our approach can be applied to domains with much less data. Equipping the robot with deep reinforcement learning (Zhu et al., 2017) is one of our next goals. Integration with our human monitoring based on skeletons (Deguchi and Suzuki, 2015; Deguchi et al., 2017) and facial expressions (Kondo et al., 2014; Fujita et al., 2019) is also promising. Using high-level feedback from humans is another important issue, which would overcome the inefficiency of online learning with lower-level rewards.
ACKNOWLEDGEMENTS
A part of this work was supported by JSPS KAK-
ENHI Grant Number JP18H03290.
REFERENCES
Arriaga, O., Plöger, P. G., and Valdenegro, M. (2017). Image Captioning and Classification of Dangerous Situations. arXiv preprint arXiv:1711.02578.
Chandola, V., Banerjee, A., and Kumar, V. (2009).
Anomaly Detection: A Survey. ACM Computing Sur-
veys, 41(3).
Deguchi, Y. and Suzuki, E. (2015). Hidden Fatigue Detec-
tion for a Desk Worker Using Clustering of Succes-
sive Tasks. In Ambient Intelligence, volume 9425 of
LNCS, pages 263–238. Springer-Verlag.
Deguchi, Y., Takayama, D., Takano, S., Scuturici, V.-M.,
Petit, J.-M., and Suzuki, E. (2017). Skeleton Clus-
tering by Multi-Robot Monitoring for Fall Risk Dis-
covery. Journal of Intelligent Information Systems,
48(1):75–115.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Li,
F.-F. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In Proc. CVPR, pages 248–255.
Fujita, H., Matsukawa, T., and Suzuki, E. (2019). Detecting
Outliers with One-Class Selective Transfer Machine.
Knowledge and Information Systems. (accepted for
publication).
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A. C., and Ben-
gio, Y. (2014). Generative Adversarial Nets. In Proc.
NIPS, pages 2672–2680.
Johnson, J., Karpathy, A., and Fei-Fei, L. (2016). DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proc. CVPR, pages 4565–4574.
Kato, H., Harada, T., and Kuniyoshi, Y. (2012). Visual
Anomaly Detection from Small Samples for Mobile
Robots. In Proc. IROS, pages 3171–3178.
Kondo, R., Deguchi, Y., and Suzuki, E. (2014). Developing
a Face Monitoring Robot for a Deskworker. In Am-
bient Intelligence, volume 8850 of LNCS, pages 226–
241. Springer-Verlag.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2017). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
ageNet Classification with Deep Convolutional Neu-
ral Networks. In Proc. NIPS, volume 1, pages 1097–
1105.
Lawson, W., Hiatt, L., and Sullivan, K. (2016). Detecting Anomalous Objects on Mobile Platforms. In Proc. CVPR Workshop.
Lawson, W., Hiatt, L., and Sullivan, K. (2017). Finding Anomalies with Generative Adversarial Networks for a Patrolbot. In Proc. CVPR Workshop.
Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N.
(2011). Anomaly Detection in Crowded Scenes. In
Proc. CVPR.
Matsumoto, R., Nakayama, H., Harada, T., and Kuniyoshi,
Y. (2007). Journalist Robot: Robot System Making
News Articles from Real World. In Proc. IROS, pages
1234–1241.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Ef-
ficient Estimation of Word Representations in Vector
Space. In Proc. ICLR.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng,
A. Y. (2011). Multimodal Deep Learning. In Proc.
ICML, pages 689–696.
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. (2017). Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In Proc. International Conference on Information Processing in Medical Imaging.
Simonyan, K. and Zisserman, A. (2015). Very Deep Convo-
lutional Networks for Large-scale Image Recognition.
In Proc. ICLR.
Suzuki, T., Bessho, F., Harada, T., and Kuniyoshi, Y.
(2011). Visual Anomaly Detection under Temporal
and Spatial Non-Uniformity for News Finding Robot.
In Proc. IROS, pages 1214–1220.
Zhang, T., Ramakrishnan, R., and Livny, M. (1997).
BIRCH: A New Data Clustering Algorithm and its
Applications. Data Min. Knowl. Discov., 1(2):141–
182.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-
Fei, L., and Farhadi, A. (2017). Target-Driven Visual
Navigation in Indoor Scenes Using Deep Reinforce-
ment Learning. In Proc. ICRA, pages 3357–3364.