Detection of Human Rights Violations in Images: Can Convolutional Neural Networks Help?

Grigorios Kalliatakis (1), Shoaib Ehsan (1), Maria Fasli (1), Ales Leonardis (2), Juergen Gall (3) and Klaus D. McDonald-Maier (1)

(1) School of Computer Science and Electronic Engineering, University of Essex, Colchester, U.K.
(2) School of Computer Science, University of Birmingham, Birmingham, U.K.
(3) Institute of Computer Science III, University of Bonn, Bonn, Germany
{gkallia, sehsan, mfasli, kdm}@essex.ac.uk, a.leonardis@cs.bham.ac.uk, gall@iai.uni-bonn.de
Keywords: Convolutional Neural Networks, Deep Representation, Human Rights Violations Recognition.

Abstract: After setting the performance benchmarks for image, video, speech and audio processing, deep convolutional networks have been core to the greatest advances in image recognition tasks in recent times. This raises the question of whether there is any benefit in targeting these remarkable deep architectures with the unattempted task of recognising human rights violations through digital images. Under this perspective, we introduce a new, well-sampled human rights-centric dataset called Human Rights Understanding (HRUN). We conduct a rigorous evaluation on a common ground by combining this dataset with different state-of-the-art deep convolutional architectures in order to achieve recognition of human rights violations. Experimental results on the HRUN dataset have shown that the best performing CNN architectures can achieve up to 88.10% mean average precision. Additionally, our experiments demonstrate that increasing the size of the training samples is crucial for achieving an improvement in mean average precision, principally when utilising very deep networks.
1 INTRODUCTION
Human rights violations continue to take place in many parts of the world today, as they have throughout human history. These days, organizations concerned with human rights are increasingly using digital images as a mechanism for supporting the exposure of human rights and international humanitarian law violations. However, little progress has yet been made in utilising current advances in technology for studying, prosecuting and possibly preventing such misconduct. From this perspective, supporting human rights is seen as one scientific domain that could be strengthened by the latest developments in computer vision. In light of the continued growth of images and videos in human rights and international humanitarian law monitoring campaigns, this study examines how vision-based systems can support human rights monitoring efforts by accurately detecting and identifying human rights violations in digital images.
This work is made possible by recent progress
in Convolutional Neural Networks (CNNs) (LeCun
et al., 1989), which has changed the landscape for
well-studied computer vision tasks, such as image
classification and object detection (Wang et al., 2010;
Huang et al., 2011), by comprehensively outperform-
ing the initial handcrafted approaches (Donahue et al.,
2014; Sharif Razavian et al., 2014; Sermanet et al.,
2013). These state-of-the-art architectures are now
finding their way into a number of vision based ap-
plications (Girshick et al., 2014; Oquab et al., 2014;
Simonyan and Zisserman, 2014a).
A major contribution of our paper is a new, well-
sampled human rights-centric dataset, called the Hu-
man Rights Understanding (HRUN) dataset, which
consists of 4 different categories of human rights vi-
olations and 100 diverse images per category. In this
paper, we formulate the human rights violation recog-
nition problem as being able to recognise a given
input image (from the HRUN dataset) as belonging
to one of these 4 categories of human rights viola-
tions. See Figure 1 for examples of our data. We use this data for human rights understanding by evaluating different deep representations on the new dataset, and we perform experiments that illustrate the effect of network architecture, image context, and training data size on the accuracy of the system.
Figure 1: Examples from all 4 categories of the Human Rights Understanding (HRUN) Dataset.
In summary, our contribution is two-fold. Firstly, we introduce a new human rights-centric dataset, HRUN. Secondly, motivated by the great success of
deep convolutional networks, we conduct a large set
of rigorous experiments for the task of recognising
human rights violations. As part of our tests, we delve
into the latest, top-performing pre-trained deep con-
volutional models, allowing a fair, unbiased compar-
ison on a common ground; something that has been
largely missing so far in the literature. The remain-
der of the paper is organised as follows. Section 2
looks into prior works on database construction and
image understanding with deep convolutional net-
works. Section 3 describes the methodology utilised
for building our pioneer human rights understand-
ing dataset. Section 4 demonstrates the classification
pipeline used for the experiments, while the evalu-
ation results are presented in Section 5, alongside a
thorough discussion. Finally, conclusions and future
directions are given in Section 6.
2 PRIOR WORK
2.1 Database Construction
Challenging databases are important for many areas
of research, while large-scale datasets combined with
CNNs have been key to recent advances in computer
vision and machine learning applications. While
the field of computer vision has developed several
databases to organize knowledge about object cate-
gories (Deng et al., 2009; Griffin et al., 2007; Tor-
ralba et al., 2008; Fei-Fei et al., 2007), scenes (Zhou
et al., 2014; Xiao et al., 2010) or materials (Liu et al., 2010; Sharan et al., 2009; Bell et al., 2015), a well-inspected dataset of images depicting human rights violations does not currently exist. The first reference point for a standardized dataset of images and annotations was the VOC2010 dataset (Everingham et al.,
2010), which was constructed by utilizing images col-
lected by non-vision/machine learning researchers, by
querying Flickr with a number of related keywords,
including the class name, synonyms and scenes or sit-
uations where the class is likely to appear. Similarly,
an extensive scene understanding (SUN) database
was introduced by (Xiao et al., 2010), containing 899
environments and 130,519 images. The primary ob-
jectives of this work were to build the most complete
dataset of scene image categories. Microsoft’s work on the detection and segmentation of objects in their natural context was marked by the introduction of the Common Objects in Context (MS COCO) dataset (Lin et al., 2014), which includes 328,000 images of complex everyday scenes with 91 different object categories and 2.5 million labelled instances.
More recently, (Yu et al., 2015) presented the first version of a scene-centric database (LSUN) with millions of labelled images in each category, alongside an
integrated framework which makes use of deep learn-
ing techniques in order to achieve large-scale image
annotation. To our knowledge, this particular work
is the first attempt to construct a well-sampled image
database in the domain of human rights understand-
ing.
2.2 Deep Convolutional Networks
For decades, traditional machine learning systems de-
manded accurate engineering and significant domain
expertise in order to design a feature extractor capa-
ble of converting raw data (such as the pixel values
of an image) into a convenient internal representa-
tion or feature vector from which a classifier could
classify or detect patterns in the input. Today, rep-
resentation learning methods and principally CNNs
(LeCun et al., 1989) are driving advances at a dra-
matic pace in the computer vision field after enjoying
a great success in large-scale image recognition and
object detection tasks (Krizhevsky et al., 2012; Ser-
manet et al., 2013; Simonyan and Zisserman, 2014a;
Tompson et al., 2015; Taigman et al., 2014; LeCun
et al., 2015). The key aspect of deep learning repre-
sentations is that the layers of features are not man-
ually hand-crafted, but are learned from data using a
general-purpose learning scheme. The architecture of
a typical deep-learning system can be considered as
a multilayer stack of simple modules, each one trans-
forming its input to increase both the selectivity and
the invariance of the representation as stated in (Le-
Cun et al., 2015). In the last few years such vision tasks have become feasible thanks to high-performance computing systems such as GPUs, extensive public image repositories (Deng et al., 2009), a new regularisation tech-
nique called dropout (Srivastava et al., 2014), which prevents deep learning systems from overfitting, rectified linear units (ReLU) (Nair and Hinton, 2010), the softmax layer, and techniques able to generate more training examples by deforming the existing ones.
Since (Krizhevsky et al., 2012) first used an eight
layer CNN (also known as AlexNet) trained on Im-
ageNet to perform 1000-way object classification, a
number of other works have used deep convolutional
networks (ConvNets) to elevate image classification
further (Simonyan and Zisserman, 2014b; He et al.,
2015a; Szegedy et al., 2015; He et al., 2015b; Chat-
field et al., 2014). (Simonyan and Zisserman, 2014b)
use a very deep CNN (also known as VGGNet) with
up to 19 weight layers for large-scale image clas-
sification. They demonstrated that a substantially
increased depth of a conventional ConvNet (LeCun
et al., 1989; Krizhevsky et al., 2012) can result in
state-of-the-art performance on the ImageNet chal-
lenge dataset (Deng et al., 2009). They also per-
form localization for the same challenge by training
a very deep ConvNet to predict the bounding box
location instead of the class scores at the last fully
connected layer. Another deep network architecture
that has been recently used to great success is the
GoogLeNet model of (Szegedy et al., 2015) where an
inception layer is composed of a shortcut branch and
a few deeper branches in order to improve utilization
of the computing resources inside the network. The
two main ideas of that architecture are: (i) to create a
multi-scale architecture capable of mirroring correla-
tion structure in images and (ii) dimensional reduction
and projections to keep their representation sparse
along each spatial scale. Most recently, (He et al., 2015a) announced the even deeper residual network (also known as ResNet), featuring 152 layers, which has considerably improved the state-of-the-art performance on ImageNet (Deng et al., 2009) classification and on object detection on PASCAL (Everingham et al., 2010). Residual networks are inspired by the observation that plain networks tend to suffer higher training errors as their depth increases to very large values. The authors argue that although the network gains more parameters by increasing its depth, it becomes inferior at function approximation because gradients and training signals are lost as they are propagated through numerous layers. Therefore, they give convincing theoretical
and practical evidence that residual connections (re-
formulated layers for learning residual functions with
reference to the layer input) are inherently necessary
for training very deep convolutional models.
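To make the residual-connection idea concrete, the following is a minimal, self-contained sketch of a single residual block, written with plain NumPy and fully-connected layers for brevity. It only illustrates the principle of learning a residual function F(x) and adding it back to the input; it is not the convolutional block design of (He et al., 2015a), and all names and sizes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Minimal residual block: the stacked layers learn F(x) and the block
    outputs relu(F(x) + x), so the identity mapping is the easy default and
    the shortcut gives gradients a direct path through the block."""
    f = relu(x @ w1)   # first weight layer + non-linearity
    f = f @ w2         # second weight layer (no activation yet)
    return relu(f + x) # add the shortcut (identity) connection

# Toy usage: a 64-d activation passed through one residual block.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)  # (64,)
```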
Figure 2: Constructing HRUN Dataset.

Outside of the aforementioned top-performing networks, other works worth mentioning are: (Chatfield et al., 2014), where a rigorous evaluation study
on different CNN architectures for the task of object
recognition was conducted and (Zhou et al., 2014)
where a brand-new scene-centric database called
Places was introduced and established state-of-the-
art results on different scene recognition tasks, by
learning deep representations from their extensive
database. Despite these impressive results, human rights advocacy is one of the high-profile domains that remain broadly missing from the list of problems that have benefited from the continuing growth of deep convolutional networks. We build
on this body of work in deep learning to solve the
untrodden problem of recognising human rights vio-
lations utilising digital images.
3 HRUN DATASET
Recent achievements in computer vision can be
mainly ascribed to the ever growing size of visual
knowledge in terms of labelled instances of objects,
scenes, actions, attributes, and the dependent rela-
tionships between them. Therefore, obtaining effec-
tive high-level representations has become increas-
ingly important, while a key question arises in the
context of human rights understanding: how will we
gather this structured visual knowledge?
This section describes the image collection proce-
dure utilised for the formulation of the HRUN dataset,
as captured by Figure 2.
Initially, the keywords used to formulate the query terms were collected in collaboration with specialists in the human rights domain, so that more than one query term was available for every targeted class. For instance, for the class police violence the queries ‘police violence’, ‘police brutality’ and ‘police abuse of force’ were all used
for retrieving results. Work commenced with the Flickr photo-sharing website, but it quickly became apparent that its limitations resulted in a large number of irrelevant results being returned for the given queries, because Flickr users are authorised to tag their uploaded images without restriction. For example, when the given keyword was ‘armed conflict’, the majority of the returned images depicted military parades; similarly, for the keyword ‘genocide’, the returned results included campaigns protesting against genocide, which may be thematically close to the keyword but cannot serve our purpose by any means. Another shortcoming arose when people deliberately mis-tagged an image en masse in order to acquire an increased number of hits on the photo-sharing website. Consequently, the Google and Bing search engines were chosen as a better alternative. Images were downloaded for each class using a Python interface to the Google and Bing application programming interfaces (APIs), requesting the maximum number of images permitted by the respective API for each query term. All exact duplicate images were eliminated from the downloaded image set, alongside images regarded as inappropriate during the filtering step, as illustrated by Figure 3. Nonetheless, the number of filtered images was still insufficient, as shown in Table 1.

Table 1: Image Collection Analysis from Search Engines.
Query Term      | Retrieved (Google) | Retrieved (Bing) | Relevant (Google) | Relevant (Bing) | Manually Added | HRUN Total
Child Labour    | 99                 | 137              | 18                | 5               | 77             | 100
Child Soldiers  | 176                | 159              | 31                | 13              | 56             | 100
Police Violence | 149                | 232              | 10                | 16              | 74             | 100
Refugees        | 111                | 140              | 10                | 39              | 51             | 100

Figure 3: Side-by-side examples of irrelevant images with their respective query term, which were eliminated during the filtering process.
Because the number of filtered images was insufficient, other suitable images were manually added in order to reach the final structure of the HRUN dataset. We finally ended up with a total of four different categories, each one containing 100 distinct images of human rights violations captured in real-world situations and surroundings. With this first attempt, our main intention was to produce a high-quality dataset for the task at hand, and for that reason the number of categories was kept small for the time being. Expanding the dataset, both in categories and in number of images, is already part of our future plans, and many other online repositories that might be related to human rights violations are being examined thoroughly.
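As an aside, the exact-duplicate elimination step described in this section can be illustrated with a short sketch that hashes raw file bytes and keeps only the first occurrence of each hash. This is a hypothetical stand-in for the filtering scripts actually used, and the directory layout is assumed; near-duplicates and inappropriate images still require the manual inspection described above.

```python
import hashlib
import os

def remove_exact_duplicates(image_dir):
    """Delete byte-identical duplicates in one category folder, keeping the
    first copy seen. Only exact duplicates are caught; near-duplicates and
    irrelevant images still need the manual filtering step."""
    seen = {}
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)      # duplicate of an earlier file
        else:
            seen[digest] = path
    return seen

# Hypothetical usage for one HRUN category folder:
# remove_exact_duplicates("hrun/police_violence")
```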
4 LEARNING DEEP
REPRESENTATIONS FOR
HUMAN RIGHTS VIOLATIONS
RECOGNITION
4.1 Transfer Learning
Our goal is to train a system that recognises different
human rights violations from a given input image of
the HRUN dataset. One high-priority research issue
in our work is how to find a good representation for in-
stances in such a unique domain. More than that, hav-
ing a dataset of sufficient size is problematic for this
task as described in the previous section. For those
two reasons, a conventional alternative to training a deep ConvNet from scratch is to use a pre-trained model as a fixed feature extractor for the task of interest. This method, referred to as transfer learning (Donahue et al., 2014; Zeiler and Fergus, 2014), is implemented by taking a pre-trained CNN, replacing the fully-connected layers (and potentially the last convolutional layer), and treating the rest of the ConvNet as a fixed feature extractor for the relevant dataset. By freezing the weights of
the convolutional layers, the deep ConvNet can still
extract general image features such as edges, while
the fully connected layers can take this information
and use it to classify the data in a way that is applica-
ble to the problem.
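The experiments in this paper use Caffe reference models (Section 4.2); purely to illustrate the fixed-feature-extractor strategy described above, the sketch below uses a pre-trained VGG-16 from the torchvision model zoo (an assumption for illustration, not the authors' setup). It freezes all pre-trained weights and exposes the 4096-dimensional FC7 activation as the deep representation for one image.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained VGG-16 and freeze all of its weights so it acts purely
# as a fixed feature extractor (no fine-tuning).
vgg = models.vgg16(pretrained=True)
vgg.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Keep everything up to the FC7 layer (4096-d) and drop the final
# 1000-way ImageNet classifier.
feature_extractor = nn.Sequential(
    vgg.features, vgg.avgpool, nn.Flatten(), *list(vgg.classifier[:-1])
)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc7_features(image_path):
    """Return a 4096-d deep representation for one HRUN image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return feature_extractor(img).squeeze(0).numpy()
```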
4.2 Pipeline for Human Rights
Violations Recognition
The entire pipeline used for the experiments is de-
picted in Figure 4, and detailed further below. In this
pipeline, every block is fixed except the feature ex-
tractor as different deep convolutional networks are
plugged in, one at a time, to compare their perfor-
mance utilizing the mean average precision (mAP)
metric.
Figure 4: An overview of the human rights violations recognition pipeline used here. Different deep convolutional models are plugged into the pipeline one at a time, while the training and test samples taken from the HRUN dataset remain fixed. The mean average precision (mAP) metric is used for evaluating the results.

Given a training dataset T_r consisting of m human rights violation categories, a test dataset T_s comprising unseen images of the categories given in T_r, and a set of n pre-trained CNN architectures (C_1, ..., C_n), the pipeline operates as follows. The training dataset T_r is used as input to the first CNN architecture C_1. The output of C_1, as described above, is then utilized to train m SVM classifiers. Once trained, the test dataset T_s is employed to assess the performance of the pipeline using mAP. The training and testing procedures are then repeated after replacing C_1 with the second CNN architecture C_2 to evaluate the performance of the human rights violation recognition pipeline. For a set of n pre-trained CNN architectures, the training and testing processes are repeated
n times. Since the entire pipeline is fixed (includ-
ing the training and test datasets, learning procedure
and evaluation protocol) for all n CNN architectures,
the differences in the performance of the classifica-
tion pipeline can be attributed to the specific CNN ar-
chitectures used. For comparison, 10 different deep
CNN architectures were identified, grouped by the
common paper which they were first made public:
a) 50-layer ResNet, 101-layer ResNet and 152-layer
ResNet presented in (He et al., 2015a); b) 22-layer
GoogLeNet (Szegedy et al., 2015); c) 16-layer VGG-
Net and 19-layer VGG-Net introduced in (Simonyan
and Zisserman, 2014b) ; d) 8-layer VGG-S, 8-layer
VGG-M and 8-layer VGG-F displayed in (Chatfield
et al., 2014); and e) 8-layer Places (Zhou et al., 2014),
as they represent the state-of-the-art for image classi-
fication tasks. For further design and implementation
details for these models, please refer to their respec-
tive papers. To ensure a fair comparison, all the stan-
dardised CNN models used in our experiments are
based on the opensource Caffe framework (Jia et al.,
2014) and are pre-trained on 1000 ImageNet (Deng
et al., 2009) classes with the exception of Places CNN
(Zhou et al., 2014) which was trained on 205 scenes
categories of Places database. For the majority of the
networks, the dimensionality of the last hidden layer
(FC7) leads to a 4096x1 dimensional image represen-
tation. Since the GoogLeNet (Szegedy et al., 2015)
and the ResNet (He et al., 2015a) architectures do not
utilise fully connected layers at the end of their net-
works, the last hidden layers before average pooling
at the top of the ConvNet are exploited with 1024x7x7
and 2048x7x7 feature maps respectively, to counter-
balance the behaviour of the pool layers, which pro-
vide downsampling regarding the spatial dimensions
of the input.
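To summarise the per-architecture procedure in code, the following is a minimal sketch of one pipeline iteration for a single plugged-in CNN C_i, assuming the deep features for the training and test images have already been extracted (for example with a routine like the fc7_features sketch above). The SVM hyper-parameters and class-balancing details are assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def run_pipeline(train_feats, train_labels, test_feats, test_labels, m=4):
    """One pass of the recognition pipeline for a single CNN C_i: the deep
    features feed m one-vs-rest linear SVMs, and the test-set average
    precision of each category is averaged into mAP."""
    aps = []
    for c in range(m):
        svm = LinearSVC(C=1.0)
        svm.fit(train_feats, (train_labels == c).astype(int))
        scores = svm.decision_function(test_feats)  # confidence per test image
        aps.append(average_precision_score((test_labels == c).astype(int), scores))
    return float(np.mean(aps)), aps

# Hypothetical usage, where Xtr/Xts are feature matrices extracted by one of
# the n plugged-in CNN architectures and ytr/yts are category indices:
# mAP, per_class_ap = run_pipeline(Xtr, ytr, Xts, yts)
```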
5 EXPERIMENTS AND RESULTS
5.1 Evaluation Details
The evaluation process is divided into two different
sets of scenarios, each one making use of an explicit
split of images between the training and testing sam-
ples of the pipeline. For the first scenario, a split of
70/30 was utilised, while for the second scenario the
split was adjusted to 50/50 for training and testing
images respectively. Additionally, three distinct series of tests were conducted for each scenario, each one assembled with a completely random shuffle of the entire image set for every category of the HRUN dataset. This approach ensures an unbiased comparison with a rather limited dataset like HRUN at present. The compound results of all three tests are given in Table 2 and Table 3 and analysed below.
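Assuming the shuffling above amounts to a random permutation of each category's 100 images, the split protocol could be reproduced with a sketch like the following; the seeds, shuffling details and how the three runs are aggregated are assumptions, as the paper does not specify them.

```python
import numpy as np

def make_splits(n_images=100, train_fraction=0.7, n_runs=3, seed=0):
    """Generate the evaluation splits for one HRUN category: the image
    indices are randomly shuffled n_runs times and cut into training and
    testing portions (70/30 for scenario 1, 50/50 for scenario 2)."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_images))
    splits = []
    for _ in range(n_runs):
        order = rng.permutation(n_images)
        splits.append((order[:n_train], order[n_train:]))
    return splits

# Scenario 1: three 70/30 splits; Scenario 2: three 50/50 splits.
scenario1 = make_splits(train_fraction=0.7)
scenario2 = make_splits(train_fraction=0.5)
```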
5.2 Results and Discussion
It is evident from Table 2 and Table 3 that the Slow CNN architecture (VGG-S) performs the best for the child labour category in both scenarios. VGG with 16 layers performs the best in the case of child soldiers under scenario 1, while scenario 2's best performing architecture is Places, with VGG-16 coming genuinely close. Places was also the best performing architecture for the category of police violence in both scenarios. Lastly, for the refugees category, the Slow version of VGG was the dominant architecture in both scenarios. Since our work is the
Table 2: Human rights violations classification results with a 70/30 split for training and testing images. Mean average
precision (mAP) accuracy for different CNNs. Bold font highlights the leading mAP result for every experiment.
Model      | Dimensional Representation | mAP   | Child Labour | Child Soldiers | Police Violence | Refugees
ResNet 50  | 100K                       | 42.59 | 41.12        | 43.69          | 43.81           | 41.73
ResNet 101 | 100K                       | 42.07 | 40.48        | 44.78          | 42.56           | 40.48
ResNet 152 | 100K                       | 45.80 | 44.27        | 44.11          | 48.08           | 46.73
GoogLeNet  | 50K                        | 48.62 | 42.72        | 40.71          | 61.91           | 49.16
VGG 16     | 4K                         | 77.46 | 70.79        | 77.71          | 83.46           | 77.87
VGG 19     | 4K                         | 47.01 | 31.69        | 50.98          | 73.79           | 31.57
VGG - M    | 4K                         | 67.93 | 59.52        | 62.96          | 81.45           | 67.80
VGG - S    | 4K                         | 78.19 | 80.17        | 64.46          | 87.46           | 80.68
VGG - F    | 4K                         | 64.15 | 45.42        | 63.20          | 84.78           | 63.21
Places     | 4K                         | 68.59 | 55.67        | 65.60          | 93.17           | 59.92
Table 3: Human rights violations classification results with a 50/50 split for training and testing images. Mean average
precision (mAP) accuracy for different CNNs. Bold font highlights the leading mAP result for every experiment.
Model      | Dimensional Representation | mAP   | Child Labour | Child Soldiers | Police Violence | Refugees
ResNet 50  | 100K                       | 70.94 | 73.15        | 68.07          | 70.44           | 72.09
ResNet 101 | 100K                       | 68.46 | 69.50        | 66.90          | 68.34           | 69.09
ResNet 152 | 100K                       | 76.20 | 80.60        | 73.07          | 72.00           | 79.12
GoogLeNet  | 50K                        | 55.92 | 41.48        | 60.21          | 55.52           | 66.48
VGG 16     | 4K                         | 84.79 | 79.15        | 87.94          | 89.47           | 82.59
VGG 19     | 4K                         | 60.39 | 35.72        | 72.67          | 83.10           | 50.08
VGG - M    | 4K                         | 78.94 | 68.71        | 82.32          | 89.99           | 74.74
VGG - S    | 4K                         | 88.10 | 84.84        | 88.14          | 91.92           | 87.50
VGG - F    | 4K                         | 73.46 | 53.57        | 78.78          | 90.41           | 71.08
Places     | 4K                         | 81.40 | 62.04        | 89.97          | 95.70           | 77.90
first effort in the literature to recognise human rights
violations, we are not able to compare our experimen-
tal results with other works.
However, the results are unquestionably promising
and reveal that the best performing CNN architectures
can achieve up to 88.10% mean average precision
when recognising human rights violations. On the
other hand, some of the regularly top performing deep
ConvNets, such as GoogLeNet and ResNet, fell short
for this particular task compared to the others. Such
weaker performance occurs primarily because of the
limited dataset size, whereby learning millions of pa-
rameters of those very deep convolutional networks is
usually impractical and may lead to over-fitting. An-
other interpretation could be due to the inadequate
structure of the image representation deducted from
the last hidden layer before average pooling compared
to the FC7 layer of the others. Furthermore, it is
clear that by utilising the 50/50 split of images in the
course of scenario 2, there is a considerable boost in
performance of the human rights violations recogni-
tion pipeline as compared to the first scenario when
a split of 70/30 was employed for training and test-
ing images respectively. Figure 5 depicts the effect
of two varying training data sizes (scenario 1 vs sce-
nario 2) on the performance of different deep convolu-
tional networks. Remarkably, scenario 2, where the half-and-half split was applied, accomplishes a notable improvement in mean average precision, spanning from 4.03% up to 36.33% across all four HRUN categories tested. Only on two occasions was scenario 2 outperformed by scenario 1, both of them when GoogLeNet was selected, for the categories of ‘child labour’ and ‘police violence’. This observation strengthens the point of view discussed above regarding the last hidden layer of this model.
Nonetheless, in all instances a mean average preci-
sion greatly above 40% was achieved, which can be
regarded as an impressive outcome given the uncon-
ventional nature of the problem, the limited dataset
which was adopted for learning deep representations
and the transfer learning approach that was employed
here.
6 CONCLUSIONS
Recognising human rights violations through digital
images is a new and challenging problem in the area
of computer vision.

Figure 5: Comparison of deep convolutional network performance, with reference to mAP, for the two scenarios appearing in our experiments. The number on the left side of the slash denotes the training proportion of images, while the number on the right denotes the testing proportion.

We introduce a new, open human rights understanding dataset, HRUN, designed
to represent human rights and international human-
itarian law violations found in the real world. Using
this innovative dataset, we conduct an evaluation of recent deep learning architectures for human rights violations recognition and achieve results that are comparable to prior attempts on other long-standing hallmark tasks of computer vision, in the hope that it will provide a scaffold for future evaluations and a good benchmark for human rights advocacy research. The following conclusions can be drawn. Digital images that can be rated as appropriate for human rights monitoring purposes are rare, and characterising them requires great effort, expertise and considerable time. Utilis-
ing transfer learning for the task of recognising hu-
man rights violations can provide very strong results
by employing a straightforward combination of deep
representations and a linear SVM. Deep convolutional
neural networks are constructed to benefit and learn
from massive amounts of data. For this reason, training a deep convolutional network from scratch on an expanded version of the HRUN dataset is likely to yield even higher-quality recognition results. Inspired by the
high-standard characteristics of legal evidence, in the
future we would like to have the means to clarify three
different questions set by every human rights monitor-
ing mechanism: what, who and how, and expand our
dataset to a wider range of categories in order to in-
clude them. We also presume that further analysis of
joint object recognition and scene understanding will
be beneficial and lead to improvements in both tasks
for human rights violations understanding.
ACKNOWLEDGEMENTS
We acknowledge MoD/Dstl and EPSRC for providing the grant to support the UK academic's (Ales Leonardis) involvement in a Department of Defense funded MURI project. This work was also sup-
ported in part by EU H2020 RoMaNS 645582, EP-
SRC EP/M026477/1 and ES/M010236/1.
REFERENCES
Bell, S., Upchurch, P., Snavely, N., and Bala, K. (2015).
Material recognition in the wild with the materials in
context database. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 3479–3487.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2014). Return of the devil in the details: Delv-
ing deep into convolutional nets. In British Machine
Vision Conference, BMVC 2014, Nottingham, UK,
September 1-5, 2014.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages
248–255. IEEE.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N.,
Tzeng, E., and Darrell, T. (2014). Decaf: A deep con-
volutional activation feature for generic visual recog-
nition. In ICML, pages 647–655.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The pascal visual object
classes (voc) challenge. International journal of com-
puter vision, 88(2):303–338.
Fei-Fei, L., Fergus, R., and Perona, P. (2007). Learning gen-
erative visual models from few training examples: An
incremental bayesian approach tested on 101 object
categories. Computer Vision and Image Understand-
ing, 106(1):59–70.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256
object category dataset.
He, K., Zhang, X., Ren, S., and Sun, J. (2015a). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
He, K., Zhang, X., Ren, S., and Sun, J. (2015b). Delving
deep into rectifiers: Surpassing human-level perfor-
mance on imagenet classification. In Proceedings of
the IEEE International Conference on Computer Vi-
sion, pages 1026–1034.
Huang, Y., Huang, K., Yu, Y., and Tan, T. (2011). Salient
coding for image classification. In Computer Vision
and Pattern Recognition (CVPR), 2011 IEEE Confer-
ence on, pages 1753–1760. IEEE.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S., and Darrell, T. (2014).
Caffe: Convolutional architecture for fast feature em-
bedding. In Proceedings of the 22nd ACM inter-
national conference on Multimedia, pages 675–678.
ACM.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learn-
ing. Nature, 521(7553):436–444.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
R. E., Hubbard, W., and Jackel, L. D. (1989). Back-
propagation applied to handwritten zip code recogni-
tion. Neural computation, 1(4):541–551.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean Conference on Computer Vision, pages 740–755.
Springer.
Liu, C., Sharan, L., Adelson, E. H., and Rosenholtz, R.
(2010). Exploring features in a bayesian framework
for material recognition. In Computer Vision and Pat-
tern Recognition (CVPR), 2010 IEEE Conference on,
pages 239–246. IEEE.
Nair, V. and Hinton, G. E. (2010). Rectified linear units
improve restricted boltzmann machines. In Proceed-
ings of the 27th International Conference on Machine
Learning (ICML-10), pages 807–814.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014).
Learning and transferring mid-level image represen-
tations using convolutional neural networks. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 1717–1724.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus,
R., and LeCun, Y. (2013). Overfeat: Integrated recog-
nition, localization and detection using convolutional
networks. CoRR, abs/1312.6229.
Sharan, L., Rosenholtz, R., and Adelson, E. (2009). Mate-
rial perception: What can you see in a brief glance?
Journal of Vision, 9(8):784–784.
Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carls-
son, S. (2014). Cnn features off-the-shelf: an as-
tounding baseline for recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 806–813.
Simonyan, K. and Zisserman, A. (2014a). Two-stream con-
volutional networks for action recognition in videos.
In Advances in Neural Information Processing Sys-
tems, pages 568–576.
Simonyan, K. and Zisserman, A. (2014b). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: a simple way
to prevent neural networks from overfitting. Journal
of Machine Learning Research, 15(1):1929–1958.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9.
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
Deepface: Closing the gap to human-level perfor-
mance in face verification. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 1701–1708.
Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler,
C. (2015). Efficient object localization using convo-
lutional networks. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 648–656.
Torralba, A., Fergus, R., and Freeman, W. T. (2008). 80
million tiny images: A large data set for nonpara-
metric object and scene recognition. IEEE transac-
tions on pattern analysis and machine intelligence,
30(11):1958–1970.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong,
Y. (2010). Locality-constrained linear coding for
image classification. In Computer Vision and Pat-
tern Recognition (CVPR), 2010 IEEE Conference on,
pages 3360–3367. IEEE.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba,
A. (2010). Sun database: Large-scale scene recogni-
tion from abbey to zoo. In Computer vision and pat-
tern recognition (CVPR), 2010 IEEE conference on,
pages 3485–3492. IEEE.
Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. (2015).
LSUN: construction of a large-scale image dataset us-
ing deep learning with humans in the loop. CoRR,
abs/1506.03365.
Zeiler, M. D. and Fergus, R. (2014). Visualizing and under-
standing convolutional networks. In European Con-
ference on Computer Vision, pages 818–833. Springer.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva,
A. (2014). Learning deep features for scene recog-
nition using places database. In Advances in neural
information processing systems, pages 487–495.