Evaluation of 3D Vision Systems for Detection of Small Objects in
Agricultural Environments
Justin Le Louedec¹, Bo Li² and Grzegorz Cielniak¹ (https://orcid.org/0000-0002-6299-8465)
¹Lincoln Centre for Autonomous Systems, University of Lincoln, U.K.
²University of West England, U.K.
Keywords: Machine Vision for Agriculture, Machine Learning, 3D Sensing, 3D Vision.
Abstract: 3D information provides unique cues about shape, localisation and relations between objects, not found
in standard 2D images. Such information would be very beneficial in a large number of applications in agri-
culture such as fruit picking, yield monitoring, forecasting and phenotyping. In this paper, we conducted a
study on the application of modern 3D sensing technology together with the state-of-the-art machine learning
algorithms for segmentation and detection of strawberries growing in real farms. We evaluate the performance
of two state-of-the-art 3D sensing technologies and showcase the differences between 2D and 3D networks
trained on the images and point clouds of strawberry plants and fruit. Our study highlights limitations of the
current 3D vision systems for detection of small objects in outdoor applications and sets out foundations for
future work on 3D perception for challenging outdoor applications such as agriculture.
1 INTRODUCTION
Thanks to recent advances in 3D sensing technology
and rapidly growing data-driven algorithms, 3D
vision has attracted considerable attention in recent
years. Compared to 2D images, 3D information pro-
vides additional depth cues critical for estimating pre-
cise location and assessing shape properties of var-
ious objects in the environment. So far, the main
focus in the 3D vision community has been centred
around benchmark datasets captured in controlled en-
vironments with large and rigid objects and fairly sta-
ble lighting conditions (e.g. (Dai et al., 2017; Armeni
et al., 2017)). There is, however, a strong case for
deploying such systems in real-life scenarios and cur-
rently it is not clear how well the current state of the
art in 3D vision translates into the challenging situ-
ations posed by applications such as in agriculture.
As part of our research project, we propose a study
on the application of modern 3D sensing technology
together with the state-of-the-art machine learning al-
gorithms for segmentation and detection of strawber-
ries growing in real farms. The precise information
about strawberry fruit location and shape description
have a range of applications such as yield monitor-
ing and forecasting, phenotyping or robotic picking.
The challenges posed by such a scenario include vari-
able lighting conditions, reflections, occlusions, non-
rigid structure of the strawberry plants and relatively
small size of the fruit. Since the current 3D sens-
ing technology has not been deployed widely in such
scenarios and most of the modern machine learning
algorithms were designed and trained specifically for
large and rigid objects, our study aims to assess the
usefulness of the sensing and learning methodology
for the proposed application.
In particular, our paper provides the following contributions:
- assessment of two competing 3D sensing technologies (i.e. projected stereo IR pattern and Time-of-Flight sensors) for the problem of detection of small objects (i.e. fruit) in outdoor environments;
- assessment of the current state of the art in 3D machine learning and of the modifications required to enable its use for the proposed application;
- comparison of the accuracy of 2D image-based and full 3D approaches;
- validation of the sensing and learning pipeline on data collected from real strawberry farms.
This paper is organised as follows: we start with a
brief overview of the related work in 3D sensing and
vision followed by the description of our methodol-
ogy (Sec. 3) and its experimental evaluation in Sec-
tion 4. The paper is concluded in Section 5.
2 BACKGROUND
2.1 3D Vision in Agriculture
In agricultural applications, 3D information can pro-
vide important object characteristics such as crop
size, shape or location. The most common approach
to recognise such objects is based on a combination
of 2D images for crop segmentation and detection
and 3D information for augmenting the shape and
location information. For example, (Lehnert et al.,
2018; Lehnert et al., 2017) describe a perception sys-
tem for harvesting sweet peppers. After scanning and
reconstructing the scene using a robotic arm, they
use colour information in the point cloud to detect
the pepper. They then use the 3D projection of the
segmented peduncle to estimate a pose and the opti-
mal grasping/cutting point. (Barnea et al., 2016) also
present a perception system for pepper harvesting but
use a colour-agnostic method to detect fruits using
combined colour and depth information (i.e. pro-
vided by RGBD cameras). By using highlights in the
image, and 3D plane-reflective symmetries, their sys-
tem is able to detect pepper fruit based on shape, in
heavily occluded situations. (Yoshida et al., 2018)
use RGBD cameras and a two-resolution voxelisa-
tion of the 3D space to detect tomato peduncles and
the optimal cutting point for harvesting. They first
identify regions corresponding to tomatoes at the coarse
resolution and then apply the dense voxelisation to the selected regions
to establish the optimal cutting points on the pedun-
cle. The DeepFruit system (Bargoti and Underwood,
2016) uses Deep Learning networks for the detection
of fruits in images taken in orchards. The sys-
tem achieves good detection accuracy but, by only re-
gressing bounding boxes over instances, it does not
consider spatial information and is therefore prone to
missed detections when fruits overlap.
2.2 Deep Learning for 3D Information
3D information is represented by so-called point
clouds - a collection of unconnected and unordered
points in 3D space which originate from sensors such
as 3D laser range finders or Time-of-Flight cameras
(i.e. RGBD cameras) and can be augmented by ad-
ditional information such as colour, reflectivity, etc.
The core of machine learning methods using deep net-
works was developed for standard images and is therefore
based on 2D convolutions. These can be realised ef-
ficiently in 2D but pose problems in 3D. First of all,
the convolution operation requires a discretisation of
space into so-called voxels, which is typically associ-
ated with some loss of information and large memory
requirements. Secondly, 3D convolution operations
are computationally expensive which renders them
unusable in most of the real-life scenarios. Therefore,
there is a big interest in developing methods which
can cope with these challenges. Point cloud pro-
cessing using Deep Learning has been revolutionised
by PointNet (Qi et al., 2017a) and subsequently by
its improved variant PointNet++ (Qi et al., 2017b).
The PointNet architecture can be directly applied to
a point cloud, through a prior segmentation/grouping
of points in space using clustering algorithms such as
K-Nearest Neighbours (KNN) or ball query. Convo-
lutions are not applied to the organised representation
of space, but rather to the features extracted from the
clustered regions, which can be performed efficiently.
PointNet++ uses a multi-scale approach for par-
titioning point clouds and relies on two main layer types
used for encoding and decoding information. The first
layer (Set Abstraction (SA)) extracts features from
points by considering their neighbourhood defined by
a radius. The second layer (Feature Propagation (FP))
interpolates the features and learns to decode them into
the dimension of the previous SA layer, up until the
same size as the input point cloud. For the classifica-
tion task, the latent representation of the point cloud
obtained after the succession of SA layers is
passed through a multi-layer perceptron to predict the
class of the point cloud. For segmentation
and other tasks requiring features associated with each
point of the point cloud, each SA layer is associated
with an FP layer in charge of decoding the resulting
features up to the input size. A multi-layer perceptron
is then used to predict the per-point class.
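To make the SA stage more concrete, the following is a minimal, illustrative sketch of a single Set Abstraction step in Python/NumPy. It is not the PointNet++ reference implementation: random sampling stands in for farthest point sampling, the shared MLP is reduced to a single linear layer, and the function and parameter names are ours.

```python
import numpy as np

def set_abstraction(points, feats, n_centroids, radius, mlp_weights):
    """Simplified Set Abstraction (SA) step: sample centroids, group
    neighbours with a ball query, apply a shared MLP and max-pool.

    points: (N, 3) xyz coordinates; feats: (N, C) per-point features;
    mlp_weights: (3 + C, D) weights of a single shared linear layer.
    """
    # 1. Sample centroids (random here; PointNet++ uses farthest point sampling).
    idx = np.random.choice(len(points), n_centroids, replace=False)
    centroids = points[idx]

    new_feats = []
    for c in centroids:
        # 2. Ball query: gather all points within `radius` of the centroid.
        mask = np.linalg.norm(points - c, axis=1) < radius
        local = np.concatenate([points[mask] - c, feats[mask]], axis=1)
        # 3. Shared "MLP" (single linear layer + ReLU) applied per neighbour.
        h = np.maximum(local @ mlp_weights, 0.0)
        # 4. Max-pool over the neighbourhood: a permutation-invariant feature.
        new_feats.append(h.max(axis=0))
    return centroids, np.stack(new_feats)
```

The FP stage then interpolates these centroid features back onto the denser point set (typically with inverse-distance weighting over the nearest centroids) before a shared MLP refines them.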
The basic PointNet architecture has been used
to develop further improvements. For example,
PointSIFT (Jiang et al., 2018) uses SIFT-like fea-
tures extracted from the immediate neighbourhood of
a point in 8 directions. On the other hand, PointCNN
(Li et al., 2018) applies a convolution-like opera-
tor on points grouped and transformed based on pre-
defined properties, ordering points and transforming
their features into a latent and canonical space, fol-
lowed by a standard convolution operation. However,
most of the described improvements, whilst increas-
ing the network’s discriminatory abilities, suffer from
higher computational demands and do not scale well
to real-time applications in realistic scenarios.
2.3 3D Datasets
The majority of the existing 3D machine learning al-
gorithms were validated on datasets acquired in in-
door environments. The examples of such datasets in-
clude ScanNet (Dai et al., 2017) with scans obtained
using surface reconstruction and crowd-sourced la-
belling for annotation. Another example includes
Stanford 2D-3D-Semantics Dataset (Armeni et al.,
2017), which was created with an RGBD camera. The
S3DIS offers a very complete scene description, ad-
ditional colour information, depth, 2D segmentation,
meshes and 3D segmentation with extracted normals.
There is a very limited number of publicly available
3D datasets of outdoor spaces. Notable examples in-
clude semantic3D.net (Hackel et al., 2017) which is a
very large scale dataset (4 billion points per scene)
of various outdoor locations. Another example in-
cludes the KITTI (Geiger et al., 2012) dataset, cap-
tured aboard a driving car in various streets, providing
depth and colour information with annotated bound-
ing boxes and instance segmentation. All the above-
mentioned datasets feature large, rigid objects which
are not typical for agricultural environments.
3 METHODOLOGY
3.1 3D Sensing
3D sensing is based on capturing the depth or distance
from the camera to each point in the scene. Capturing
devices for outdoor use can be divided into three main
categories: stereo cameras, Time-of-Flight (ToF) de-
vices and Lidar range finders. Stereo sensors are
based on capturing two images from two image sen-
sors apart from each other and matching their features
to create a depth map based on epipolar lines between
the two sensors. In the case of wrongly matched
points or a lack of similarity, surfaces reconstructed in
this way are often distorted or flat, with blended edges
and objects. This is especially evident with very small
objects. Alternative sensing solutions use light wave-
lengths outside of the visible spectrum (e.g. infra-
red), which are less sensitive to changing lighting condi-
tions and allow more robust matching of points. Examples of
commercial stereo cameras used in research are ZED
cameras and Intel RealSense (Halmetschlager-Funek
et al., 2018). Time-of-Flight devices
are based on light beams which are projected
into the scene and reflected back to the sensor. The
depth is estimated from the time taken for the light to
come back. This technology results in more precise
depth measurements, but is more prone to noise caused
by reflective objects. The Microsoft Kinect One (i.e.
v2) (Halmetschlager-Funek et al., 2018)
and the Pico Zense (Noraky and Sze, 2018) are good
examples of recent innovations in this technology. Li-
dar is a particular example where the beam of light is
replaced by a laser pulsed at the scene. We do not con-
sider Lidar technology in our work, however, since its
intrinsic properties and resolution are not suited for
the detection of small objects in occluded scenarios
such as strawberries. (Kazmi et al., 2014) offers a
comprehensive study of both sensors applied in dif-
ferent situations.
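For reference, the depth Z recovered by a calibrated stereo pair is related to the measured disparity d, the focal length f and the baseline B by the standard triangulation relation (a textbook identity, not a figure from either sensor's datasheet):

Z = \frac{f B}{d}, \qquad \Delta Z \approx \frac{Z^2}{f B}\,\Delta d,

so any error in the disparity estimate translates directly into a depth error that grows with distance; this is consistent with the distorted strawberry shapes we observe when stereo matching fails (Sec. 4.3.1).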
In this paper, we compare stereo and Time-of-
Flight sensing technologies based on their perfor-
mance in sensing of strawberries in their natural
growing conditions. The two selected cameras were
the Intel Realsense D435 (IR stereo) and the Pico
Zense (ToF). The detailed experimental comparison
of these two technologies on 3D data of strawberries
collected from their natural growing environment is
presented in Section 4.3.
3.2 3D Vision
In our study, we are interested in the feasibility of
modern 3D sensing and machine learning for a prob-
lem of detecting strawberries in their natural environ-
ment. To this end, we select a reliable, robust and
popular (see recent applications (Wang et al., 2017;
Pham et al., 2019; Yang et al., 2019; Wang et al.,
2019)) deep learning architecture PointNet++ (Qi
et al., 2017b). The PointNet++ provides segmenta-
tion results, i.e. per-point classification which, if suc-
cessful, would enable instance detection of individual
strawberries. In this work, however, we assess both
sensing technologies on the segmentation problem only.
3.2.1 Data Pre-processing
Before the 3D data can be used by the network, it
needs to be pre-processed so that the point clouds pro-
vided by both types of sensors are of similar char-
acteristics. The point clouds generated by the sen-
sors are already augmented with registered colour in-
formation (RGB). The sensors generate point clouds
of different density and size: 920K points for
the stereo camera and 230K points for the ToF de-
vice. Since the sensors are placed at a similar distance
to the strawberry plants (i.e. 60 cm), the point clouds
can be downsampled to match the spatial resolution of
around 3 mm, which is a limiting factor constrained
by the depth resolution of both sensors. This results
in point clouds of 25K points, which are still too
large for the PointNet architecture and therefore we
partition them into smaller subsets following a proce-
dure employed in (Qi et al., 2017b). The procedure
slides a box over the point cloud and uses the K-
Nearest Neighbours (KNN) algorithm to guarantee the
same input size (8000 points), allowing us to train
the algorithms using mini-batches. During the predic-
tion phase, we use a maximum-vote strategy for points
belonging to multiple blocks; a sketch of this procedure
is given below.
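The sketch below illustrates the sliding-box partitioning and maximum-vote merging described above in Python/NumPy. The box size is an assumption for illustration (only the 8000-point block size is taken from the text), and the helper names are ours, not those of the original PointNet++ code.

```python
import numpy as np

def partition_into_blocks(points, box_size=0.2, n_points=8000):
    """Slide a box over the (x, y) extent of the cloud and, for each box
    centre, keep the n_points nearest neighbours so that every block has
    the same size and can be used in a mini-batch."""
    blocks = []
    lo, hi = points.min(axis=0), points.max(axis=0)
    for x in np.arange(lo[0], hi[0], box_size):
        for y in np.arange(lo[1], hi[1], box_size):
            centre = np.array([x + box_size / 2, y + box_size / 2,
                               (lo[2] + hi[2]) / 2])
            dist = np.linalg.norm(points - centre, axis=1)
            blocks.append(np.argsort(dist)[:n_points])  # point indices
    return blocks

def merge_block_predictions(n_total, blocks, block_labels, n_classes=2):
    """Maximum-vote merging for points that fall into several blocks."""
    votes = np.zeros((n_total, n_classes), dtype=int)
    for idx, labels in zip(blocks, block_labels):
        votes[idx, labels] += 1  # one vote per block containing the point
    return votes.argmax(axis=1)
```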
3.2.2 PointNet++
The original implementation of PointNet++ requires
modifications to make it suitable for the scale and res-
olution of our problem. PointNet++ offers two ap-
proaches for segmentation: the multi-scale approach
with one-hot encoding for the classification and a sin-
gle scale of grouping with an increased number of
layers and complexity of the feature space. The ini-
tial experimentation on a subset of our data iden-
tified minimal differences in accuracy provided by
these two methods. The second one, however, requires
significantly less execution time and memory,
and was therefore selected for further study.
Table 1: Configuration of an encoder used in our Point-
Net++ implementation.
Layer Type #points radius mlps
SA 4096 0.1 [16, 16, 32]
SA 2048 0.1 [32,32,32]
SA 1024 0.1 [32,32,64]
SA 256 0.2 [64,64,128]
SA 64 0.4 [128,128,256]
SA 16 0.8 [256,256,512]
Considering the size of our inputs and the size
of strawberries, we augmented the number of layers
in the encoding part of the network, adding their
counterparts in the decoder. This provides several ben-
efits. Firstly, the point cloud is subdivided progres-
sively into broader versions, with fine-scale features
learnt at the start of the training and global features
learnt towards the end, which compensates for the
lack of an intra-layer multi-scale component. Sec-
ondly, we adapted the ball-query radius to the selected
density/resolution of the points. This second modification is
directly linked to the network's output and the loss func-
tion used. In our implementation, we decided to pre-
dict the class of each point (fruit/background) rather
than using specific classes (ripe fruit, unripe fruit and
background). This enabled better convergence of
the learning process and a stronger emphasis on
shape features rather than colour. Following the nota-
tion introduced in (Qi et al., 2017b), our implemen-
tation features a topology of the encoder and decoder
as summarised in Tables 1 and 2, which list the
number of points per layer (#points), the radius for the
ball-query (radius), the feature length (features) and the con-
figuration of the multi-layer perceptrons (mlps).
Table 2: Configuration of a decoder used in our PointNet++
implementation.
Layer Type features
FP 256,256
FP 256,256
FP 256,128
FP 128,128,128
FP 128,128,64
FP 128,128,64
MLP [64,128]
MLP [128,2]
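For reference, the topology of Tables 1 and 2 can be written as the kind of configuration a PointNet++ implementation typically consumes; the variable names below are ours and purely illustrative.

```python
# Encoder: (number of sampled points, ball-query radius, MLP widths) per SA layer.
SA_LAYERS = [
    (4096, 0.1, [16, 16, 32]),
    (2048, 0.1, [32, 32, 32]),
    (1024, 0.1, [32, 32, 64]),
    (256,  0.2, [64, 64, 128]),
    (64,   0.4, [128, 128, 256]),
    (16,   0.8, [256, 256, 512]),
]

# Decoder: MLP widths for each FP layer, followed by the final per-point head.
FP_LAYERS = [
    [256, 256], [256, 256], [256, 128],
    [128, 128, 128], [128, 128, 64], [128, 128, 64],
]
HEAD_MLPS = [[64, 128], [128, 2]]  # two classes: fruit / background
```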
3.2.3 2D vs 3D Segmentation
To evaluate the usefulness of the 3D information for
segmentation of small objects, we also select a stan-
dard 2D image-based architecture for our compar-
isons. Convolutional Neural Networks (CNNs)
have proven to be very effective for object/scene
segmentation tasks, although they lack depth informa-
tion. For this purpose, we select a state-of-the-art
popular network architecture called SegNet (Badri-
narayanan et al., 2015), which is very similar to Point-
Net++. SegNet is a feed-forward network using, sim-
ilarly to PointNet++, the auto-encoder principle, en-
coding the feature space down to a specific size (e.g.
512) before decoding it back to the original size of
the input image. For each pixel, we can either pre-
dict a score for each class (strawberry/background)
followed by a softmax function to get the predicted
class, or predict a probability for each class. Each of
the convolutions is followed by a normalisation and
each convolution block by a max-pooling operation.
The decoder uses a max-unpool layer as the upsampling
step, followed by transposed convolutions; a minimal sketch
of this pattern is given below.
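The following is a minimal PyTorch sketch of the encoder-decoder pattern just described, reduced to a single encoder/decoder stage; the real SegNet stacks several VGG-style blocks and the channel sizes here are illustrative only.

```python
import torch.nn as nn

class MiniSegNetBlock(nn.Module):
    """One encoder stage (conv + batch norm + max-pool with indices) and the
    matching decoder stage (max-unpool + transposed conv), as described above."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
        self.pool = nn.MaxPool2d(2, return_indices=True)  # keep indices for unpooling
        self.unpool = nn.MaxUnpool2d(2)                   # upsampling step
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, n_classes, 1))                  # per-pixel class scores

    def forward(self, x):
        x = self.enc(x)
        x, idx = self.pool(x)
        x = self.unpool(x, idx)
        # Apply a softmax over dim=1 to turn the scores into class probabilities.
        return self.dec(x)
```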
4 EXPERIMENTS
4.1 Data Collection
To support the main goal of our application, which
is applying a modern 3D vision system for the detec-
tion and localisation of strawberry fruit, we collected
a dataset from the real environment. To that end, we
have deployed our data acquisition system at a mini
Figure 1: The strawberry farm, with a robot roaming in the
tabletop rows collecting data (left). The sensor set-up used
for data collection (right).
version of a real strawberry farm, located at Rise-
holme campus of the University of Lincoln. The farm
features two polytunnels of 6 tabletop rows, each 24 metres
long, with an industrial variety of strawberries (ever-
bearer Driscolls Amesti) as depicted in Fig. 1. The
data capture setup featuring the Realsense and Pico
Zense sensors was mounted on an agricultural robot
Thorvald (Grimstad and From, 2017). The robot au-
tonomously navigated the polytunnel rows, stopping
every 20 cm to collect a snapshot from both views
(see Figure 1). The capturing session took place in
October 2019 and resulted in colour images and point
clouds representing different growth stages of plants
and fruit. The datasets were then manually anno-
tated to indicate the locations of strawberry fruits, result-
ing in 139 labelled point clouds with around 1900 in-
stances of ripe strawberries for the ToF data and 64 point
clouds with around 1000 instances for the stereo data (see
Table 3).
Table 3: The summary of datasets collected.
sensor stereo ToF
# point clouds 64 139
resolution 1280x720 1280x720
range 20cm-65m 20cm-70cm
# instances 1000 1900
% strawberry points 6% 6%
4.2 Evaluation Methodology
To evaluate our trained models, we use standard se-
mantic segmentation metrics such as Accuracy and
mean Intersection over Union (mean IoU) and also
the Kappa-Cohen coefficient (Cohen, 1960), which is particularly
useful when the number of class instances is un-
balanced - in our case, background represents the majority
of points when compared to strawberries.
Accuracy is the most commonly used metric for the major-
ity of machine learning systems and measures how ac-
curate the prediction is compared to the ground truth,
without taking into consideration the balance of classes
and positives/negatives:
Acc = \frac{TP + TN}{P + N} .  (1)
Mean IoU is the overlap between the output predicted by
the algorithm and the ground truth, averaged over all
classes and samples:
IoU = \frac{TP}{TP + FP + FN} .  (2)
Kappa-Cohen coefficient is particularly useful for un-
balanced data, where one class is more represented
than the others. This measure provides a better as-
sessment of the real discriminatory power of the clas-
sifier and takes observed and expected accuracies into
account:
K = \frac{Acc_{obs} - Acc_{exp}}{1 - Acc_{exp}} .  (3)
The observed accuracy Acc_{obs} is the proportion of points
correctly classified in the point cloud, and the ex-
pected accuracy Acc_{exp} is what a random classifier would
be expected to achieve on the point cloud.
We also use precision-recall curves to evaluate the
performance of the trained classifiers as in (Evering-
ham et al., 2010). Precision represents the ratio
TP/(TP + FP), whilst recall is the ratio TP/(TP + FN). The preci-
sion and recall values are computed over a range of
confidence score thresholds of the classifier.
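As a sanity check of the definitions above, the metrics can be computed from binary per-point predictions as in the sketch below. This is a straightforward NumPy illustration of Eqs. (1)-(3), not the authors' evaluation code; the IoU shown is for the strawberry class only, whereas mean IoU averages it over classes and samples.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: arrays of 0/1 labels (1 = strawberry, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn

    acc = (tp + tn) / n                      # Eq. (1)
    iou = tp / (tp + fp + fn)                # Eq. (2), strawberry class
    # Expected accuracy of a chance classifier with the same class marginals.
    acc_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - acc_exp) / (1 - acc_exp)  # Eq. (3)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dict(accuracy=acc, iou=iou, kappa=kappa,
                precision=precision, recall=recall)
```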
4.3 Results
4.3.1 Quality of Acquisition
We compare the data acquired using Intel Realsense
D435 and Pico Zense cameras which both pro-
vide RGBD information. We do not use the post-
processing filters offered by the Realsense device, as
they proved unreliable outdoors with sen-
sitive light and exposure settings. The only built-in
enhancement enabled is the spatial filter which is also
used in the ToF sensor.
The following qualitative assessment is based on
observing the depth maps captured as shown in Fig-
ure 2. The stereo camera works in the infrared spec-
trum (third image from the top in Figure 2); the depth
obtained is very sensitive to variations in exposure
and to the stereo matching between the two infrared images
captured. The stereo sensor also captures visible in-
formation as far away as possible, reducing the amount of
detailed information for small objects and their sur-
faces. Overall, the data captured using the stereo cam-
era provides inferior depth information compared to
the ToF, and shape information for strawberries is de-
graded. This is mostly due to the lack of features for
stereo-matching between the two IR cameras. One
can also notice the absence of the IR pattern projected
by the camera. This pattern is supposed to improve
depth with more reliable features to match between
images, but it is here completely dispersed by the
sun's natural infrared spectrum.
Figure 2: Comparison of depth captured using Stereo and Time-of-Flight technologies.
The Time-of-Flight camera, on the other hand, can be programmed to cap-
ture information only in a given range depending on
the application. In our data collection, the camera was
set to the near-range setting, improving the quality of depth
for short ranges (three ranges are available: Near, Mid,
Far), which suits our application, since the interesting
information is found up to 70 cm away from the cam-
era. Despite this restriction, the quality of the ToF depth maps
is visually better, and this sensor offers better coverage
of the depth information. The only limitation comes
from the light spectrum used, which corresponds to
some part of the solar infrared spectrum.
Figure 3: Precision-recall curves for the different networks, indicating also the area under the curve (AUC).
Flat surfaces, which are the most reflective, appear slightly
deformed and, in some cases, with deformations at a
scale matching small strawberries. These qualitative
findings suggest that Time-of-Flight technology is better
suited to outdoor environments and is especially bene-
ficial for the detection and shape analysis of small
objects such as fruits. Concerning the quality of the
RGB images from both cameras, the stereo camera
provides clearer and higher quality images overall in
an outdoor context.
Figure 4: The segmentation results for PNet_colour trained on
data from the stereo camera: the original point cloud (top),
segmentation results (bottom). The colours indicate: TP in
green, FP in orange, FN in purple and TN in black.
Table 4: Comparison of different networks trained on data for stereo and ToF sensors.
Model Camera Dimension Accuracy Kappa Cohen mean IoU
SegNet stereo 2D 98.5% 0.71 0.83
SegNet ToF 2D 99.0% 0.79 0.67
PNet stereo 3D 91.1% 0.43 0.14
PNet ToF 3D 90.0% 0.38 0.15
PNet_colour stereo 3D 95.4% 0.66 0.48
PNet_colour ToF 3D 92.5% 0.54 0.39
Figure 5: The segmentation results for PNet_colour trained
on data from the ToF camera: the original point cloud (top),
segmentation results (bottom). The colours indicate: TP in
green, FP in orange, FN in purple and TN in black.
4.3.2 Network Comparison
To compare both sensors directly, we use the cap-
tured data for training the selected machine learning
networks (see Sec. 3) for each sensor separately. In
addition, we use the following configurations of the
networks and type of input data: SegNet on colour
images, PointNet++ on 3D point clouds only (PNet)
and PointNet++ with additional colour information
(PNet_colour). The results are presented in Table 4,
whilst the precision-recall curves are shown in Figure 3. There is
a clear difference between networks trained with and
without colour information and between networks
trained on the ToF and stereo datasets, with colour
significantly improving performance for both datasets. This can be
explained by the greater separation between strawberries
and background in colour space than in shape alone.
The difference in results for dif-
ferent sensors seems to be amplified when colour in-
formation is used and likely comes from the unre-
liable readings from large surfaces and shapes with
the ToF camera, leading to many false positives (FP). The
precision-recall curves confirm these findings. The
high number of FPs negatively affects the net-
works trained with 3D information only, but the results improve
significantly for networks trained with colour. With a
very low area under the precision-recall curve, PNet
is the worst-performing classifier on our datasets, well
below PNet_colour. The 2D network SegNet performs sig-
nificantly better than any of the 3D variants for both
datasets. To illustrate these findings, we provide ex-
ample outputs from PNet_colour illustrating the quality
of segmentation for both sensors. The stereo dataset
(Fig. 4) is characterised by more omissions of straw-
berries, with a high number of FN but fewer false detec-
tions. The main red and distinct strawberries are, how-
ever, well segmented. In the ToF example (Fig. 5),
the strawberries are mostly segmented out and there
are few omissions. However, there is a high number
of false positives, especially on round, brown/red leaves
in the scene. These examples pro-
vide additional support for findings based on the nu-
merical values in Table 4.
The superior performance of detectors based on
2D CNNs for our application can be associated with
the structured nature of 2D images, maturity of the
developed networks and also low quality of the depth
data. Also, through post-processing, sufficient spa-
tial information can be retrieved using the depth map
and the 2D segmentation mask, making these algo-
rithms preferable for our application for the time being.
The presented 3D approaches, however, offer an ad-
vantage in direct localisation of the objects, although
their localisation accuracy is a subject of future work.
The real-time suitability of the 3D methods is also
promising, achieving 5 FPS (each frame a point cloud
of 64k points) compared to 13 FPS (each frame an
image of 1280 × 720 px) for SegNet.
5 CONCLUSIONS
Capturing 3D data and processing it for different tasks
such as detection, segmentation or classification is
challenging, especially in the agricultural con-
text presented in this paper. Our study evaluated two
3D sensing technologies for that purpose and com-
pared 3D and 2D variants of state-of-the-art neural
networks trained on the data collected from a straw-
berry growing farm. These results show encouraging
performance but also allow us to highlight the limita-
tions of current technologies and algorithms. Time-
of-Flight technology, despite its superior quality of
point clouds and shape information, struggles with re-
flective surfaces resulting in a large number of false
detections, while stereo technology, lacking detail in
acquired depth, fails to detect numerous fruits. Tradi-
tional 2D image-based convolutional neural networks
still outperform the 3D networks for the task of fruit
segmentation and therefore are more suited for this
task. This work can be treated as a baseline for fu-
ture work on 3D information for outdoor applications
such as robotic fruit picking and should encourage re-
searchers to pursue further experimentation to counteract
the limitations identified in this paper and to
bridge the gap with state-of-the-art techniques in
2D perception.
REFERENCES
Armeni, I., Sax, A., Zamir, A. R., and Savarese, S. (2017).
Joint 2D-3D-Semantic Data for Indoor Scene Under-
standing. ArXiv e-prints.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2015).
SegNet: A Deep Convolutional Encoder-Decoder Ar-
chitecture for Image Segmentation. arXiv e-prints.
Bargoti, S. and Underwood, J. (2016). Deep Fruit Detection
in Orchards. arXiv e-prints.
Barnea, E., Mairon, R., and Ben-Shahar, O. (2016). Colour-
agnostic shape-based 3d fruit detection for crop har-
vesting robots. Biosystems Engineering, 146:57 – 70.
Special Issue: Advances in Robotic Agriculture for
Crops.
Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and Psychological Measurement,
20(1):37–46.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser,
T., and Nießner, M. (2017). ScanNet: Richly-
annotated 3D Reconstructions of Indoor Scenes.
In Proc. Computer Vision and Pattern Recognition
(CVPR), IEEE.
Everingham, M., Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The Pascal Visual Ob-
ject Classes (VOC) Challenge. Int. J. Comput. Vision,
88(2):303–338.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for Autonomous Driving? The KITTI Vision Bench-
mark Suite. In Conference on Computer Vision and
Pattern Recognition (CVPR).
Halmetschlager-Funek, G., Suchi, M., Kampel, M., and
Vincze, M. (2018). An empirical evaluation of ten
depth cameras. IEEE Robotics and Automation Mag-
azine.
Grimstad, L. and From, P. J. (2017). The Thorvald II Agri-
cultural Robotic System. Robotics, 6(4).
Hackel, T., Savinov, N., Ladicky, L., Wegner, J. D.,
Schindler, K., and Pollefeys, M. (2017). SEMAN-
TIC3D.NET: A new large-scale point cloud classifica-
tion benchmark. In ISPRS Annals of the Photogram-
metry, Remote Sensing and Spatial Information Sci-
ences, volume IV-1-W1, pages 91–98.
Jiang, M., Wu, Y., Zhao, T., Zhao, Z., and Lu, C. (2018).
PointSIFT: A SIFT-like Network Module for 3D Point
Cloud Semantic Segmentation. arXiv e-prints.
Kazmi, W., Foix, S., Alenyà, G., and Andersen, H. J. (2014).
Indoor and outdoor depth imaging of leaves with time-
of-flight and stereo vision sensors: Analysis and com-
parison. ISPRS Journal of Photogrammetry and Re-
mote Sensing, 88:128–146.
Lehnert, C., English, A., McCool, C., Tow, A., and Perez,
T. (2017). Autonomous Sweet Pepper Harvesting for
Protected Cropping Systems. arXiv e-prints.
Lehnert, C., McCool, C., Sa, I., and Perez, T. (2018). A
Sweet Pepper Harvesting Robot for Protected Crop-
ping Environments. arXiv e-prints.
Li, Y., Bu, R., Sun, M., and Chen, B. (2018). PointCNN:
Convolution On X-Transformed Points. arXiv
preprint arXiv:1801.07791.
Noraky, J. and Sze, V. (2018). Low Power Depth Estimation
of Rigid Objects for Time-of-Flight Imaging. arXiv e-
prints.
Pham, Q.-H., Thanh Nguyen, D., Hua, B.-S., Roig, G.,
and Yeung, S.-K. (2019). JSIS3D: Joint Semantic-
Instance Segmentation of 3D Point Clouds with Multi-
Task Pointwise Networks and Multi-Value Condi-
tional Random Fields. arXiv e-prints.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
Net: Deep learning on point sets for 3d classification
and segmentation. Proc. Computer Vision and Pattern
Recognition (CVPR), IEEE, 1(2):4.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
Net++: Deep hierarchical feature learning on point
sets in a metric space. In Advances in Neural Infor-
mation Processing Systems, pages 5099–5108.
Wang, W., Yu, R., Huang, Q., and Neumann, U. (2017).
SGPN: Similarity Group Proposal Network for 3D
Point Cloud Instance Segmentation. arXiv e-prints.
Wang, X., Liu, S., Shen, X., Shen, C., and Jia, J. (2019).
Associatively Segmenting Instances and Semantics in
Point Clouds. arXiv e-prints.
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham,
A., and Trigoni, N. (2019). Learning Object Bounding
Boxes for 3D Instance Segmentation on Point Clouds.
arXiv e-prints.
Yoshida, T., Fukao, T., and Hasegawa, T. (2018). Fast De-
tection of Tomato Peduncle Using Point Cloud with
a Harvesting Robot. Journal of Robotics and Mecha-
tronics, 30(2):180–186.