Detecting Object Defects with Fusioning Convolutional Siamese Neural Networks

Amr M. Nagy 1,2 and László Czúni 1

1 Faculty of Information Technology, University of Pannonia, Egyetem u. 10, Veszprém, Hungary

2 Faculty of Computers and Artificial Intelligence, Benha University, Egypt

Keywords:

Visual Inspection, Defect Detection, Siamese Neural Network.

Abstract:

Recently, the combination of deep learning algorithms with visual inspection technology has made it possible to differentiate anomalies in objects in a way that mimics human visual inspection. While this offers precise and persistent monitoring with a minimum amount of human activity, applying the same solution to a wide variety of defect types is challenging. In this paper, a new convolutional siamese neural model is presented to recognize different types of defects. One advantage of the proposed convolutional siamese neural network is that it can be used for new object types without re-training, with much better performance than other siamese networks: it can generalize the knowledge of defect types and apply it to new object classes. The proposed approach is tested with good results on two different datasets: one contains traffic signs of different types with different distortions, the other is a set of disk-shaped metal castings with and without defects.

1 INTRODUCTION

Recently, deep learning algorithms have been widely used in many areas of computer vision and have been shown to achieve human-level accuracy or even better. They have become the preferred methods for many vision tasks, including object detection, recognition, pose estimation, and image segmentation.

In order to meet industrial expectations, there is a strong need to achieve high performance in automated visual inspection, which means observing the same type of objects repeatedly to detect anomalies. There is a wide field of such applications, including automated product manufacturing, the railway industry, casting or welding, and healthcare. A general taxonomy of the different defects was presented in (Czimmermann et al., 2020): those detectable only by visual methods (e.g. contamination, color or shape errors) and palpable ones (detectable by touch and vision, e.g. cracks, bumps). In this paper we are interested in visible defects such as errors of disk castings or common problems with traffic signs (fading, occlusions, scribbles, and their combinations). See Figure 1 and Figure 6 for illustrations of the defects under investigation.

The defect detection process can be formulated as either an object detection or a segmentation task. In the object detection approach the goal is to detect each defect in the image and classify it into one of the predefined classes. In the image segmentation approach the problem is essentially solved by pixel classification, where the goal is to classify each image pixel as part of a defect or not. In general, object detection and instance segmentation are difficult tasks, as the number of instances in a particular image is unknown and often unbounded. Additionally, due to the wide range of products to be assembled, sensors cannot easily adapt to the different materials and shapes of the products to be inspected. Variations in the object's position, lighting, and background pose additional challenges to this task.

Figure 1: Traffic signs as examples of the 7 distortion classes under investigation: faded, covered, scribbled, correct, covered and faded, covered and scribbled, faded and scribbled.

A neural network is often used to learn to predict object or attribute classes. When we need to add or remove classes, we have to re-train the network on the modified dataset. In addition, we need a significant amount of new training data to obtain satisfactory performance. One solution to this issue is to use siamese neural networks (SNNs) (Bromley et al., 1994). SNNs take two inputs, run the same neural processing on both images in parallel, then combine the two branches to implement a similarity function between the inputs. With proper training, SNNs can learn the features that are suitable for differentiating the objects or some of their attributes. However, a key question is how much variability in the input images the same similarity function can tolerate while remaining sufficient.

Nagy, A. and Czúni, L.

Detecting Object Defects with Fusioning Convolutional Siamese Neural Networks.

DOI: 10.5220/0010263301570163

In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2021) - Volume 5: VISAPP, pages 157-163

ISBN: 978-989-758-488-6

In our paper we propose to train special SNNs to

predict if the pairs of the images belong to the same

defect class or not. We assume that, in case of satis-

factory training data, our network can generalize the

visual appearance of visual defects, thus we can ap-

ply the same network for new object classes with-

out re-training. The main contribution of our paper

is a siamese type network which, besides comput-

ing the difference of the features, contains the con-

catenation and further processing of these features.

In the article we refer to it as Fusioning Convolu-

tional Siamese Neural Network (FCSNN). Moreover,

we test the generalization properties of our network to

learn the latent defect speciﬁc features by predicting

the errors of new untrained object classes with differ-

ent appearance.

Our paper is organized as follows. In the next section we overview related papers, then in Section 3 we explain the proposed method. The datasets used are introduced in Section 3.2, while the experiments and evaluations can be found in Section 4. Finally, we conclude our article in the last section.

2 RELATED WORKS

Visual inspection approaches can be classified into low-level image processing approaches, such as statistical (Zhu et al., 2015), structural (Cao et al., 2015; Yun et al., 2019), filter-based (Kang et al., 2015), and model-based (Xi et al., 2017; Zhang et al., 2020) ones, and high-level image processing approaches, such as supervised (He et al., 2019), unsupervised, or semi-supervised classifiers (Mei et al., 2018).

Statistical methods concentrate on analyzing the spatial distribution of pixel values in an image. In (Zhu et al., 2015) a new algorithm is proposed that combines the autocorrelation function with the grey level co-occurrence matrix. First, the autocorrelation function is used to determine the pattern period, from which the size of the detection window is obtained, so that the co-occurrence matrix can be computed. In order to distinguish defective and defect-free images, the Euclidean distance is computed between templates and queries.

Structural approaches primarily concentrate on the spatial position of the texture elements. Such elements can be extracted from the texture and defined as texture primitives. Simple grey-scale areas, line segments or individual pixels are often the texture primitives. (Cao et al., 2015) deals with fabric defect inspection: the authors proposed prior knowledge guided least squares regression (PG-LSR) to combine the global structure of the texture feature space with the prior from local similarity. This combination helps to generate a clearer irregularity map and to identify various defects accurately and robustly.

Images are often described by detected features,

such as edges, textures and regions. In (Gai, 2016)

a banknote defect detection algorithm is presented to

detect cracks and scratches on banknote images us-

ing a quaternion wavelet transform and edge intensity.

The banknote image is ﬁrst registered using the least

squares method under the quaternion wavelet decom-

position framework. The defective features are ex-

tracted using edge difference between the reference

image and the test image. Naturally, this traditional

framework can be used in case of very similar sample

images. In (Xi et al., 2017) also a differential ﬁlter

is used to distinguish the defect among the textures,

but here a quantitative model characterizing the im-

pact of illumination on the image is developed, based

on which the non-uniform brightness in the image can

be effectively removed. By comparing the model out-

put against the captured image the illumination effect

can be successfully removed.

In our paper we are interested in deep learning solutions, which are well suited to creating a general framework that handles all the above-mentioned aspects. We refer the reader to (Wang et al., 2018), where the authors provide a comprehensive survey of commonly used deep learning algorithms and discuss their applications. In (Weimer et al., 2016) some design considerations for deep learning networks are discussed for visual inspection problems. There are many similar solutions for the recognition of visual defects: feature extraction in a number of layers followed by fully connected layers for classification can now be considered a 'traditional' approach. The identification of rail surface defects in (Faghih-Roohi et al., 2016), the detection of bridge defects in (Jinsong Zhu, 2020), and the multiscale CNN architecture for the detection of post-earthquake structural distortions in (Hoskere et al., 2018) use similar approaches. An alternative solution is the segmentation of defects, where autoencoder based networks are popular. InspectionNet (Yang et al., 2019), applied to the analysis of wall cracks, is one example. In order to build a scalable visual inspection framework that could be used to solve a variety of inspection tasks within a manufacturing context, (P. Liatsis, 2009) uses higher-order neural networks (HONNs) extracting geometric features invariant to translation and rotation. A major issue with higher-order neural networks is the combinatorial explosion of the higher-order terms. The authors addressed this problem with an alternative image representation strategy of coarse coding. They developed a genetic algorithm tool for the automated determination of the optimal number of hidden units in the neural network. The generic applicability of the method is tested in two applications, the inspection of axisymmetric components and of rivets, including the effect of different noises on the input images.

VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications

Figure 2: The proposed fusioning convolutional siamese neural network (FCSNN) architecture.

Besides the 'traditional' and autoencoder approaches, siamese networks can also be alternatives. In (Deshpande et al., 2020) a siamese network is applied to detect manufacturing defects in steel surfaces. While we also build on a siamese structure to detect features, our architecture differs considerably in the second part of the network (following the siamese branches). The testing use case is also different: our main question is how well the network can generalize its detection abilities when the variability in the visual appearance of the object classes is relatively high.

3 THE PROPOSED METHOD

Siamese networks (Bromley et al., 1994) are neu-

ral networks containing two or more sub-networks

that are connected by a layer which is typically re-

sponsible for the comparison of the features of the

branches. The sub-networks are identical: they have

the same parameters and weights trained simultane-

ously. The main idea behind siamese networks is to

learn the proper similarity function needed for the ef-

ﬁcient comparison of input images in a speciﬁc task.

In our paper we propose a siamese neural network

where not only the differences of features but the fu-

sion of the features is also processed with several fully

connected layers.

3.1 Proposed Architecture

For feature extraction we used the ImageNet pre-trained VGG16 model (Russakovsky et al., 2015) as the parallel sub-networks. The pair of input images ($X_{i,a}$ and $X_{i,b}$) are passed through the VGG16 networks to generate the fixed length feature vectors, then we added a fully connected classifier type layer to each branch to learn how to interpret the extracted features on our dataset. Thus, we have two vectors of length 4096. We utilized these vectors in two different ways. First, we computed the absolute difference between the two feature vectors by the $L_1$ distance. In the second branch, we concatenated the two feature vectors into one vector and fed it to three fully connected layers and two dropout layers with a dropout ratio of 0.2. At the end, we concatenated the two branches into one vector and fed it to a fully connected sigmoid layer to generate the similarity score output. The model was compiled using the Adam optimizer and the binary cross entropy loss function. The learning rate was set to 0.0004. The architecture is illustrated in Figure 2.
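The data flow that follows the two feature extraction branches can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes toy feature vectors of length 4 instead of 4096, randomly initialised weights, ReLU activations in the hidden layers, and hypothetical hidden layer widths (8, 8, 4). Dropout is left out because it is inactive at inference time. Only the overall structure (absolute difference branch, concatenation branch with three fully connected layers, final merge and sigmoid) follows the description above.

```python
import math
import random


def dense(x, weights, bias, activation=None):
    """Fully connected layer y = activation(W x + b) on plain Python lists."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    if activation == "relu":
        return [max(0.0, v) for v in out]
    if activation == "sigmoid":
        return [1.0 / (1.0 + math.exp(-v)) for v in out]
    return out


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_fusion_head(rng, feat_len, hidden=(8, 8, 4)):
    """Parameters of three fusion layers plus the sigmoid output layer.

    The hidden widths are assumptions; the paper does not state them.
    """
    sizes = [2 * feat_len, *hidden]
    fusion = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
    output = init_layer(rng, feat_len + hidden[-1], 1)
    return {"fusion": fusion, "output": output}


def similarity(head, feat_a, feat_b):
    """Similarity score in (0, 1) for two branch feature vectors."""
    # Branch 1: element-wise absolute (L1) difference of the features.
    diff = [abs(a - b) for a, b in zip(feat_a, feat_b)]
    # Branch 2: concatenated features passed through fully connected layers.
    fused = list(feat_a) + list(feat_b)
    for weights, bias in head["fusion"]:
        fused = dense(fused, weights, bias, activation="relu")
    # Merge both branches and map them to a single sigmoid output.
    weights, bias = head["output"]
    return dense(diff + fused, weights, bias, activation="sigmoid")[0]


rng = random.Random(0)
head = make_fusion_head(rng, feat_len=4)
score = similarity(head, [0.2, 0.9, 0.1, 0.5], [0.3, 0.8, 0.0, 0.5])
print(f"similarity score: {score:.3f}")
```

In a real model the two input vectors would come from the shared VGG16 branches, and the weights would be learned with the Adam optimizer and the binary cross entropy loss rather than drawn at random.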


3.2 Datasets

To better illustrate the training and evaluation procedure, we first introduce the datasets. We evaluated the proposed method on two different datasets: our own traffic signs dataset and a dataset of disk castings from the Kaggle website.

3.2.1 Trafﬁc Signs Dataset

There are several traffic sign datasets available but those contain no information about distortions. Our training traffic signs dataset includes 21 different types of traffic signs, see Figure 4 for sample images. The 6016 images were captured by dashboard cameras. We classified the dataset into 8 distortion classes (Faded, Covered, Scribbled, No error, Covered & Faded, Covered & Scribbled, Faded & Scribbled, Covered & Faded & Scribbled). Each defect class contains a different number of traffic signs; see Figures 3 and 5 for the distribution across the defect classes. This dataset was used for training. For testing we used a separate dataset introduced in Section 4.2.

Figure 3: Distribution of trafﬁc signs of the training data

with each error class.

3.2.2 Castings Dataset

The castings dataset includes only two classes: defective or errorless, as illustrated by Figure 6. It contains 7348 images showing the top view of submersible pump impellers. The size of the gray-scale images is 300 by 300 pixels. This dataset is split into training and testing folders. For training we have 3758 defective and 2875 correct product images. For testing the corresponding numbers are 453 and 262.

https://www.kaggle.com/ravirajsinh45/real-life-industrial-dataset-of-casting-product

Figure 4: Trafﬁc signs in the training dataset with their class

codes.

Figure 5: Percentage of images for each defect class in the

training dataset.

Figure 6: Example images from the castings dataset. First

two columns show objects with defects, the other two

columns are free from errors.

3.3 Training

To train a siamese network, we must create pairs of images: there can be pairs where both images are from the same error class and others where the two images are from different error classes. Figure 7 shows a few examples of how these pairs can look. Good pairs (pairs with an identical defect attribute) are given label 1 and distinct pairs are labeled 0. We generate these pairs randomly from all the defect classes in the training data, thus the dataset contains triples $(X_{i,a}, X_{i,b}, Y_i)$ where $Y_i$ is the required output (1 or 0). Recall that the input to our system is a pair of images and the output is a similarity score between 0 and 1.

Figure 7: Examples of training pairs with their labels.
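The pair generation described above can be sketched as follows. This is a hedged illustration, not the authors' procedure: the function name `make_pairs`, the toy file names, and the strict alternation between same-class and different-class pairs (to keep the two labels balanced) are assumptions, since the text only states that pairs are drawn randomly from all defect classes.

```python
import random


def make_pairs(images_by_class, n_pairs, rng):
    """Return (image_a, image_b, label) triples; label 1 means same defect class."""
    classes = sorted(images_by_class)
    pairs = []
    for i in range(n_pairs):
        cls_a = rng.choice(classes)
        if i % 2 == 0:
            # Good pair: both images come from the same defect class.
            cls_b = cls_a
        else:
            # Distinct pair: the second image comes from another defect class.
            cls_b = rng.choice([c for c in classes if c != cls_a])
        img_a = rng.choice(images_by_class[cls_a])
        img_b = rng.choice(images_by_class[cls_b])
        pairs.append((img_a, img_b, 1 if cls_a == cls_b else 0))
    return pairs


# Hypothetical file names grouped by defect class.
toy_data = {
    "faded": ["faded_01.png", "faded_02.png"],
    "covered": ["covered_01.png", "covered_02.png"],
    "scribbled": ["scribbled_01.png", "scribbled_02.png"],
}

pairs = make_pairs(toy_data, n_pairs=10, rng=random.Random(1))
for img_a, img_b, label in pairs[:4]:
    print(img_a, img_b, label)
```

Note that in this sketch a good pair may contain the same image twice; a stricter sampler could exclude that case.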

3.4 N-way One Shot Classiﬁcation

We perform N-way one shot classification (Lake et al., 2013) as a strategy to evaluate the models. In one-shot classification the elements of the support set $X_{i,b}$ are compared to each image of the test set $X_{i,a}$ (the queries). The query image is paired with N images: one pair has label 1, all the others have label 0. N specifies how many images the support set contains; we tested the model with N = 20 (20-way one-shot classification).

The trained model should predict 1 for image pairs with the same error class (high confidence) and should give 0 for pairs from different error classes. Thus, after ordering the pairs formed with the support set in decreasing order of confidence, the class of the first element $X_{i,b}$ is considered as the decision of the network:

$$\hat{X}_{i,b} = \operatorname*{argmax}_{X_{i,b}} \mathrm{Confidence}(X_{i,a}, X_{i,b}) \qquad (1)$$

In case of correct prediction the elements of the pair with the maximum confidence probability value will be from the same error class. During testing the generation of the support set and the evaluation of the prediction is repeated k times:

$$\mathrm{CorrectPercentage} = 100 \times n_{correct} / k \qquad (2)$$

where k is the total number of trials and $n_{correct}$ equals the number of correct predictions.
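The evaluation loop behind Equations (1) and (2) can be sketched as below. The sketch makes several assumptions that are not stated in the text: each "image" is a single number, `toy_similarity` is a placeholder for the confidence output of the trained network, the query is excluded from the support set, and the N-1 negative support images are sampled from all other classes. The function names are hypothetical.

```python
import random


def one_shot_trial(query, query_cls, images_by_class, n_way, similarity, rng):
    """Run one N-way one-shot trial and report whether the decision was correct."""
    # Exactly one support image shares the query's defect class (label 1) ...
    positives = [img for img in images_by_class[query_cls] if img != query]
    support = [(rng.choice(positives), query_cls)]
    # ... and the remaining N-1 support images come from other classes (label 0).
    negatives = [(img, cls) for cls, imgs in images_by_class.items()
                 if cls != query_cls for img in imgs]
    support += rng.sample(negatives, n_way - 1)
    # Equation (1): the support element with the highest confidence is the decision.
    _, predicted_cls = max(support, key=lambda item: similarity(query, item[0]))
    return predicted_cls == query_cls


def correct_percentage(images_by_class, n_way, k, similarity, rng):
    """Equation (2): 100 * n_correct / k over k randomly generated trials."""
    classes = sorted(images_by_class)
    n_correct = 0
    for _ in range(k):
        query_cls = rng.choice(classes)
        query = rng.choice(images_by_class[query_cls])
        if one_shot_trial(query, query_cls, images_by_class, n_way, similarity, rng):
            n_correct += 1
    return 100.0 * n_correct / k


# Toy stand-ins: each "image" is one number and the classes are well separated.
toy_images = {
    "faded": [0.0, 0.1, 0.2],
    "covered": [5.0, 5.1, 5.2],
    "scribbled": [10.0, 10.1, 10.2],
}


def toy_similarity(a, b):
    """Placeholder for the trained network's confidence output."""
    return 1.0 / (1.0 + abs(a - b))


accuracy = correct_percentage(toy_images, n_way=5, k=50,
                              similarity=toy_similarity, rng=random.Random(0))
print(f"CorrectPercentage = {accuracy:.1f}")
```

Because the toy classes are far apart, the placeholder similarity always ranks the same-class support image first; with a real network the score would reflect how often the correct pair wins among the N comparisons.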

In Figure 8 we show an example for 5-way one

shot classiﬁcation. It is expected that the pair of im-

ages in the ﬁrst line will have the highest conﬁdence

since they have identical defects.

Figure 8: An example for 5-way one shot testing. Since the

pair with the same defect has the highest conﬁdence, the

network made a correct classiﬁcation.

4 EXPERIMENTAL RESULTS

The proposed FCSNN architecture was used to es-

timate whether two images are from the same error

class or not. The testing method itself is quite strict

since in case of 20-way one shot classiﬁcation the

recognition is evaluated as correct if the right pair

has the highest conﬁdence from the 20 comparisons.

While in case of castings all images look similar, the

trafﬁc signs tests contain objects with large differ-

ences in visual appearance.

4.1 Testing on the Castings Dataset

In Table 1 we list the test results of four different neural networks, where the proposed architecture is denoted as FCSNN. In this experiment FCSNN was evaluated with 20-way one shot classification, and the reported result is the average of 10 such tests. Our proposed network outperformed the siamese network of (Koch, 2015) and a fine tuned VGG16 network, but it is slightly worse than ResNet34.

Table 1: Accuracy on the castings dataset for different deep learning algorithms from the Kaggle website, the siamese network of (Koch, 2015), and our network (FCSNN).

Method                  Accuracy
Fine Tuned VGG16        98.89%
ResNet34                99.70%
Siamese (Koch, 2015)    97.90%
FCSNN                   99.50%

4.2 Testing on Untrained Classes of

Trafﬁc Signs

Traffic signs can be considered good test objects since there are many types of them. Unfortunately, it is not easy to collect a large number of real traffic sign images with different defects. We created a testing dataset with 25 traffic sign classes which were not used in the training of the FCSNN. This dataset contains 490 images, each belonging to one of the following 4 classes: faded, covered, scribbled, errorless. See Figure 9 for examples of the new object classes and Figure 10 for the number of images in each defect class.

We evaluated the defect classification accuracy of SNN (Koch, 2015), SNN with VGG16, and FCSNN with the 20-way one shot classification technique. The average results of 10 experiments are given in Table 2. Three test cases, for the error classes faded, covered, and scribbled, were evaluated independently; their weighted average is also given. The performance of SNN is the lowest. To be able to measure the contribution of our proposal, we replaced the feature extraction part of (Koch, 2015) with VGG16. This modified network gave significantly better results. Our proposal (FCSNN) outperformed this variation by circa 9.5% on average. The reason for the large variance between the accuracies of the three error classes is left for future investigation.

5 CONCLUSIONS

In our paper we proposed a new siamese neural network architecture to recognize the defects of different objects. The previously proposed networks were extended by several layers, and the original features, besides being used to compute the difference, were retained for fully connected layers. Two datasets were used for evaluation with 20-way one shot testing. These tests show that the proposed architecture performs significantly better than previous solutions in cases where untrained types of objects are tested. In the future we plan to extend our datasets with other types of objects and would like to enhance the ability to learn latent information for unknown object classes.

Figure 9: Example images for the untrained traffic sign types.

Figure 10: Distribution of the untrained traffic signs with each error class.

ACKNOWLEDGEMENTS

We acknowledge the financial support of the Széchenyi 2020 program under the project EFOP-3.6.1-16-2016-00015, and the Hungarian Research Fund, grant OTKA K 120367. We are grateful to

the NVIDIA Corporation for supporting our research

with GPUs obtained by the NVIDIA GPU Grant Pro-

gram. Also we acknowledge the help of our col-

leagues in data collection, namely Zsolt V

osh

azi,

Katalin T

ordi,

Agnes Lipovits, N

andor Szoll

at,

and D

aniel Zajzon.


Table 2: Accuracy of three siamese networks on untrained traffic signs for three independent tests.

Test cases (distribution of images)                      SNN (Koch, 2015)   SNN (Koch, 2015) with VGG16 features   FCSNN
12 traffic sign classes, 66 faded and 162 errorless      43.3%              80.6%                                  92%
21 traffic sign classes, 92 covered and 195 errorless    7.5%               27.9%                                  28.9%
6 traffic sign classes, 34 scribbled and 54 errorless    9.4%               11.5%                                  25.8%
Weighted average                                         22.32%             43.12%                                 52.81%

REFERENCES

Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R. (1994). Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744.

Cao, J., Zhang, J., Wen, Z., Wang, N., and Liu, X. (2015).

Fabric defect inspection using prior knowledge guided

least squares regression. Multimedia Tools and Appli-

cations, 76:4141–4157.

Czimmermann, T., Ciuti, G., Milazzo, M., Chiurazzi, M.,

Roccella, S., Oddo, C. M., and Dario, P. (2020).

Visual-based defect detection and classiﬁcation ap-

proaches for industrial applications—a survey. Sen-

sors, 20(5):1459.

Deshpande, A. M., Minai, A. A., and Kumar, M. (2020).

One-shot recognition of manufacturing defects in steel

surfaces. arXiv preprint arXiv:2005.05815.

Faghih-Roohi, S., Hajizadeh, S., Núñez, A., Babuska, R., and De Schutter, B. (2016). Deep convolutional neural networks for detection of rail surface defects. In 2016 International joint conference on neural networks (IJCNN), pages 2584–2589. IEEE.

Gai, S. (2016). New banknote defect detection algorithm

using quaternion wavelet transform. Neurocomputing,

196:133–139.

He, D., Xu, K., and Zhou, P. (2019). Defect detection of hot

rolled steels with a new object detection framework

called classiﬁcation priority network. Computers &

Industrial Engineering, 128:290–297.

Hoskere, V., Narazaki, Y., Hoang, T., and Spencer Jr, B.

(2018). Vision-based structural inspection using mul-

tiscale deep convolutional neural networks. arXiv,

pages arXiv–1805.

Jinsong Zhu, Chi Zhang, H. Q. . Z. L. (2020). Vision-based

defects detection for bridges using transfer learning

and convolutional neural networks. Structure and In-

frastructure Engineering, pages 1037–1049.

Kang, X., Yang, P., and Jing, J. (2015). Defect detec-

tion on printed fabrics via gabor ﬁlter and regular

band. Journal of Fiber Bioengineering and Informat-

ics, 8(1):195–206.

Koch, G. R. (2015). Siamese neural networks for one-shot

image recognition.

Lake, B. M., Salakhutdinov, R. R., and Tenenbaum, J.

(2013). One-shot learning by inverting a composi-

tional causal process. In Advances in neural infor-

mation processing systems, pages 2526–2534.

Mei, S., Yang, H., and Yin, Z. (2018). An unsupervised-

learning-based approach for automated defect inspec-

tion on textured surfaces. IEEE Transactions on In-

strumentation and Measurement, 67(6):1266–1277.

P. Liatsis, J.Y. Goulermas, X.-J. Z. . E. M. (2009). A ﬂexible

visual inspection system based on neural networks.

International Journal of Systems Science, pages 173–

186.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,

Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-

stein, M., et al. (2015). Imagenet large scale visual

recognition challenge. International journal of com-

puter vision, 115(3):211–252.

Wang, J., Ma, Y., Zhang, L., Gao, R. X., and Wu, D. (2018).

Deep learning for smart manufacturing: Methods

and applications. Journal of Manufacturing Systems,

48:144–156.

Weimer, D., Scholz-Reiter, B., and Shpitalni, M. (2016).

Design of deep convolutional neural network archi-

tectures for automated feature extraction in industrial

inspection. CIRP Annals, 65(1):417–420.

Xi, J., Shentu, L., Hu, J., and Li, M. (2017). Automated

surface inspection for steel products using computer

vision approach. Applied optics, 56(2):184–192.

Yang, L., Li, B., Yang, G., Chang, Y., Liu, Z., Jiang, B., and

Xiaol, J. (2019). Deep neural network based visual in-

spection with 3d metric measurement of concrete de-

fects using wall-climbing robot. In 2019 IEEE/RSJ In-

ternational Conference on Intelligent Robots and Sys-

tems (IROS), pages 2849–2854.

Yun, J. P., Lee, S. J., Koo, G., Shin, C., and Park, C.

(2019). Automatic defect inspection system for steel

products with exhaustive dynamic encoding algorithm

for searches. Optical Engineering, 58(2):023107.

Zhang, Y., Xing, Y., Gong, Y., Jin, D., Li, H., and Liu, F.

(2020). A variable-level automated defect identiﬁca-

tion model based on machine learning. Soft Comput-

ing, 24(2):1045–1061.

Zhu, D., Pan, R., Gao, W., and Zhang, J. (2015). Yarn-dyed

fabric defect detection based on autocorrelation func-

tion and glcm. Autex research journal, 15(3):226–232.
