High Resolution Mask R-CNN-based Damage Detection on Titanium
Nitride Coated Milling Tools for Condition Monitoring by using a New
Illumination Technique
Mühenad Bilal¹, Sunil Kancharana¹, Christian Mayer¹, Daniel Pfaller¹, Leonid Koval¹, Markus Bregulla¹, Rafal Cupek² and Adam Ziębiński²
¹Technische Hochschule Ingolstadt, Esplanade 10, Ingolstadt 85057, Germany
²Silesian University of Technology, Institute of Informatics, Gliwice, Poland
Keywords: Predictive Maintenance, Machine Learning, Damage Detection, Illumination Source, Mask R-CNN.
Abstract: The implementation of intelligent software in the manufacturing industry is a technology of growing importance and has highlighted the need for improvements in automation, production, inspection, and quality assurance. An automated inspection system based on deep learning methods can enhance inspection and provide a consistent overview of the production line. Camera-based imaging systems are among the most widely used tools replacing manual industrial quality control tasks. Moreover, an automated damage detection system for milling tools can be employed for quality control during the coating process and to simplify tool-life measurement. Deep Convolutional Neural Networks (DCNNs) are state-of-the-art methods for extracting visual features and classifying objects, so there is great interest in applying DCNNs to damage detection and classification. However, training a DCNN model on Titanium-Nitride coated (TiN) milling tools is extremely challenging: because of the coating, optical properties such as reflection and light scattering on the tool surface complicate image capturing for computer vision tasks. In addition to the reflection and scattering, the helical-shaped surface of the cutting tools creates shadows, preventing the neural network from training efficiently and detecting damages. In the context of applying an automated deep learning-based method to detect damages on coated milling tools for quality control, this paper presents a novel illumination technique that allows capturing high-quality images and thus makes damage detection for condition monitoring and quality control reliable. The method is outlined along with the results obtained by training ResNet 50 and ResNet 101 models, reaching an overall accuracy of 83% on a dataset containing bounding-box annotated damages. For instance and semantic segmentation, the state-of-the-art framework Mask R-CNN is employed.
1 INTRODUCTION
Machining Process Monitoring (MPM) (Liang et al., 2004) plays an important role in reducing cost, ensuring greater product variability, and improving manufacturing productivity and reliability (Caggiano, 2018). Monitoring of the production process (Cupek et al., 2015), production variants (Cupek et al., 2018; Yli-Ojanperä et al., 2019) and other parameters
such as the current supply (Grzechca et al., 2017; Yingjie, 2014) and even the speed of the engines is of growing importance in providing real-time data for manufacturers. Moreover, there is a vital demand for Tool Condition Monitoring (TCM) (Short and Twiddle, 2019), especially when it comes to evaluating the milling process with regard to tool wear and the resulting surface roughness.
(Figure: typical tool wear phases: break-in, normal wear, abnormal wear.)
Coating can increase the durability of cutting tools
by 10-12 times (Spišák and Majernikova, 2017). In
almost all micro-size industries, digital image processing techniques are used for quality assurance measurements (Chen and Lee, 2010). Using an Imaging System (IS) for damage detection on coated milling tools is accompanied by many difficulties such as high damage density, low contrast intensity, inhomogeneity, and variations in damage shape. Weak boundaries and strong gradients on the tool contours that overlap with the damages can also decrease the detection accuracy. Additionally, due to the complex geometry of the cutting tools, deficient illumination uniformity results in large intensity variations between image regions, making a trained DCNN model insufficient for inspection applications. To overcome these difficulties, a new illumination technique was developed that ensures uniform illumination for capturing high-quality images for computer vision tasks such as object detection and semantic segmentation. Additionally, using these images can improve and accelerate the training of DCNN models, increasing damage detection accuracy. To the best of our knowledge, this is the first work in which an object with critical optical properties is inserted into a Cylindrical Shaped Enclosure (CSE) to capture high-quality images for object detection, instance segmentation and pixel-wise damage detection tasks.
Object detection algorithms have been continuously improved by the computer vision community. Much of this progress has been driven by popular object detection algorithms like SSD (Liu et al., 2016), R-CNN (Girshick et al., 2014), Fast R-CNN (Girshick, 2015), Faster R-CNN (Ren et al., 2015) and YOLO (Redmon et al., 2016). For automatic damage recognition and localization, the state-of-the-art detection framework Mask R-CNN was employed, which extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI) in parallel to the branch for classification and bounding box regression. The mask branch is a fully convolutional network that takes an input of arbitrary size and produces a correspondingly sized output with efficient inference and learning, predicting a segmentation mask for each RoI in a pixel-to-pixel manner while adding only a small computational overhead to Faster R-CNN. This enables a fast system and rapid experimentation.
In the first stage, Mask R-CNN uses a Region Proposal Network (RPN) (Girshick et al., 2014) to generate a sparse set of rectangular proposals (Faster, 2015). Each proposal represents a RoI on the feature maps and indicates whether a target is present. In the next step, RoI pooling extracts the features of each proposal from a CNN feature map. Finally, the two processing branches mentioned above classify the object and predict the masks. The mask prediction indicates whether each pixel lies within the predicted bounding box of the object or not.
Additionally, Faster R-CNN includes a Feature Pyramid Network (FPN), which combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. This results in excellent gains in both accuracy and speed (Tsung-Yi Lin et al., 2017). Moreover, the FPN can enhance the detection of small damages below 30 µm using just a standard commercial camera system.
This paper focuses on coated cutting tools, which are widely used in the milling industry. Due to the increased demand for all kinds of high-precision and high-accuracy cutting tools, determining the wear or damage of cutting edges is of great importance (Schulz and Moriwaki, 1992). For this purpose, a DCNN-based tool measuring and inspection system for determining the wear condition of the cutting edges and the coating homogeneity will support tool manufacturers as well as the machining process.
The main steps and aim of this paper can be sum-
marized as follows:
1. Use of Cylindrical Shaped Enclosure (CSE) for
capturing high-quality images to avoid unwanted
reflection and ensure homogeneous illumination
on the optical-critical components.
2. Preprocessing the data by cropping each image into 36 small image fragments to improve the performance of the DCNN model. In addition, the cropped images help the FPN to detect small damages.
3. Due to the high image quality, only a few annotated images are needed to train the model and perform high-accuracy damage detection.
4. Predicting damages on the cropped images and merging them to reconstruct the original image with the corresponding damages.
5. Fine-tuning the model by modifying hyperparameters such as the area of the anchor boxes with various backbones.
The rest of the paper is structured as follows: The measurement apparatus is described in Section 2. Section 3 explains the method using the Mask R-CNN algorithm, followed by Section 4, which includes the experimental results and analysis. The paper ends with the conclusion in Section 5.
2 IMAGE ACQUISITION OF
MILLING TOOLS WITH
CYLINDRICAL SHAPED
ENCLOSURE MEASUREMENT
SETUP
The proposed measurement setup (see Figure 1) has been filed for a European (EU) patent. It consists of a cylindrical shaped enclosure (CSE) whose inner walls are coated with Barium Sulfate (BaSO4) to enhance multiple light scattering. The idea is inspired by a conventional integrating sphere, which is used as a light source with a uniform luminance field at the exit port and as a uniform illumination field at various distances for photometric and radiometric applications (Liu et al., 2015). For a uniform distribution of light, 14 multi-spectral Light Emitting Diodes (LEDs) are distributed uniformly around the circumference of the CSE, with diffusion disks mounted in front of them. The measurement setup also includes a camera system consisting of a commercial camera and a slider. The slider helps in adjusting the focal length of the lens according to the tool length. This unique and innovative light source can be used for various computer vision tasks such as object detection and semantic segmentation. A rotation plate is located below the CSE to ensure that images are captured in a sequence of 15° steps so that the entire 360° view of the tool is obtained.
Figure 1: Measurement setup for image acquisition of components with a high reflection coefficient and complex helical-shaped structures.
The proposed measurement setup allows capturing high-quality images without any reflections. To demonstrate this, a TiN coated milling tool was captured both with the proposed measurement setup and under normal illumination conditions without any controlled environment. The difference can be observed in Figure 2.
Figure 2: Comparison of images of the same tool captured under normal illumination conditions (left) and with the proposed measurement setup (right).
Figure 3: The Mask R-CNN framework for instance seg-
mentation used for high resolution damage detection on
milling tools.
3 METHOD USING MASK R-CNN
ALGORITHM
The objective of this work is to develop a method to capture high-quality images of TiN coated milling tools and to detect and segment damages on these tools using the state-of-the-art segmentation and object detection framework Mask R-CNN.
Figure 3 illustrates the architecture of Mask R-CNN. Mask R-CNN has three outputs: a class label, a bounding box, and an object mask. It consists of a backbone network for generating multi-scale feature maps, an FPN to enhance the extraction of semantic and abstract information from the feature maps, an RPN module for generating a large number of region proposals for refining bounding boxes, and a mask head for generating binary masks of the objects, in our case the damages on the milling tools.
The working principle of our proposed Mask R-
CNN based algorithm can be described step-wise as
follows:
1. Capture high quality images by using our mea-
surement setup, described in Figure 1.
2. Generate a data set by cropping each image into 36 small fragments and assign each fragment to an element of a 9×4 matrix. This step prevents the loss of spatial information, enhances feature extraction by the FPN, and reduces the training and test time.
3. Feed the cropped image to a residual network
ResNet101 or ResNet50 with FPN for enhancing
feature extraction to generate feature maps.
4. The feature maps are then scanned by the RPN with a sliding window, looking for potential candidates and generating proposals with different sizes and aspect ratios.
5. The feature maps obtained from the RPN now contain a large number of framed candidates as proposals. The next step uses a softmax classifier, frame regression and non-maximum suppression to discard inaccurate proposals and retain only the top-scoring predictions as RoIs for the next step.
6. The remaining Regions of Interest (RoIs) on the feature maps are then sent to the Region of Interest Alignment layer (RoIAlign layer), which performs pooling and quantization on each RoI, thereby generating a fixed-size feature map for each proposal.
7. The new feature map passes through the two branches mentioned above. The first is a fully connected layer for object classification and frame regression, and the second is a fully convolutional network for pixel segmentation and mask prediction.
8. Finally, the damages on each cropped image are obtained. Since the cropped images have been assigned to the 9×4 matrix, they can be merged back together to depict the damages marked with bounding boxes, scores and masks (a minimal code sketch of this crop-predict-merge procedure is given after the list).
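The per-tile inference and the merging of tile-local detections back into full-image coordinates can be sketched as follows. The paper itself builds on the Keras/TensorFlow implementation of (Waleed Abdulla, 2017); the torchvision Mask R-CNN below (assuming torchvision 0.13 or newer, with COCO weights as a placeholder for the fine-tuned damage detector) is only an illustrative stand-in.

```python
import torch
import torchvision

# Illustrative stand-in model; the actual work fine-tunes a Mask R-CNN on the
# annotated damage dataset rather than using COCO weights directly.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

TILE = 512  # tile size in pixels, matching the 9x4 cropping grid

def detect_on_tiles(tiles):
    """tiles: dict mapping (row, col) in the grid to an HxWx3 uint8 tensor."""
    merged = []
    with torch.no_grad():
        for (row, col), tile in tiles.items():
            img = tile.permute(2, 0, 1).float() / 255.0        # CxHxW in [0, 1]
            pred = model([img])[0]
            # Shift tile-local boxes back into full-image coordinates.
            offset = torch.tensor([col * TILE, row * TILE, col * TILE, row * TILE])
            for box, score in zip(pred["boxes"], pred["scores"]):
                merged.append((box + offset, float(score)))
    return merged
```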
3.1 Related Work
In this section, an introduction to DCNNs with special emphasis on Region Based Object Detection (RBOD) and Semantic Segmentation (SS) methods is provided. Several research studies have been undertaken to develop DCNNs for locating class-specific and class-agnostic bounding boxes (Szegedy et al., 2013; Szegedy et al., 2014; Erhan et al., 2014). Fully-connected (FC) layers have been used to train models that predict a box with spatial coordinates to localize single objects and to detect multiple class-specific objects (Sermanet et al., 2014). These techniques have been employed in the region-based CNN (R-CNN) object detection approach (Girshick et al., 2014). Using R-CNN, Girshick et al. presented a simple and scalable detection algorithm that improves mAP on the PASCAL VOC 2012 dataset by more than 30%. However, R-CNN lacks computation sharing: the convolutional forward pass is performed separately for each object proposal, resulting in high training and test times. R-CNN combined with spatial pyramid pooling networks (SPPnets) can be sped up by sharing computation, up to 100 times at test time and 3 times at training time (He et al., 2015). However, SPPnets still have drawbacks: during fine-tuning they cannot update the convolutional layers that precede the spatial pyramid pooling, which limits the accuracy of the DCNN. To overcome the disadvantages of R-CNN and SPPnet, Girshick et al. introduced Fast R-CNN (Girshick, 2015) as an extension of R-CNN, which was then extended to Faster R-CNN (Ren et al., 2017). Fast R-CNN trains 9 times faster than R-CNN on the VOC07 dataset (Girshick, 2015). Faster R-CNN is a flexible and robust two-stage system and is considered the leading framework in several benchmarks (Tsung-Yi Lin et al., 2017; Shrivastava et al., 2016). The common idea behind Faster R-CNN is to use the convolutional feature map generated by a DCNN (e.g. ResNet) to determine region proposals with different anchor sizes using sliding windows for feature extraction, whereas Fast R-CNN takes an entire image together with a set of object proposals as input, whose features are extracted by a region of interest pooling layer.
Applying instance segmentation and object detection simultaneously proves to be challenging, because it requires correctly detecting all objects in the image and segmenting each instance of every object. The computer vision community has improved semantic segmentation alongside object detection, largely as separate tasks. In large part, this has been driven by powerful baseline systems based on segment proposal methods (Girshick et al., 2014; Hariharan et al., 2014; Hariharan et al., 2017). Jonathan Long et al. defined a fully convolutional network (FCN) for segmentation; the FCN combines layers of the feature hierarchy and at the same time refines the spatial precision of the output (Shrivastava et al., 2016). A DeepMask model with two branches was introduced by (Pinheiro et al., 2015); for high-quality object segmentation, the mask branch uses only the upper layer of the extracted CNN features and predicts the likelihood of the segmented object. To improve the object segmentation masks and increase the pixel segmentation accuracy, a deep learning approach based on augmenting feedforward networks with top-down refinement, called SharpMask, was proposed in 2016 (Pinheiro et al., 2016).
Mask R-CNN was introduced by (He et al., 2017; Nur Ömeroğlu et al., 2019) to extend Faster R-CNN by adding a second branch for predicting object masks alongside the existing branch for bounding box regression, adding only a small overhead to Faster R-CNN. Since then, Mask R-CNN based methods for object detection and classification have been widely used to determine the category and localization of multi-class objects, e.g., to identify and segment polyps in colonoscopy images (Kang and Gwak, 2019), detect ships on high-resolution remote sensing images (You et al., 2019), quantify blueberries in the wild (Gonzalez et al., 2019) and for a variety of practical damage detection applications (Zhang et al., 2020).
3.2 Different Backbones of Mask
R-CNN
In this section, the ResNet backbones used in Mask R-CNN are discussed. The residual backbone networks ResNet101 and ResNet50 have been used; ResNet101 consists of 101 layers, ResNet50 of 50 layers. Both networks are combined with an FPN to obtain feature maps at four levels P2, P3, P4 and P5, corresponding to the last residual blocks of the conv2, conv3, conv4 and conv5 stages (Tsung-Yi Lin et al., 2017). Thus the proposed backbone enhances the extraction of damages at different scales.
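As a minimal sketch of such an FPN-augmented backbone (assuming torchvision 0.13 or newer; the paper's own implementation is based on the Keras/TensorFlow code of (Waleed Abdulla, 2017)), the pyramid levels P2 to P5 can be inspected as follows:

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 (or "resnet101") with an FPN on top; the returned ordered dict
# maps keys '0'..'3' to the pyramid levels P2..P5 plus an extra 'pool' level.
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)
features = backbone(torch.randn(1, 3, 512, 512))
for name, fmap in features.items():
    print(name, tuple(fmap.shape))   # e.g. '0' -> (1, 256, 128, 128), i.e. P2
```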
3.3 The Improvement of Detection
Accuracy by Adjusting RPN
The region-based detector RPN has been used in Fast R-CNN, Faster R-CNN and Mask R-CNN to generate initial region proposals at various scales and aspect ratios. This is done by using appropriate multiple anchor boxes, as shown in Figure 4. The RPN takes the different-sized feature maps generated by the FPN module and outputs object region boundaries together with their associated objectness scores. The scores specify the likelihood of each proposed region containing an RoI and determine the level of the feature pyramid on which the sliding window (red box) is applied. The regions scanned by the sliding window are called anchors. Anchors are boxes centered at the sliding window, associated with different sizes and aspect ratios, and distributed over the whole feature map. The resulting feature vector passes through two 1×1 convolutional layers for box regression and box classification. At each sliding-window location, multiple region proposals are predicted, with a maximum of k proposals corresponding to the anchor boxes. The box regression layer outputs 4k coordinates and the classification layer outputs 2k scores estimating the probability of an object being present or not.
Figure 4: The RPN architecture. The red box represents the sliding window.
If an anchor box has an Intersection over Union (IoU) with a ground-truth box greater than 0.8, it is assigned a positive label; otherwise it is assigned a negative label. Therefore, the scale and the size of the anchor boxes were tuned and adjusted to improve the damage detection accuracy. To reduce redundancy, Non-Maximum Suppression (NMS) was applied to suppress low-scored proposals.
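A minimal sketch of this IoU-based anchor labeling is given below; the threshold values follow this section, while the ignore label and tensor layout are illustrative assumptions rather than an excerpt from the actual implementation.

```python
import torch
from torchvision.ops import box_iou

POS_IOU, NEG_IOU = 0.8, 0.3   # thresholds as named in this section

def label_anchors(anchors, gt_boxes):
    """anchors, gt_boxes: (N, 4) and (M, 4) tensors in (x1, y1, x2, y2) format."""
    iou = box_iou(anchors, gt_boxes)                  # (N, M) pairwise IoU matrix
    best_iou, _ = iou.max(dim=1)                      # best ground-truth match per anchor
    labels = torch.full((anchors.shape[0],), -1, dtype=torch.int64)  # -1 = ignored
    labels[best_iou >= POS_IOU] = 1                   # positive: likely a damage region
    labels[best_iou < NEG_IOU] = 0                    # negative: background
    return labels
```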
3.4 Region of Interest Alignment Layer
(RoIAlign)
As mentioned above, segmentation at the pixel level is performed by the mask branch to determine whether a given pixel is part of the target (here, a damage) or not. During the convolutional and pooling operations, the quantization involved changes the image sizes and causes a positional offset of the RoI, which degrades the accuracy for small targets. Therefore, the Region of Interest Alignment layer (RoIAlign) is applied, in which the number of sampling points is increased and each sampling point is computed by bilinear interpolation, so that the value of the entire RoI is derived with less offset and error. RoIAlign improves the average precision substantially (He et al., 2020).
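A minimal illustration of the operation (using torchvision's roi_align as a stand-in; the feature-map size, scale and output size below are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_align

# RoIAlign: bilinear sampling replaces the hard quantization of RoI pooling,
# so small damage regions keep their sub-pixel alignment.
feature_map = torch.randn(1, 256, 128, 128)          # e.g. the P2 level of the FPN
rois = torch.tensor([[0, 10.3, 22.7, 58.1, 71.9]])   # (batch_index, x1, y1, x2, y2)
pooled = roi_align(feature_map, rois, output_size=(7, 7),
                   spatial_scale=1 / 4, sampling_ratio=2)
print(pooled.shape)   # torch.Size([1, 256, 7, 7]): fixed-size map per proposal
```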
3.5 Training and Loss Function
During the training process, optimizing the loss function plays an important role for both object detection and semantic segmentation. The training process consists of forward propagation and backward propagation. Forward propagation starts with extracting the feature maps and has three branches contributing to the overall loss: the mask loss, the classification loss and the location regression loss. Back-propagation updates the parameters of each layer in the network and minimizes the loss function using a momentum optimization algorithm (Sutskever et al., 2013; Rumelhart et al., 1986). The RPN module is trained with an object/non-object binary classification assigned to each anchor. A positive label is assigned to the anchor with the highest IoU overlap with a ground-truth box, or to any anchor with an IoU overlap higher than 0.8 with any ground-truth box. Following the multi-task loss in Fast R-CNN (Girshick, 2015), the loss function of the first branch for classification and regression is given by:
\[
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \quad (1)
\]
where i is the index of an anchor in a mini-batch and p_i is the predicted probability of the anchor being an object. The ground-truth label p_i* is 1 if the anchor is positive and 0 if the anchor is negative. The vector t_i contains the four coordinates of the predicted bounding box and t_i* represents the coordinates of the associated ground-truth box.
The first term of Equation 1, L_cls, is the log loss over the binary classification into the two classes, namely damage or no damage. The location regression L_reg is the smooth L1 loss (Faster, 2015) between the vectors t_i and t_i*. The regression term is only activated when the anchor is positive and is balanced by λ (Pinheiro et al., 2016). The outputs are marked with bounding boxes assigned to the localization of the damage, together with the probability of a damage being present or not. Since in this work object detection and semantic segmentation are combined to classify each pixel assigned to the damages, the mask branch outputs a K·m² dimensional output for each RoI, corresponding to K binary masks of dimension m × m, one for each of the K classes. Similar to (He et al., 2020), L_mask is defined for the k-th mask of the RoI, associated with the ground-truth class k, as the average binary cross-entropy loss:
\[
L_{mask} = -\frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log \hat{y}^k_{ij} + (1 - y_{ij}) \log(1 - \hat{y}^k_{ij}) \right] \quad (2)
\]
where y_ij is the label of cell (i, j) in the true mask for the region of size m × m and ŷ^k_ij is the predicted value of the same cell in the mask learned for the ground-truth class k. The multi-task loss function of Mask R-CNN is therefore given by:
\[
L = L(\{p_i\},\{t_i\}) + L_{mask} \quad (3)
\]
4 EXPERIMENTAL RESULTS
AND ANALYSIS
The dataset of cutting tools captured with the measurement system shown in Figure 1 was used. The measurement system generates homogeneously illuminated images of the TiN coated cutting tools with high contrast and low noise. The high-quality images allow us to achieve reliable results using small data sets. The network of (Waleed Abdulla, 2017) was modified and used in this study.
4.1 Dataset
There is a lack of datasets of TiN coated milling tools, especially for damage detection applications. Capturing valuable data of optically critical objects involves many difficulties; one of them is avoiding reflections and shadows in the images. The exposure of helical-shaped cutting tools makes it even more challenging, since the complex shape of the cutting tool tends to produce variations in brightness and contrast as well as artifacts. To overcome these challenges, a new illumination technique for capturing high-quality images was developed, which allows using only a few images as a training data set and obtaining reliable results with fewer than 25 training epochs. For full inspection, 24 images from different angles were captured by rotating the milling tool. Each image was cropped into 36 small fragments of a fixed size of 512×512 pixels, making a total of 864 images. Around 144 damages were annotated by experts.
4.1.1 Data Augmentation
The goal was to achieve high performance with only a few manually annotated images. Therefore, the following data augmentation techniques were applied to increase the training data set from 518 to 5180 images (a sketch of such an augmentation pipeline is given after the list):
1. The images were randomly flipped (horizontally and vertically).
2. The images were randomly rotated in a range between -90° and 90°.
3. The images were scaled from 50% to 150% of their original size.
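One possible realisation of this augmentation policy is sketched below with imgaug, a library commonly used with the Mask R-CNN code base of (Waleed Abdulla, 2017); the exact library and parameters used in this work are not specified here, so the snippet is only an illustration under that assumption.

```python
import numpy as np
import imgaug.augmenters as iaa

# Augmentation pipeline mirroring the three steps above: random flips,
# rotation between -90 and 90 degrees, and scaling from 50% to 150%.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                      # random horizontal flip
    iaa.Flipud(0.5),                      # random vertical flip
    iaa.Affine(rotate=(-90, 90),          # random rotation in degrees
               scale=(0.5, 1.5)),         # random scaling of the original size
])

images = [np.zeros((512, 512, 3), dtype=np.uint8)]   # placeholder 512x512 tiles
augmented = augmenter(images=images)
```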
4.1.2 Cropping Images
Mask R-CNN and other current instance segmentation methods are designed for supervised learning and typically require a large amount of labeled training data to obtain good results. In this work it is shown that only a few images, in combination with transfer learning and appropriate data augmentation, can yield high-resolution damage detection, reaching an Average Precision (AP) higher than 0.83. For the experiment, 24 high-quality images are used, where each image is cropped into 36 small fragments, resulting in a total of 864 images and fully utilizing the spatial information. For the damage detection database, ground-truth annotations of 144 images were created manually by drawing a bounding box over each damage.
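The cropping step can be sketched as follows; the grid orientation (9 rows by 4 columns) and the assumption that the full image covers at least 9·512 × 4·512 pixels are illustrative choices, not details taken from the original preprocessing code.

```python
import numpy as np

TILE = 512          # fragment size in pixels
ROWS, COLS = 9, 4   # grid used to index the 36 fragments of one tool image

def crop_to_grid(image):
    """Crop a full tool image into ROWS x COLS tiles of TILE x TILE pixels."""
    tiles = {}
    for r in range(ROWS):
        for c in range(COLS):
            tiles[(r, c)] = image[r * TILE:(r + 1) * TILE,
                                  c * TILE:(c + 1) * TILE]
    return tiles

tiles = crop_to_grid(np.zeros((ROWS * TILE, COLS * TILE, 3), dtype=np.uint8))
assert len(tiles) == 36
```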
4.2 Implementation Details
The algorithm was implemented in Python, and all experiments were performed on an NVIDIA Tesla K80 with 24 GB of memory under a Linux operating system with 2 virtual CPUs at 2 GHz and 7680 MiB of system memory. Mask R-CNN was trained with ResNet101 and ResNet50 as backbone architectures for 25 epochs using a learning momentum of 0.9, a learning rate of 0.001, a weight decay of 0.0001 and a batch size of 4 images per GPU. Training took 4 hours 15 minutes with ResNet101 and 3 hours 58 minutes with ResNet50.
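These settings correspond to a training configuration along the following lines; the attribute names follow the convention of the Matterport Mask R-CNN code (Waleed Abdulla, 2017) but are reproduced here only as a sketch and may differ in detail from the configuration actually used.

```python
# Sketch of the reported training configuration (attribute names are
# assumptions following the Matterport Mask R-CNN convention).
class DamageDetectionConfig:
    NAME = "tin_milling_tool_damage"
    BACKBONE = "resnet50"          # alternatively "resnet101"
    NUM_CLASSES = 1 + 1            # background + damage
    IMAGES_PER_GPU = 4
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    WEIGHT_DECAY = 0.0001
    EPOCHS = 25
```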
4.3 Used Evaluation Metrics
The anchors with IoU higher than 0.7 for all of the
Ground Truth (GT) boxes are assigned to positive
labels, whereas anchors with IoU less than 0.3 for
all of the GT boxes are assigned to negative labels.
For evaluating the performance of the prepared models, the standard metrics Intersection over Union (overall IoU) and precision have been used. The average precision AP over different IoU thresholds has been considered from 0.5 to 0.95 in steps of 0.05 (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95) (Ren et al., 2017). The precision metric is denoted as (AP, AP_50, AP_55, AP_60, AP_65, ..., AP_95), where AP_50 indicates the average precision at an IoU threshold of 0.5, AP_55 the average precision at an IoU threshold of 0.55, and so on.
\[
\text{Precision} = \frac{TP}{TP + FP} \quad (4)
\]
Precision represents the exactness as the ratio be-
tween the number of correctly detected pixels and the
total number of detected pixels.
True Positives (TP): The number of pixels cor-
rectly identified as a mask (white pixels).
True Negatives (TN): The number of pixels cor-
rectly identified as not part of a mask (black pix-
els).
False Positives (FP): The number of pixels incor-
rectly identified as a mask.
False Negatives (FN): The number of pixels incor-
rectly identified as not part of a mask.
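A small sketch of these metrics is given below; the per-threshold AP computation is left abstract, since its exact implementation is not spelled out here.

```python
import numpy as np

def pixel_precision(pred_mask, gt_mask):
    """Precision over boolean HxW pixel masks: TP / (TP + FP), Equation 4."""
    tp = np.logical_and(pred_mask, gt_mask).sum()    # correctly detected pixels
    fp = np.logical_and(pred_mask, ~gt_mask).sum()   # falsely detected pixels
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def mean_ap_over_iou_thresholds(ap_at_threshold):
    """Average the AP returned by ap_at_threshold(t) over t = 0.5, 0.55, ..., 0.95."""
    thresholds = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))
```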
Figure 5: Result of high-resolution damage detection and instance segmentation on a TiN coated milling tool. On the right, the damages are marked with bounding boxes and the prediction probability.
4.4 Damage Detection Result
The results presented in this paper are based on 864 images. 60% of the dataset was used for training, 20% for validation and the remaining 20% for testing the model.
4.4.1 High Resolution Milling Tool Damage
Detection Result
Figure 5 displays an example of a TiN coated milling tool image captured with the proposed measurement setup (left side). On the right side, the result after applying the Mask R-CNN algorithm to detect damages can be observed. The damages are depicted and marked with bounding boxes and prediction probabilities. It can be clearly seen in Figure 5 that almost all damages have been detected. Semantic segmentation can be used to determine the size, shape and localization of the damages.
4.4.2 ResNet101 vs ResNet50
Generally, deep learning requires a huge amount of data, and in most cases it is difficult to find suitable data sets, especially for optically critical components such as drilling or milling tools. For this reason, the training data was augmented.
At first the detection stage was performed using the ResNet101 backbone, followed by ResNet50. Several tests have been performed using different training parameters:
Figure 6: Comparison of the average precision (AP@50) as a function of training epochs for the ResNet50 and ResNet101 backbone architectures.
Number of epochs
RPN anchor scales
AP at different IoU thresholds (0.50, 0.55, 0.60, ..., 0.95)
Firstly, the average precision obtained with the ResNet50 and ResNet101 backbones was compared, as shown in Figure 6. To compare the detection and learning performance, Mask R-CNN was trained with the ResNet101 and the ResNet50 architecture for 25 epochs while calculating the AP as a function of the epochs. Figure 6 shows the AP@50 for both models as a function of the epochs. An AP of 0.83 was achieved with the ResNet50 backbone architecture at epoch 21, whereas with the ResNet101 backbone architecture only an AP of 0.71 at epoch 23 was achievable. ResNet50 has a smaller number of layers, which helps to avoid overfitting. The model file size of ResNet50 is about 180 MB compared with 250 MB for ResNet101, making ResNet50 more attractive for a variety of applications with less computational complexity. In both cases the risk of overfitting increased after 23 epochs. Although the damages differ in shape and size, the necessary complexity and depth of an appropriate neural network cannot be easily determined. Due to the small data size and the presence of only two classes (damage or no damage), ResNet 50 seems to perform better than ResNet 101.
In the context of instance detection and instance segmentation, both the damage identification and its localization must be performed. The Intersection over Union (IoU) measures the overlap between the predicted boundary and the ground-truth boundary. Thus, the average precision for different IoU thresholds was calculated. Appropriate anchor box scales can improve the efficiency and accuracy of the region proposal generation and hence the overall object detection accuracy. Therefore, the AP for different IoU thresholds and different anchor box scales was evaluated, as shown in Figure 7.
Figure 7: AP at different IoU thresholds (0.50, 0.55, 0.60, ..., 0.95). The ResNet50 architecture performs better than the ResNet101 architecture.
It was found that the model with the ResNet50 backbone architecture performs better than ResNet101, especially when adopting the anchor box scales {16², 32², 64², 128², 256²} with the aspect ratios {1:1, 1:2, 2:1}.
5 CONCLUSION
The results show that the high-quality, high-resolution images captured with the proposed measurement setup enable superior results with deep convolutional neural networks. For training the network, each image was divided into 36 fragments to ensure high-resolution damage detection by exploiting the full capability of the FPN. Both in instance-based and semantic image segmentation, promising results have been achieved using few images combined with data augmentation, which paves the way for new opportunities in inspection applications. To identify the damages, Mask R-CNN, which consists of feature extraction from the images followed by further convolutional layers, was implemented. The ResNet 50 and ResNet 101 architectures were fine-tuned for feature extraction. Segmentation with ResNet 50 achieved better results with less computational time than ResNet 101.
As future work, a bigger dataset will be generated that includes different cutting tools with different coatings and variants. Another area of research will focus on estimating the surface roughness of the tools using images from the developed measurement setup.
REFERENCES
Caggiano, A. (2018). Cloud-based manufacturing pro-
cess monitoring for smart diagnosis services. Inter-
national Journal of Computer Integrated Manufactur-
ing, 31(7):612–623.
Chen, J.-Y. and Lee, B.-Y. (2010). Development of a sim-
plified machine for measuring geometric parameters
of end mills.
Cupek, R., Erdogan, H., Huczala, L., Wozar, U., and Ziebinski, A. (2015). Agent based quality management in lean manufacturing. In Núñez, M., Nguyen, N. T., Camacho, D., and Trawiński, B., editors, Computational Collective Intelligence, volume 9329 of Lecture Notes in Computer Science, pages 89–100. Springer International Publishing, Cham.
Cupek, R., Ziębiński, A., Drewniak, M., and Fojcik, M. (2018). Improving KPI based performance analysis in discrete, multi-variant production. In Nguyen, N. T., Hoang, D. H., Hong, T.-P., Pham, H., and Trawiński, B., editors, Intelligent Information and Database Systems, volume 10752 of Lecture Notes in Computer Science, pages 661–673. Springer International Publishing, Cham.
Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D.
(2014). Scalable object detection using deep neural
networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2147–
2154.
Faster, R. (2015). Towards real-time object detection with
region proposal networks. Advances in neural infor-
mation processing systems, page 9199.
Girshick, R. (2015). Fast r-cnn. In 2015 IEEE International
Conference on Computer Vision (ICCV), pages 1440–
1448. IEEE.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
Gonzalez, S., Arellano, C., and Tapia, J. E. (2019). Deep-
blueberry: Quantification of blueberries in the wild
using instance segmentation. IEEE Access, 7:105776–
105788.
Grzechca, D., Ziębiński, A., and Rybka, P. (2017). Enhanced reliability of ADAS sensors based on the observation of the power supply current and neural network application. In Nguyen, N. T., Papadopoulos, G. A., Jędrzejowicz, P., Trawiński, B., and Vossen, G., editors, Computational Collective Intelligence, volume 10449 of Lecture Notes in Computer Science, pages 215–226. Springer International Publishing, Cham.
Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J.
(2014). Simultaneous detection and segmentation.
In European Conference on Computer Vision, pages
297–312.
Hariharan, B., Arbelaez, P., Girshick, R., and Malik,
J. (2017). Object instance segmentation and fine-
grained localization using hypercolumns. IEEE trans-
actions on pattern analysis and machine intelligence,
39(4):627–639.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In Proceedings of the IEEE international
conference on computer vision, pages 2961–2969.
He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2020).
Mask r-cnn. IEEE transactions on pattern analysis
and machine intelligence, 42(2):386–397.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyra-
mid pooling in deep convolutional networks for visual
recognition. IEEE transactions on pattern analysis
and machine intelligence, 37(9):1904–1916.
Kang, J. and Gwak, J. (2019). Ensemble of in-
stance segmentation models for polyp segmentation in
colonoscopy images. IEEE Access, 7:26440–26447.
Liang, S. Y., Hecker, R. L., and Landers, R. G. (2004). Ma-
chining process monitoring and control: the state-of-
the-art. J. Manuf. Sci. Eng., 126(2):297–310.
Liu, L., Zheng, F., Zhu, L., Li, Y., Huan, K., Shi, X.,
and Liu, G. (2015). Luminance uniformity of inte-
grating sphere light source. In 2015 International
Conference on Optoelectronics and Microelectronics
(ICOM), pages 265–268.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. 9905:21–37.
Nur Ömeroğlu, A., Kumbasar, N., Argun Oral, E., and Ozbek, I. Y. (2019). Mask R-CNN algoritması ile hangar tespiti (Hangar detection with Mask R-CNN algorithm). In 27th Signal Processing and Communications Applications Conference (SIU), pages 1–4.
Pinheiro, P. O., Collobert, R., and Dollar, P. (2015). Learn-
ing to segment object candidates.
Pinheiro, P. O., Lin, T.-Y., Collobert, R., and Dollàr, P.
(2016). Learning to refine object segments.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28:91–99.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster
r-cnn: Towards real-time object detection with re-
gion proposal networks. IEEE transactions on pattern
analysis and machine intelligence, 39(6):1137–1149.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).
Learning representations by back-propagating errors.
nature, 323(6088):533–536.
Schulz, H. and Moriwaki, T. (1992). High-speed machin-
ing. CIRP annals, 41(2):637–643.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus,
R., and LeCun, Y. (2014). Overfeat: Integrated recog-
nition, localization and detection using convolutional
networks. 2nd international conference on learning
representations, iclr 2014. In 2nd International Con-
ference on Learning Representations, ICLR 2014.
Short, M. and Twiddle, J. (2019). An industrial digitaliza-
tion platform for condition monitoring and predictive
maintenance of pumping equipment. Sensors (Basel,
Switzerland), 19(17).
Shrivastava, A., Gupta, A., and Girshick, R. (2016). Train-
ing region-based object detectors with online hard ex-
ample mining. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
761–769.
Spišák, E. and Majernikova, J. (2017). Increasing of dura-
bility of cutting tools. Advances in Science and Tech-
nology Research Journal, 11:141–146.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013).
On the importance of initialization and momentum in
deep learning. In International conference on machine
learning, pages 1139–1147.
Szegedy, C., Reed, S., Erhan, D., Anguelov, D., and Ioffe, S. (2014). Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441.
Szegedy, C., Toshev, A., and Erhan, D. (2013). Deep neural
networks for object detection.
Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He,
Bharath Hariharan, and Serge Belongie (2017). Fea-
ture pyramid networks for object detection.
Waleed Abdulla (2017). Mask r-cnn for object detection
and instance segmentation on keras and tensorflow.
Yingjie, Z. (2014). Energy efficiency techniques in machin-
ing process: a review. The International Journal of
Advanced Manufacturing Technology, 71(5-8):1123–
1132.
Yli-Ojanperä, M., Sierla, S., Papakonstantinou, N., and Vy-
atkin, V. (2019). Adapting an agile manufacturing
concept to the reference architecture model industry
4.0: A survey and case study. Journal of Industrial
Information Integration, 15(5):147–160.
You, Y., Cao, J., Zhang, Y., Liu, F., and Zhou, W. (2019).
Nearshore ship detection on high-resolution remote
sensing image via scene-mask r-cnn. IEEE Access,
7:128431–128444.
Zhang, Q., Chang, X., and Bian, S. B. (2020). Vehicle-
damage-detection segmentation algorithm based on
improved mask rcnn. IEEE Access, 8:6997–7004.