fficientNet network to extract image features, which
were then merged via the BiFPN network and
analyzed using DETR, enhancing the detection
efficiency in tasks involving the inspection of anti-
vibration hammers on power transmission lines.
Representative models such as DETR and YOLOv5
have proven their effectiveness in a broad range of
object detection tasks.
DETR’s self-attention mechanism endows it with strong global contextual awareness, which helps the model fully consider the relationships between ice/snow targets and the aircraft; moreover, DETR operates without preset anchor boxes and can therefore flexibly adapt to ice formations of varying sizes and shapes. Given these
advantages, applying the DETR model to the task of
detecting ice on aircraft surfaces holds great potential.
However, when applied to detecting ice and snow accumulation on aircraft surfaces, DETR exhibits positioning deviations, especially for small targets such as icicles and clear ice, indicating that its localization performance needs enhancement. In 2023, Chen Y (Chen, 2023) from
the Institute of Artificial Intelligence, Chinese
Academy of Sciences and the University of Chinese
Academy of Sciences proposed a localization
optimization network tailored to the DETR model and
its derivatives. This network extracts multi-scale
features from DETR’s ResNet backbone using a
Feature Pyramid Network (FPN) and uses these
features alongside Ground Truth to correct predicted
bounding boxes, thereby improving the localization
accuracy of the DETR model. This development is
significant for addressing the aforementioned
application issues. Employing advanced deep learning techniques to detect ice on aircraft surfaces gives maintenance personnel a precise, efficient, and automated icing detection method, offering reliable decision support and further raising flight safety and operational efficiency.
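The localization-refinement idea of (Chen, 2023) can be illustrated, under assumed names and shapes, as a small head that pools multi-scale FPN features inside each DETR-predicted box and regresses a correction supervised against the Ground Truth boxes during training. The class below is a hypothetical sketch, not the authors' released code:

import torch
import torch.nn as nn
import torchvision

# Hypothetical RefineBox-style refinement head (illustrative names and shapes).
# Multi-scale FPN features computed from the frozen DETR ResNet backbone are
# pooled inside each predicted box, and a small MLP regresses per-box offsets;
# during training the refined boxes are supervised against the Ground Truth.
class BoxRefiner(nn.Module):
    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        self.pool = torchvision.ops.MultiScaleRoIAlign(
            featmap_names=["p3", "p4", "p5"], output_size=7, sampling_ratio=2)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))  # (dx, dy, dw, dh) correction per box

    def forward(self, fpn_feats, boxes_per_image, image_sizes):
        # fpn_feats: dict of FPN levels, e.g. {"p3": [B,C,H,W], "p4": ..., "p5": ...}
        # boxes_per_image: list of [N_i, 4] DETR-predicted boxes in (x1, y1, x2, y2)
        roi_feats = self.pool(fpn_feats, boxes_per_image, image_sizes)
        deltas = self.head(roi_feats)                          # [sum(N_i), 4]
        return torch.cat(boxes_per_image, dim=0) + deltas      # corrected boxes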
2 IMPROVED DETR MODEL
BASED ON REFINEBOX
2.1 DETR Model
DETR (Detection Transformer) is an object detection
network based on the Transformer architecture.
Unlike traditional object detection methods, DETR
adopts an end-to-end approach, outputting the classes
and positions of objects directly through the
Transformer network, thus accomplishing the task of
object detection.
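As an illustration of this end-to-end behaviour, the sketch below runs a pretrained DETR model on a single image. It assumes the publicly released facebookresearch/detr torch.hub entry point and a hypothetical input file name; it is a minimal usage sketch, not the detection pipeline used in this work.

import torch
import torchvision.transforms as T
from PIL import Image

# Load the public DETR-ResNet50 model via torch.hub (assumed entry point of
# the facebookresearch/detr repository) and run it on one image.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

preprocess = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("aircraft_surface.jpg").convert("RGB")   # hypothetical input image
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))

# DETR returns a fixed set of query predictions: class logits and normalized
# (cx, cy, w, h) boxes. Queries whose best class score is low are "no object".
probs = out["pred_logits"].softmax(-1)[0, :, :-1]   # drop the "no object" class
keep = probs.max(-1).values > 0.7
print(out["pred_boxes"][0, keep])                    # boxes of confident queries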
DETR provides a novel approach to end-to-end
object detection algorithms by combining CNNs and
the Transformer model to predict the class
information of N objects, including both targets and
background, in parallel. Leveraging the
Transformer's focus on global features, the DETR
model possesses powerful global feature learning
capabilities (Zhang et al., 2022). Specifically, DETR
first encodes the input image into feature vectors via
a CNN, which are then combined with positional
encodings. The computation of positional encodings
is as follows (Chen et al., 2023):
\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
\tag{1}
\]
In the formula, "pos" represents the position of the
image block; "d" represents the dimension of the
vector; and "2i" and "2i+1" represent the even and
odd dimensions within "d", respectively. After the positional encoding is added, the feature vectors are processed by the encoder, which transforms the feature map. Through linear layers and a multi-head self-attention mechanism, DETR generates a fixed-size set of encoded vectors representing the objects present in
the image. These encoded vectors are matched with
known category vectors, thus determining the
probability distribution of classes for each object.
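The positional encoding in Eq. (1) can be computed with a short sketch. The function below assumes a 1D sequence of patch positions and an even model dimension d; DETR itself applies the same sine/cosine scheme separately along the x and y axes of the feature map.

import torch

# Sketch of Eq. (1): even feature dimensions use sine, odd dimensions use cosine.
# Assumes a 1D sequence of patch positions and an even model dimension d.
def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # [P, 1]
    i = torch.arange(0, d, 2, dtype=torch.float32)                       # the 2i values
    angle = pos / (10000 ** (i / d))                                     # [P, d/2]
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe

# Example: encodings for 50 patch positions with d = 256, added element-wise
# to the CNN feature vectors before they enter the encoder.
print(sinusoidal_encoding(50, 256).shape)   # torch.Size([50, 256])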
The Encoder in DETR receives feature vectors
and processes them through a series of self-attention
layers and feedforward neural networks, encoding
them to extract high-level feature representations.
These representations are then passed to the Decoder,
serving as inputs for subsequent processes. The
Decoder receives these feature representations from
the Encoder and generates the object detection results.
Typically, the Decoder is composed of a series of
self-attention layers and feedforward neural networks,
which allow it to merge and process features at
different levels. In each Decoder layer, the model
generates new predictions based on the current
feature representations and previous prediction
outcomes. These predictions include information
about the object's class and location.
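The Encoder-Decoder flow described above can be summarized with a minimal sketch. It uses PyTorch's generic nn.Transformer and illustrative dimensions as a stand-in for DETR's actual implementation (which adds per-layer auxiliary losses and an MLP box head), so it shows the data flow rather than the exact architecture.

import torch
import torch.nn as nn

# Minimal stand-in for the Encoder-Decoder flow: the encoder refines the
# flattened CNN features, and the decoder turns N learned object queries into
# per-query class and box predictions.
class TinyDETRHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=6,
            num_decoder_layers=6, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # normalized (cx, cy, w, h)

    def forward(self, features):
        # features: [B, HW, d_model] flattened CNN features plus positional encoding
        queries = self.query_embed.weight.unsqueeze(0).expand(features.size(0), -1, -1)
        hs = self.transformer(src=features, tgt=queries)        # decoder outputs
        return self.class_head(hs), self.box_head(hs).sigmoid()

feats = torch.randn(2, 49, 256)            # e.g. a 7x7 feature map, batch of 2
logits, boxes = TinyDETRHead()(feats)
print(logits.shape, boxes.shape)           # [2, 100, 92] and [2, 100, 4]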
The interaction between the Encoder and Decoder is facilitated by a multi-head cross-attention (encoder-decoder attention) mechanism. This mechanism allows the model to
exchange information across different levels, thereby
better capturing the global context and the
relationships between objects (Fan and Ma, 2023;
Vaswani, 2017; Chen et al., 2018). The design of the
Encoder and Decoder in DETR aims to utilize the
Transformer's self-attention mechanism and
feedforward neural networks to accomplish end-to-
end object detection tasks. This architecture not only