Table 2: Comparison with Faster R-CNN on the PASCAL VOC 2007 dataset.

Model          Parameters   mAP (%)
Faster R-CNN   166M         65.7
Transformer    41M          68.8
Table 2 shows that, despite having far fewer parameters (41M vs. 166M), the transformer model achieves a higher mAP, outperforming Faster R-CNN by 3.1 points. Moreover, the mAP obtained on PASCAL VOC 2007 is strictly higher than that obtained on COCO 2017, which demonstrates the effectiveness of transfer learning.
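As a concrete illustration of this transfer-learning setup, the sketch below fine-tunes a COCO-pretrained DETR-style detector on the 20 PASCAL VOC classes. This is a minimal sketch, not the paper's actual code: the torch.hub entry point and the class_embed attribute follow the public DETR release, and the learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Load a COCO-pretrained DETR with a ResNet-101 backbone (public DETR
# release on torch.hub; assumed here, not necessarily the paper's code).
model = torch.hub.load('facebookresearch/detr', 'detr_resnet101',
                       pretrained=True)

# Replace the classification head: 20 VOC classes + 1 "no object" class.
num_voc_classes = 20
model.class_embed = nn.Linear(model.class_embed.in_features,
                              num_voc_classes + 1)

# Fine-tune with a lower learning rate on the pretrained backbone, a
# common transfer-learning choice (rates are illustrative assumptions).
optimizer = torch.optim.AdamW(
    [{'params': model.backbone.parameters(), 'lr': 1e-5},
     {'params': [p for n, p in model.named_parameters()
                 if not n.startswith('backbone')], 'lr': 1e-4}],
    weight_decay=1e-4)
```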
4 CONCLUSIONS
This study proposes a new object detection approach based on the transformer architecture, together with a bipartite matching loss for set prediction. The model consists of a ResNet-101 backbone, an attention-based encoder, a decoder driven by learned object queries, and a feedforward prediction network, as sketched below.
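The following is a minimal sketch of such a model, assuming a DETR-style design; the hidden size, number of object queries, and layer counts are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class TransformerDetector(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, num_queries=100,
                 nheads=8, num_layers=6):
        super().__init__()
        # ResNet-101 backbone, truncated before the pooling/classifier head.
        backbone = torchvision.models.resnet101(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv projecting the 2048-d backbone features to hidden_dim.
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Encoder-decoder transformer over the flattened feature map.
        self.transformer = nn.Transformer(hidden_dim, nheads, num_layers,
                                          num_layers, batch_first=True)
        # Learned object queries fed to the decoder.
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Feedforward prediction heads: class logits (+1 for "no object")
        # and normalized box coordinates.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):
        feats = self.input_proj(self.backbone(images))   # (B, D, H, W)
        src = feats.flatten(2).permute(0, 2, 1)          # (B, HW, D)
        queries = self.query_embed.weight.unsqueeze(0).expand(
            images.size(0), -1, -1)                      # (B, Q, D)
        hs = self.transformer(src, queries)              # (B, Q, D)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```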
On top of this architecture, the loss function is a two-step set prediction loss designed specifically for object detection.
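The sketch below shows what such a two-step loss can look like for a single image, assuming DETR-style bipartite matching: step one solves a one-to-one assignment between predictions and ground-truth boxes with the Hungarian algorithm, and step two computes classification and box losses over the matched pairs. The cost and loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_prediction_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                        no_object_class, box_weight=5.0):
    """pred_logits: (Q, C+1), pred_boxes: (Q, 4), gt_labels: (N,),
    gt_boxes: (N, 4); single image with N >= 1 objects, for clarity."""
    prob = pred_logits.softmax(-1)
    # Step 1: build a (Q, N) matching cost from class probability and
    # L1 box distance, then solve the one-to-one assignment.
    cost = -prob[:, gt_labels] + box_weight * torch.cdist(
        pred_boxes, gt_boxes, p=1)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.long)
    # Step 2: classification loss over all queries (unmatched queries
    # are pushed toward "no object") plus an L1 loss on matched boxes.
    target_classes = torch.full((pred_logits.size(0),), no_object_class,
                                dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]
    cls_loss = F.cross_entropy(pred_logits, target_classes)
    box_loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return cls_loss + box_weight * box_loss
```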
In addition, transfer learning is employed to improve model performance across two standard object detection benchmarks. The paper conducts experiments on both datasets to analyse the performance of the model, implementing Faster R-CNN as a baseline for comparison. On both datasets, the transformer model outperforms Faster R-CNN by roughly 3 mAP points. Moreover, the transformer fine-tuned on PASCAL VOC 2007 reaches an mAP of 68.8, significantly higher than its result on COCO 2017, which demonstrates the effectiveness of transfer learning well. This redesigned detection pipeline still presents a number of challenges, particularly in training, optimization, and small-object performance, problems that earlier detection models have addressed through years of incremental refinement. In future work, extending the transformer approach to semantic segmentation tasks will be considered as the next phase of research.