Object detection based on shallow spatial feature fusion and adaptive channel selection
1. Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 2. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Abstract: To address the loss of detection speed caused by deepening the backbone network, an object detection method based on feature fusion and channel selection was proposed. The information in different subspaces was multiplexed at each down-sampling stage, and feature fusion was performed with the proposed 8-fold, 4-fold and 2-fold down-sampling modules. After adaptive channel selection, the fused feature maps were assembled into the SSD network to strengthen the role of global information in the object detection model. A classification loss function based on cosine distance was designed to make object classification more accurate. With the VGG network as the backbone and the SSD object detection network as the reference, the proposed down-sampling feature fusion module, the adaptive channel selection module and the improved loss function were added to conduct multiple groups of comparative experiments. The results show that, with a network input size of 512×512, the mean average precision of the proposed method reaches 82.2% on the Pascal VOC 2007 and Pascal VOC 2012 datasets, which is higher than that of the single-stage object detection models. While maintaining real-time speed, the proposed method achieves the same performance as detection models with deeper backbone networks.
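The abstract does not give the exact form of the cosine-distance classification loss. As a minimal sketch of one common way such a loss can be built, the example below scores each class by the cosine similarity between a feature vector and a per-class weight vector, multiplies by a scale factor, and applies softmax cross-entropy. The function names, the `scale` parameter and its value are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def cosine_classification_loss(feature, class_weights, target, scale=10.0):
    """Softmax cross-entropy over scaled cosine similarities.

    feature:       list of floats, the feature vector of one candidate box
    class_weights: list of weight vectors, one per class
    target:        index of the ground-truth class
    scale:         hypothetical temperature; the paper's value is unknown
    """
    logits = [scale * cosine_similarity(feature, w) for w in class_weights]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Toy usage: two classes with orthogonal weight vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = cosine_classification_loss([1.0, 0.0], weights, target=0)
loss_wrong = cosine_classification_loss([1.0, 0.0], weights, target=1)
loss_rescaled = cosine_classification_loss([5.0, 0.0], weights, target=0)
```

Because the logits depend only on the angle between the feature and each class weight, rescaling the feature leaves the loss unchanged. This is one plausible motivation for a cosine-based loss, although the abstract does not state the authors' reasoning.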