Abstract: Current temporal action detection methods do not jointly exploit the overall and local spatio-temporal relations among action proposals, which limits detection and localization performance. To address this problem, a temporal action detection method based on an overall-local-aware graph network is proposed. To obtain a richer overall spatio-temporal feature representation of the proposals, an overall relation graph reasoning sub-network is constructed from the feature similarity and temporal overlap of the action proposals. To capture local relations among proposals at different time scales, a local relation graph reasoning sub-network is constructed from the temporal partial order of the proposals; it consists of multiple levels of three-body similar graphs and three-body complementary graphs. The resulting overall-local-aware proposal features are used to predict and localize actions. Experiments were conducted on two public datasets, THUMOS14 and ActivityNet1.3, and evaluated with the mean average precision (mAP) metric. The results show that the proposed method improves action detection performance compared with the advanced methods PGCN, G-TAD, TAL-Net and CDC.
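As a rough illustration of how feature similarity and temporal overlap might be combined into an overall relation graph over proposals, the following minimal sketch connects two proposals when either their temporal IoU or the cosine similarity of their features exceeds a threshold. The function names, the either-or edge rule, and the threshold values are hypothetical choices made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def build_overall_graph(segments, features, iou_threshold=0.3, sim_threshold=0.8):
    """Return a 0/1 adjacency matrix over proposals.

    Assumption (not from the source): an edge links proposals i and j when
    their temporal IoU OR their feature cosine similarity exceeds its
    threshold. Self-loops are always included.
    """
    n = len(segments)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                adj[i][j] = 1
                continue
            overlaps = temporal_iou(segments[i], segments[j]) > iou_threshold
            similar = cosine_similarity(features[i], features[j]) > sim_threshold
            adj[i][j] = 1 if (overlaps or similar) else 0
    return adj


# Three toy proposals: the first two overlap in time, the first and third
# have similar features but are far apart in time.
segments = [(0.0, 10.0), (5.0, 15.0), (40.0, 50.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
adjacency = build_overall_graph(segments, features)
```

In this toy case the first proposal is linked to both of the others, while the second and third remain unconnected; a graph convolution over such an adjacency matrix would then aggregate features from the linked proposals.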
[1]
QI M, XU H, LI S, et al. An action recognition method based on two-stream network[J]. Journal of Jilin University (Science Edition), 2023, 61(2):347-352.(in Chinese)
[2]
SHOU Z, WANG D A, CHANG S F. Temporal action localization in untrimmed videos via multi-stage CNNs[C]∥Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2016:1049-1058.
[3]
CHEN J M, CHEN L P. A video abnormal behavior detection and location method of optimized FCN[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2021,33(1):126-134.(in Chinese)
[4]
CHAO Y W, VIJAYANARASIMHAN S, SEYBOLD B, et al. Rethinking the faster R-CNN architecture for temporal action localization[C]∥Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2018:1130-1139.
[5]
LIN T W, ZHAO X, SHOU Z. Single shot temporal action detection[C]∥Proceedings of the 2017 ACM Multimedia Conference. New York: ACM, 2017:988-996.
[6]
SHOU Z, CHAN J, ZAREIAN A, et al. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos[C]∥Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:1417-1426.
[7]
FARHA Y A, GALL J. MS-TCN: multi-stage temporal convolutional network for action segmentation[C]∥Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2019:3570-3579.
[8]
CHEN P H, GAN C, SHEN G Y, et al. Relation attention for temporal action localization[J]. IEEE Transactions on Multimedia, 2020, 22(10):2723-2733.
[9]
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. arXiv, 2017,DOI:10.48550/arXiv.1706.03762.
[10]
HU H, GU J Y, ZHANG Z, et al. Relation networks for object detection[C]∥Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2018:3588-3597.
[11]
KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[C]∥Proceedings of the 5th International Conference on Learning Representations.[S.l.]: ICLR, 2017:596-603.
[12]
CHEN C Y, GRAUMAN K. Efficient activity detection in untrimmed video with max-subgraph search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(5): 908-921.
[13]
YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]∥Proceedings of the 32nd AAAI Conference on Artificial Intelligence.[S.l.]: AAAI Press, 2018: 7444-7452.
[14]
YANG J W, LU J S, LEE S, et al. Graph R-CNN for scene graph generation[C]∥Proceedings of the 15th European Conference on Computer Vision. Heidelberg:Springer Verlag, 2018:690-706.
[15]
ZENG R H, HUANG W B, GAN C, et al. Graph convolutional networks for temporal action localization[C]∥Proceedings of the 17th IEEE/CVF International Conference on Computer Vision. Piscataway:IEEE, 2019:7093-7102.
[16]
ZHOU H, ZHAN Y Z, MAO Q R. Video anomaly detection based on space-time fusion graph network learning[J]. Journal of Computer Research and Development, 2021,58(1): 48-59.(in Chinese)
[17]
LIN T W, ZHAO X, SU H S, et al. BSN: boundary sensitive network for temporal action proposal generation[C]∥Proceedings of the 15th European Conference on Computer Vision. Heidelberg:Springer Verlag,DOI: 10.1007/978-3-030-01225-0_1.
[18]
CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]∥Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:4724-4733.
[19]
JIANG Y G, LIU J G, ZAMIR A R, et al. THUMOS challenge: action recognition with a large number of classes[EB/OL].[2021-11-29]. http://crcv.ucf.edu/THUMOS14/.
[20]
HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]∥Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2015:961-970.
[21]
GAO J Y, YANG Z H, SUN C, et al. TURN TAP: temporal unit regression network for temporal action proposals[C]∥Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway:IEEE, 2017:3648-3656.
[22]
XU H J, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C]∥Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway:IEEE, 2017:5794-5803.
[23]
BUCH S, ESCORCIA V, SHEN C Q, et al. SST: single-stream temporal action proposals[C]∥Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:6373-6382.
[24]
GAO J Y, YANG Z H, NEVATIA R. Cascaded boundary regression for temporal action detection[C]∥Proceedings of the 28th British Machine Vision Conference.[S.l.]:BMVA Press, DOI:10.5244/c.31.52.
[25]
XU M M, ZHAO C, ROJAS D S, et al. G-TAD: sub-graph localization for temporal action detection[C]∥Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2020:10153-10162.
[26]
SINGH B, MARKS T K, JONES M, et al. A multi-stream bi-directional recurrent neural network for fine-grained action detection[C]∥Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE Computer Society, 2016:1961-1970.
[27]
DAI X Y, SINGH B, ZHANG G Y, et al. Temporal context network for activity localization in videos[C]∥Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway:IEEE, 2017:5727-5736.
[28]
ZHAO Y, XIONG Y J, WANG L M, et al. Temporal action detection with structured segment networks[J]. International Journal of Computer Vision, 2020,128(1):74-95.