Abstract: To cope with variations in video sampling and differing movement speeds of target subjects, and to address two limitations of current approaches — deep networks that learn only a single video-sequence feature, and multiple action classifiers whose classification confidence differs — a video action recognition method based on a multi-time-scale two-stream CNN with confidence fusion was proposed. A two-stream network learns and extracts contextual features between video frames at different temporal spans across multiple time scales, and an LSTM predicts the action category from each feature. For each action classifier, at each scale and in each modality, a category-decision confidence is established that accounts for both the overall difference and the distinctiveness of a category relative to the other categories for a given sample. The confidence values and category-decision scores of all classifiers are then fused to recognize the action in the video. Experiments on the UCF101 dataset show that the proposed method effectively learns contextual information over multiple time scales and improves video action recognition accuracy to 92.2%.
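The fusion step described above can be sketched as a confidence-weighted combination of the score vectors produced by the per-scale, per-modality classifiers. This is a minimal illustration, not the paper's implementation: the function name, the example scores, and the confidence values are invented placeholders, and the paper's actual confidence measure (built from inter-category differences and category distinctiveness) is not reproduced here.

```python
import numpy as np

def fuse_predictions(scores, confidences):
    """Fuse per-classifier category scores into one decision.

    scores:      (n_classifiers, n_classes) softmax outputs, one row per
                 time-scale/modality classifier.
    confidences: (n_classifiers,) confidence of each classifier's decision.
    Returns the index of the winning action category.
    """
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()      # normalize confidences into fusion weights
    fused = w @ scores   # confidence-weighted average of score vectors
    return int(np.argmax(fused))

# Example: three classifiers (e.g. RGB and optical-flow streams at two
# time scales) voting over four action classes.
scores = [
    [0.60, 0.20, 0.10, 0.10],
    [0.30, 0.50, 0.10, 0.10],
    [0.55, 0.25, 0.10, 0.10],
]
confidences = [0.9, 0.4, 0.8]
print(fuse_predictions(scores, confidences))  # → 0
```

A higher-confidence classifier thus pulls the fused decision toward its own top category, which is the intended effect of weighting decisions by per-classifier confidence rather than averaging scores uniformly.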