|
|
Speech recognition method based on multitask loss with additional language model |
1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064, China; 2. Operation Management Branch of Shaanxi Transportation Holding Group Co., Ltd., Xi'an, Shaanxi 710065, China
|
|
Abstract To address the problems that the attention mechanism's overly flexible alignment adapts poorly to complex environments and that simple end-to-end models do not make full use of language features, a speech recognition method based on a multitask loss with an additional language model was investigated. By analyzing the characteristics of the speech signal, features containing more information were selected for training. Based on the attention-based Conformer end-to-end model, the model was trained with a multitask loss in which a CTC loss assists the pure Conformer (Attention) model, yielding the Conformer-CTC speech recognition model. Building on the Conformer-CTC model, and after analyzing and comparing the characteristics and effects of several language models, a Transformer language model was added to the training of the above model through a rescoring mechanism, yielding the Conformer-CTC-Transformer speech recognition model. Experiments on these models were carried out on the AISHELL-1 data set. The results show that, compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by a further 0.79% compared with the Conformer-CTC model. The adaptability of attention alignment in complex environments can be improved by the CTC loss, and after rescoring the Conformer-CTC model with the Transformer language model, the recognition accuracy increases by another 0.30%. Compared with some existing end-to-end models, the Conformer-CTC-Transformer model achieves better recognition results, indicating that the model has certain effectiveness.
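The two mechanisms described in the abstract, the hybrid CTC/attention multitask loss and language-model rescoring of an n-best list, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the CTC weight of 0.3, and the LM interpolation weight `alpha` are assumed values chosen for the example.

```python
def multitask_loss(l_ctc, l_att, ctc_weight=0.3):
    """Hybrid CTC/attention training loss: a weighted sum of the CTC loss
    and the attention decoder loss. The weight 0.3 is an assumed common
    choice, not a value reported in the paper."""
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att

def rescore_nbest(nbest, lm_score, alpha=0.5):
    """Re-rank an n-best list of (hypothesis, acoustic log-prob) pairs with
    an external language model: total = am_log_prob + alpha * lm_log_prob.
    Returns the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: h[1] + alpha * lm_score(h[0]))[0]

# Toy example: the LM prefers the grammatical hypothesis even though the
# acoustic model scored it slightly lower.
nbest = [("hello word", -3.0), ("hello world", -3.2)]
toy_lm = {"hello word": -5.0, "hello world": -1.0}
best = rescore_nbest(nbest, toy_lm.get)
```

In practice the LM score would come from a trained Transformer language model rather than a lookup table, and rescoring is applied to the n-best hypotheses produced by the Conformer-CTC decoder.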
|
|
|
|
|
[1] DOKUZ Y, TUFEKCI Z. Mini-batch sample selection strategies for deep learning based speech recognition[J]. Applied Acoustics, DOI: 10.1016/j.apacoust.2020.107573.
|
[2] 鱼昆, 张绍阳, 侯佳正, 等. 语音识别及端到端技术现状及展望[J]. 计算机系统应用, 2021, 30(3): 14-23.
YU K, ZHANG S Y, HOU J Z, et al. Survey of speech recognition and end-to-end techniques[J]. Computer Systems & Applications, 2021, 30(3): 14-23. (in Chinese)
|
[3] 邓慧珍. 基于局部自注意力CTC的语音识别[D]. 哈尔滨: 黑龙江大学, 2021.
DENG H Z. Speech recognition based on local self-attention CTC[D]. Harbin: Heilongjiang University, 2021. (in Chinese)
|
[4] DAS A, LI J Y, ZHAO R, et al. Advancing connectionist temporal classification with attention modeling[C]∥Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 4769-4773.
|
[5] 杨威, 胡燕. 混合CTC/attention架构端到端带口音普通话识别[J]. 计算机应用研究, 2021, 38(3): 755-759.
YANG W, HU Y. Hybrid CTC/attention architecture for end-to-end multi-accent Mandarin speech recognition[J]. Application Research of Computers, 2021, 38(3): 755-759. (in Chinese)
|
[6] 谢旭康, 陈戈, 孙俊, 等. TCN-Transformer-CTC的端到端语音识别[J]. 计算机应用研究, 2022, 39(3): 699-703.
XIE X K, CHEN G, SUN J, et al. TCN-Transformer-CTC for end-to-end speech recognition[J]. Application Research of Computers, 2022, 39(3): 699-703. (in Chinese)
|
[7] GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2020: 5036-5040.
|
[8] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]∥Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4945-4949.
|
[9] BIADSY F, WEISS R J, MORENO P J, et al. Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. Lous Tourils, Baixas, France: International Speech Communication Association, 2019: 4115-4119.
|
[10] MA R, LIU Q, YU K. Highly efficient neural network language model compression using soft binarization training[C]∥Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 62-69.
|
[11] 葛轶洲, 许翔, 杨锁荣, 等. 序列数据的数据增强方法综述[J]. 计算机科学与探索, 2021, 15(7): 1207-1219.
GE Y Z, XU X, YANG S R, et al. Survey on sequence data augmentation[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1207-1219. (in Chinese)
|
[12] YAO Z Y, WU D, WANG X, et al. WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 2093-2097.
|
[13] 朱学超, 张飞, 高鹭, 等. 基于残差网络和门控卷积网络的语音识别研究[J]. 计算机工程与应用, 2022, 58(7): 185-191.
ZHU X C, ZHANG F, GAO L, et al. Research on speech recognition based on residual network and gated convolution network[J]. Computer Engineering and Applications, 2022, 58(7): 185-191. (in Chinese)
|
[14] LIANG C D, XU M L, ZHANG X L. Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 1495-1499.
|
|
|
|