Speech recognition method based on multitask loss with additional language model
1. School of Information Engineering, Chang′an University, Xi′an, Shaanxi 710064, China; 2. Operation Management Branch of Shaanxi Transportation Holding Group Co., Ltd., Xi′an, Shaanxi 710065, China
Abstract: To solve the problems that the Attention mechanism's overly flexible alignment adapts poorly to complex environments and that simple end-to-end models do not fully exploit language features, a speech recognition method based on multitask loss with an additional language model was investigated. By analyzing the characteristics of the speech signal, features containing more information were selected for training. Based on the Attention-based Conformer end-to-end model, the model was trained with a multitask loss in which a CTC loss assists the pure Conformer (Attention) model, yielding the Conformer-CTC speech recognition model. On top of the Conformer-CTC model, after analyzing and comparing the characteristics and effects of several language models, a Transformer language model was added to the training of the above model through a rescoring mechanism, yielding the Conformer-CTC-Transformer speech recognition model. Experiments on the above models were completed on the AISHELL-1 data set. The results show that, compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by 0.79% compared with the Conformer-CTC model. The adaptability of Attention alignment in complex environments can be improved by the CTC loss, and after rescoring the Conformer-CTC model with the Transformer language model, the recognition accuracy is increased by a further 0.30%. Compared with some existing end-to-end models, the Conformer-CTC-Transformer model achieves a better recognition effect, indicating that the model has certain effectiveness.
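The two mechanisms the abstract describes — a joint CTC/attention (multitask) training objective and language-model rescoring of decoder hypotheses — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the weight values (`ctc_weight`, `lm_weight`) are illustrative assumptions.

```python
def multitask_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Joint CTC/attention objective: L = w * L_ctc + (1 - w) * L_att.

    The CTC branch constrains alignment to be monotonic, which is what
    improves the Attention branch's robustness in complex environments.
    """
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def rescore(hypotheses, lm_score, lm_weight=0.5):
    """Re-rank n-best decoder hypotheses with an external language model.

    hypotheses: list of (text, acoustic_log_prob) pairs from the decoder.
    lm_score:   callable mapping text -> language-model log-probability.
    Returns the text whose combined score is highest.
    """
    rescored = [(text, am + lm_weight * lm_score(text))
                for text, am in hypotheses]
    return max(rescored, key=lambda pair: pair[1])[0]
```

For example, a hypothesis that the acoustic model slightly prefers can be overtaken by a more fluent one once the LM term is added, which is the effect the rescoring mechanism relies on.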
柳永利, 张绍阳, 王裕恒, 解熠. 基于多任务损失附加语言模型的语音识别方法[J]. 江苏大学学报(自然科学版), 2023, 44(5): 564-569.
LIU Yongli, ZHANG Shaoyang, WANG Yuheng, XIE Yi. Speech recognition method based on multitask loss with additional language model[J]. Journal of Jiangsu University (Natural Science Edition), 2023, 44(5): 564-569.
[1] DOKUZ Y, TUFEKCI Z. Mini-batch sample selection strategies for deep learning based speech recognition[J]. Applied Acoustics, DOI: 10.1016/j.apacoust.2020.107573.
[2] YU K, ZHANG S Y, HOU J Z, et al. Survey of speech recognition and end-to-end techniques[J]. Computer Systems & Applications, 2021, 30(3): 14-23. (in Chinese)
[3] DENG H Z. Speech recognition based on local self-attention CTC[D]. Harbin: Heilongjiang University, 2021. (in Chinese)
[4] DAS A, LI J Y, ZHAO R, et al. Advancing connectionist temporal classification with attention modeling[C]∥Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 4769-4773.
[5] YANG W, HU Y. Hybrid CTC/attention architecture for end-to-end multi-accent Mandarin speech recognition[J]. Application Research of Computers, 2021, 38(3): 755-759. (in Chinese)
[6] XIE X K, CHEN G, SUN J, et al. TCN-Transformer-CTC for end-to-end speech recognition[J]. Application Research of Computers, 2022, 39(3): 699-703. (in Chinese)
[7] GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2020: 5036-5040.
[8] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]∥Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4945-4949.
[9] BIADSY F, WEISS R J, MORENO P J, et al. Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. Lous Tourils, Baixas, France: International Speech Communication Association, 2019: 4115-4119.
[10] MA R, LIU Q, YU K. Highly efficient neural network language model compression using soft binarization training[C]∥Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 62-69.
[11] GE Y Z, XU X, YANG S R, et al. Survey on sequence data augmentation[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1207-1219. (in Chinese)
[12] YAO Z Y, WU D, WANG X, et al. WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 2093-2097.
[13] ZHU X C, ZHANG F, GAO L, et al. Research on speech recognition based on residual network and gated convolution network[J]. Computer Engineering and Applications, 2022, 58(7): 185-191. (in Chinese)
[14] LIANG C D, XU M L, ZHANG X L. Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 1495-1499.