|
|
Speech recognition method based on multitask loss with additional language model |
1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064, China; 2. Operation Management Branch of Shaanxi Transportation Holding Group Co., Ltd., Xi'an, Shaanxi 710065, China
|
|
Abstract To address the problems that the attention mechanism's overly flexible alignment adapts poorly to complex environments and that simple end-to-end models do not make full use of language features, a speech recognition method based on a multitask loss with an additional language model was investigated. By analyzing the characteristics of the speech signal, features containing more information were selected for training. Based on the attention-based Conformer end-to-end model, the model was trained with a multitask loss in which a CTC loss assists the pure Conformer (Attention) model, yielding the Conformer-CTC speech recognition model. Building on the Conformer-CTC model, and after analyzing and comparing the characteristics and effects of several language models, a Transformer language model was added to the training of the above model through a rescoring mechanism, yielding the Conformer-CTC-Transformer speech recognition model. Experiments on these models were carried out on the AISHELL-1 data set. The results show that, compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by a further 0.79% compared with the Conformer-CTC model. The adaptability of attention alignment in complex environments can be improved by the CTC loss, and after rescoring the Conformer-CTC model with the Transformer language model, the recognition accuracy increases by another 0.30%. Compared with some existing end-to-end models, the Conformer-CTC-Transformer model achieves better recognition results, indicating that the model has certain effectiveness.
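The two mechanisms described in the abstract, the hybrid CTC/attention multitask loss and language-model rescoring of an n-best list, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the CTC weight of 0.3, and the LM interpolation weight `alpha` are assumed values chosen for the example.

```python
def multitask_loss(l_ctc, l_att, ctc_weight=0.3):
    """Hybrid CTC/attention training loss: a weighted sum of the CTC loss
    and the attention decoder loss. The weight 0.3 is an assumed common
    choice, not a value reported in the paper."""
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att

def rescore_nbest(nbest, lm_score, alpha=0.5):
    """Re-rank an n-best list of (hypothesis, acoustic log-prob) pairs with
    an external language model: total = am_log_prob + alpha * lm_log_prob.
    Returns the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: h[1] + alpha * lm_score(h[0]))[0]

# Toy example: the LM prefers the grammatical hypothesis even though the
# acoustic model scored it slightly lower.
nbest = [("hello word", -3.0), ("hello world", -3.2)]
toy_lm = {"hello word": -5.0, "hello world": -1.0}
best = rescore_nbest(nbest, toy_lm.get)
```

In practice the LM score would come from a trained Transformer language model rather than a lookup table, and rescoring is applied to the n-best hypotheses produced by the Conformer-CTC decoder.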
|
|
|
|
|
[1] DOKUZ Y, TUFEKCI Z. Mini-batch sample selection strategies for deep learning based speech recognition[J]. Applied Acoustics, DOI: 10.1016/j.apacoust.2020.107573.
|
[2] 鱼昆, 张绍阳, 侯佳正, 等. 语音识别及端到端技术现状及展望[J]. 计算机系统应用, 2021, 30(3): 14-23.
YU K, ZHANG S Y, HOU J Z, et al. Survey of speech recognition and end-to-end techniques[J]. Computer Systems & Applications, 2021, 30(3): 14-23. (in Chinese)
|
[3] 邓慧珍. 基于局部自注意力CTC的语音识别[D]. 哈尔滨: 黑龙江大学, 2021.
DENG H Z. Speech recognition based on local self-attention CTC[D]. Harbin: Heilongjiang University, 2021. (in Chinese)
|
[4] DAS A, LI J Y, ZHAO R, et al. Advancing connectionist temporal classification with attention modeling[C]∥Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 4769-4773.
|
[5] 杨威, 胡燕. 混合CTC/attention架构端到端带口音普通话识别[J]. 计算机应用研究, 2021, 38(3): 755-759.
YANG W, HU Y. Hybrid CTC/attention architecture for end-to-end multi-accent Mandarin speech recognition[J]. Application Research of Computers, 2021, 38(3): 755-759. (in Chinese)
|
[6] 谢旭康, 陈戈, 孙俊, 等. TCN-Transformer-CTC的端到端语音识别[J]. 计算机应用研究, 2022, 39(3): 699-703.
XIE X K, CHEN G, SUN J, et al. TCN-Transformer-CTC for end-to-end speech recognition[J]. Application Research of Computers, 2022, 39(3): 699-703. (in Chinese)
|
[7] GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2020: 5036-5040.
|
[8] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]∥Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4945-4949.
|
[9] BIADSY F, WEISS R J, MORENO P J, et al. Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation[C]∥Proceedings of the Annual Conference of the International Speech Communication Association. Lous Tourils, Baixas, France: International Speech Communication Association, 2019: 4115-4119.
|
[10] MA R, LIU Q, YU K. Highly efficient neural network language model compression using soft binarization training[C]∥Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 62-69.
|
[11] 葛轶洲, 许翔, 杨锁荣, 等. 序列数据的数据增强方法综述[J]. 计算机科学与探索, 2021, 15(7): 1207-1219.
GE Y Z, XU X, YANG S R, et al. Survey on sequence data augmentation[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1207-1219. (in Chinese)
|
[12] YAO Z Y, WU D, WANG X, et al. WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 2093-2097.
|
[13] 朱学超, 张飞, 高鹭, 等. 基于残差网络和门控卷积网络的语音识别研究[J]. 计算机工程与应用, 2022, 58(7): 185-191.
ZHU X C, ZHANG F, GAO L, et al. Research on speech recognition based on residual network and gated convolution network[J]. Computer Engineering and Applications, 2022, 58(7): 185-191. (in Chinese)
|
[14] LIANG C D, XU M L, ZHANG X L. Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention[C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 1495-1499.
|
|
|
|