Deep CLAS: deep contextual listen, attend, and spell

Abstract

Contextual Listen, Attend and Spell (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without an explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose Deep CLAS to make fuller use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode the contextual information with a Conformer. Furthermore, the bias attention score is used directly to correct the model's output probability distribution. Additionally, a prefix tree is employed to prevent interference from irrelevant information. On the public AISHELL-1 dataset, Deep CLAS achieves a 65.78% relative increase in recall and a 53.49% relative increase in F1-score over the CLAS baselines in named entity recognition.
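The prefix-tree idea mentioned above can be illustrated with a minimal sketch. This is not the implementation used in Deep CLAS; the class name `PrefixTree`, the method `allowed_next`, and the example phrases are all hypothetical. The sketch assumes that bias phrases are stored character by character and that, given the characters decoded so far, only characters that continue some bias phrase are treated as relevant.

```python
from typing import Dict, Iterable, Set

END = "<end>"  # multi-character marker, so it cannot collide with a single character


class PrefixTree:
    """Character-level trie over bias phrases (hypothetical helper)."""

    def __init__(self, phrases: Iterable[str]) -> None:
        self.root: Dict[str, dict] = {}
        for phrase in phrases:
            node = self.root
            for ch in phrase:
                # Create the child node on first visit, then descend into it.
                node = node.setdefault(ch, {})
            node[END] = {}  # marks the end of a complete bias phrase

    def allowed_next(self, prefix: str) -> Set[str]:
        """Return the characters that can extend `prefix` toward a bias phrase."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return set()  # prefix matches no bias phrase: nothing is relevant
            node = node[ch]
        return {ch for ch in node if ch != END}


# Illustrative bias list (assumed, not taken from the paper).
tree = PrefixTree(["北京", "北海", "上海"])
candidates = tree.allowed_next("北")  # characters that may follow "北"
```

In such a setup, the set returned by `allowed_next` could be used to mask bias entries that cannot match the current decoding prefix, which is one way irrelevant contextual information might be kept from interfering.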
