A Study on Speech Recognition System for Spontaneous Speech

Atsuhiko KAI
Toyohashi University of Technology


Recent progress in speech recognition has come mainly from improved computational performance and from the introduction of statistical methods into acoustic and language modeling, supported by large databases. This progress has relaxed several restrictions on speech recognizers: from isolated word recognition to continuous speech recognition, from speaker-dependent to speaker-independent recognition, and from small- or medium-vocabulary to large-vocabulary recognition. However, such methods cannot be applied directly to spontaneous speech, since they usually assume read speech and utterances that follow a grammar for written language. For spontaneous speech, conventional methods based on a hierarchical architecture are expected to be limited in performance, since spontaneous speech involves ambiguous pronunciations as well as insertions of interjections, restarts, hesitations, and ellipses of postpositional particles (disfluencies and ill-formed constructs).

This thesis discusses an attempt to deal with spontaneous speech using a statistical acoustic model and pattern matching techniques. First, two algorithms that integrate speech verification and syntactic analysis are presented. The algorithms incorporate syntactic constraints into the speech verification process to find an optimal or sub-optimal solution to the pattern matching problem. Both methods assume that the syntactic knowledge is represented by a context-free grammar, which is suitable for describing natural language. The first method employs word spotting, which has also been used in our conventional system, based on the augmented continuous DP method. The second method is based on an optimal pattern matching algorithm, the One Pass DP method. The former achieves an efficient process in which the amount of word verification is on the order of the vocabulary size. The latter attempts to find an optimal solution with only a small increase in computation relative to the former, by employing beam search to derive a dynamic constraint on the search space and a pruning method to avoid verifying unlikely hypotheses.
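The frame-synchronous search with beam pruning described above can be illustrated by a minimal sketch. It is not the thesis's algorithm: the context-free grammar is replaced here by a simple word-successor table, features are one-dimensional, the local distance is an absolute difference, and all names (`one_pass_dp`, `templates`, `successors`, `start_words`, `beam`) are hypothetical.

```python
import math

def one_pass_dp(frames, templates, successors, start_words, beam=math.inf):
    """Frame-synchronous connected-word DP with beam pruning (illustrative only).

    frames      -- list of 1-D feature values (assumption: scalar features)
    templates   -- dict: word -> list of reference values, one per state
    successors  -- dict: word -> set of words allowed to follow it
                   (a finite-state stand-in for the grammar constraint)
    start_words -- words allowed at the start of the utterance
    beam        -- states worse than (best cost in frame + beam) are pruned
    Returns (accumulated distance, recognized word list).
    """
    INF = math.inf
    cost = {w: [INF] * len(t) for w, t in templates.items()}
    hist = {w: [()] * len(t) for w, t in templates.items()}
    for t, x in enumerate(frames):
        new_cost = {w: [INF] * len(tpl) for w, tpl in templates.items()}
        new_hist = {w: [()] * len(tpl) for w, tpl in templates.items()}
        for w, tpl in templates.items():
            # Best way to enter word w at this frame.
            if t == 0:
                entry = (0.0, ()) if w in start_words else (INF, ())
            else:
                entry = (INF, ())
                for v in templates:
                    if w in successors.get(v, ()) and cost[v][-1] < entry[0]:
                        entry = (cost[v][-1], hist[v][-1])
            for s, ref in enumerate(tpl):
                # Predecessors: stay in state s, or advance from s-1
                # (for s == 0, "advance" means entering the word).
                cands = [(cost[w][s], hist[w][s])]
                if s > 0:
                    cands.append((cost[w][s - 1], hist[w][s - 1]))
                else:
                    cands.append((entry[0], entry[1] + (w,)))
                best = min(cands, key=lambda c: c[0])
                if best[0] < INF:
                    new_cost[w][s] = best[0] + abs(x - ref)
                    new_hist[w][s] = best[1]
        # Beam pruning: discard hypotheses far from the best one in this frame.
        frame_best = min(min(c) for c in new_cost.values())
        for w in new_cost:
            for s, c in enumerate(new_cost[w]):
                if c > frame_best + beam:
                    new_cost[w][s] = INF
        cost, hist = new_cost, new_hist
    # The utterance must end in the final state of some word.
    best_w = min(templates, key=lambda w: cost[w][-1])
    return cost[best_w][-1], list(hist[best_w][-1])
```

With an infinite beam the search is exhaustive over the allowed word sequences; a finite beam trades the optimality guarantee for fewer active hypotheses per frame.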

This thesis also investigates a method for dealing with unknown, or out-of-vocabulary, words in continuous speech recognition. Such a method is also necessary for spontaneous speech, since extraneous speech such as interjections and restarts can be processed in the same way. In general, an out-of-vocabulary word can be represented by an arbitrary sequence of subword units if vocabulary words are constructed by concatenating subword-unit acoustic models. Out-of-vocabulary words are therefore detected using the likelihood ratio between the hypothesis corresponding to a registered word and the hypothesis corresponding to an out-of-vocabulary word. The approach is applied to our speech recognition system, and experimental results on test utterances that include out-of-vocabulary words and interjections show its effectiveness. The approach can also be applied to rejecting a sentence hypothesis output by the recognizer. To evaluate this rejection method objectively, simulation experiments on isolated word recognition are carried out, and the relationship between word recognition accuracy and correct rejection rate is reported.
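The likelihood-ratio test described above might be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: per-frame log-likelihoods for each subword unit are assumed to be given, the out-of-vocabulary hypothesis is modeled as an unconstrained loop over subword units with no transition penalties, the ratio is normalized by the number of frames, and the names `word_loglik`, `filler_loglik`, `is_out_of_vocabulary`, and `threshold` are hypothetical.

```python
import math

def word_loglik(frame_ll, unit_seq):
    """Best left-to-right alignment score of unit_seq against the frames.

    frame_ll[t][u] is the log-likelihood of frame t under subword unit u
    (assumed to be precomputed by some acoustic model).
    """
    NEG = -math.inf
    score = [NEG] * len(unit_seq)
    for t, ll in enumerate(frame_ll):
        new = [NEG] * len(unit_seq)
        for i, u in enumerate(unit_seq):
            if t == 0:
                prev = 0.0 if i == 0 else NEG  # must start in the first unit
            else:
                prev = max(score[i], score[i - 1] if i > 0 else NEG)
            if prev > NEG:
                new[i] = prev + ll[u]
        score = new
    return score[-1]  # must end in the last unit

def filler_loglik(frame_ll):
    """Arbitrary subword sequence: each frame may pick its best unit."""
    return sum(max(ll.values()) for ll in frame_ll)

def is_out_of_vocabulary(frame_ll, lexicon, threshold):
    """Compare the best registered word against the subword-loop hypothesis.

    lexicon maps each registered word to its subword unit sequence.
    Returns (oov_flag, best_word, per-frame log-likelihood ratio).
    """
    if not frame_ll:
        raise ValueError("at least one frame is required")
    scores = {w: word_loglik(frame_ll, units) for w, units in lexicon.items()}
    best_word = max(scores, key=scores.get)
    ratio = (scores[best_word] - filler_loglik(frame_ll)) / len(frame_ll)
    return ratio < threshold, best_word, ratio
```

Because the subword loop can always match at least as well as any constrained word model under these assumptions, the ratio is never positive; a segment is flagged as out-of-vocabulary when the best registered word falls too far below the loop score. The threshold would have to be tuned on held-out data.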

To deal with spontaneous speech, a speech recognition system must identify extraneous speech such as interjections and restarts, and must also parse ill-formed sentences involving inversion, ellipses of postpositional particles, and ungrammatical constructions. Although spontaneous speech exhibits acoustic and linguistic phenomena of different natures, the methods for speech processing and language processing have not been explicitly compared. This thesis therefore compares different search and parsing strategies by building several experimental systems, alongside a proposed One Pass method-based system that includes unknown word processing.