Text-To-Phoneme Mapping Using Neural Networks

Bilcu, Enikö Beatrice

Text-to-phoneme (TTP) mapping, also called grapheme-to-phoneme (GTP) conversion, is the process of transforming a written text into its corresponding phonetic transcription. Text-to-phoneme mapping is a necessary step in any state-of-the-art automatic speech recognition (ASR) or text-to-speech (TTS) system in which the textual information changes dynamically (e.g., new contact entries for name dialing, or new short messages or emails to be read out by a device). The implementation requirements of a text-to-phoneme mapping module differ significantly between the two kinds of system: in automatic speech recognition systems, errors of the text-to-phoneme mapping module are tolerated better (they lead to occasional recognition errors) than in text-to-speech applications, where their effect is immediately and always audible. Automatic speech recognition systems typically use text-to-phoneme mapping to lower the footprint (by avoiding storage of the lexicon) while maintaining quality. The use of text-to-phoneme mapping in text-to-speech systems is different. In addition to the phonetic information, text-to-speech systems need prosodic information to produce high-quality speech, and this information cannot be predicted by text-to-phoneme mapping. Most state-of-the-art text-to-speech systems use an explicit pronunciation lexicon that aims at the widest possible coverage, on the order of 100K words, with high-quality pronunciation information. For this reason, text-to-phoneme mapping is typically used as a fall-back strategy when the system encounters very rare or non-native words, and the quality of a text-to-speech system is indirectly affected by the quality of the grapheme-to-phoneme conversion. Another important issue is the question of training the text-to-phoneme mapping module.
The problem of grapheme-to-phoneme conversion is a static one, and such a system is trained off-line. The correspondence between the written and spoken forms of a language usually remains unchanged during the lifetime of an application, so the complexity and speed of model training are of secondary importance compared to, e.g., the speed of convergence or the model size. In this thesis, the problem of text-to-phoneme mapping using neural networks is studied. One of the main goals of the thesis is to provide a comprehensive analysis of different neural network structures that can be used to convert a written text into its corresponding phonetic transcription. Another important goal of this work is to provide new solutions that improve the performance of the existing algorithms in terms of convergence speed and phoneme accuracy. Three main neural network classes are studied in this thesis: the multilayer perceptron (MLP) neural network, the recurrent neural network (RNN) and the bidirectional recurrent neural network (BRNN). Due to their ability to adapt, neural networks have been shown to be a viable solution in applications that require modeling abilities. One such application is text-to-phoneme mapping, where the correspondence between the letters of a written text and their phonetic transcription must be modeled. One of the main concerns in all practical implementations where neural networks are used is to develop algorithms that provide fast convergence of the synaptic weights and, at the same time, good mapping performance. When a neural network is trained for text-to-phoneme mapping, one letter-phoneme pair is presented to the network at every iteration, so the number of training letters equals the number of training iterations. As a result, fast convergence of the neural network means that a smaller training dictionary suffices, since faster convergence is equivalent to fewer training letters being needed.
Fast convergence is important in applications where only a small linguistic database is available. Of course, one solution could be to use a small dictionary (with very few words) that is presented at the input of the neural network many times until the synaptic weights converge. In this case the training time becomes more important. Taking into account these two sides of the convergence speed (the size of the training dictionary and the processing time during training), one can understand the importance of having algorithms that ensure fast convergence of the neural network. It is well known that the error back-propagation algorithm, which is used to train the MLP neural network, sometimes converges quite slowly (a very large number of iterations is required to reach the stability point). In order to increase the convergence speed, two novel alternative solutions are proposed in this thesis: one uses an adaptive learning rate in the training process, and the other is a transform domain implementation of the multilayer perceptron neural network. The computational complexity of the two proposed training algorithms is slightly higher than that of the error back-propagation algorithm, but the number of training iterations is greatly reduced. Consequently, although the three algorithms might have the same training time, the novel algorithms require a smaller training dictionary. Because of the limited processing power usually encountered in real devices, another very important requirement for a text-to-phoneme mapping system is to have low computational and memory costs. In text-to-phoneme mapping systems based on neural networks, the computational complexity is mainly linked to the mathematical complexity of the training algorithm and to the number of synaptic weights of the neural network.
The memory load is determined by the number of synaptic weights of the neural network that must be stored. Taking into account all these limitations and implementation requirements, several neural network structures with different numbers of synaptic weights, trained with various training algorithms, are studied in this thesis. The modeling capability of the neural networks is also addressed; in the text-to-phoneme mapping case it translates into the phoneme accuracy. Different neural network structures, training algorithms and network complexities are analyzed from this point of view as well. We remark here that the encoding of the input letters plays a very important role in the phoneme accuracy of the grapheme-to-phoneme conversion system. This is why special attention has been paid to the comparative analysis of the performance (in terms of phoneme accuracy) obtained with several orthogonal and non-orthogonal encodings of the input letters. The thesis is structured into four main parts. Chapter 1 brings the reader into the world of text-to-phoneme mapping. In Chapter 2 several different neural network structures and their corresponding training algorithms are described, and two new training algorithms are introduced and analyzed. In Chapter 3 the experimental results obtained with the neural networks described in Chapter 2 are shown for the problem of monolingual text-to-phoneme mapping. Chapter 4 is dedicated to the problem of bilingual grapheme-to-phoneme conversion, and Chapter 5 concludes the thesis.
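
The distinction between orthogonal and non-orthogonal input letter encodings mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which codes the thesis compares, so the two shown here (a one-hot code and a compact binary code over the 26 lowercase English letters) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import string

# Assumption: a 26-letter lowercase alphabet; the thesis may use a different symbol set.
ALPHABET = string.ascii_lowercase


def one_hot(letter):
    """Orthogonal encoding: one input per letter, exactly one of them set to 1."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(letter)] = 1
    return vec


def binary_code(letter):
    """Non-orthogonal encoding: the letter index written as a fixed-width binary number."""
    n_bits = (len(ALPHABET) - 1).bit_length()  # 5 bits are enough for 26 letters
    idx = ALPHABET.index(letter)
    return [(idx >> b) & 1 for b in reversed(range(n_bits))]


def dot(u, v):
    """Inner product of two code vectors; 0 means the codes are orthogonal."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    print(dot(one_hot("b"), one_hot("d")))          # distinct one-hot codes never overlap
    print(dot(binary_code("b"), binary_code("d")))  # binary codes of distinct letters can overlap
```

Under these assumptions the one-hot code needs 26 network inputs per letter and the binary code only 5, so the choice of encoding affects the number of synaptic weights in the input layer as well as the phoneme accuracy.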
