Design and Implementation of a Finnish Unit Selection Speech Synthesizer

Silén, Hanna

Concatenative text-to-speech (TTS) synthesis is a method for generating artificial speech from an inventory of pre-recorded natural speech. The intelligibility of modern TTS systems is considered relatively high, so the current emphasis is on improving the naturalness of synthetic speech. Unit selection speech synthesis uses a large pre-recorded speech inventory that provides sufficient phonetic and prosodic coverage of the language. In synthesis, the best sequence of units, typically half-phones or diphones, is retrieved from the inventory and the units are concatenated. Unlike traditional concatenative TTS systems based on prosodic modification, unit selection requires no processing of the unit waveforms, which improves speech quality because the units are left unmodified. This thesis describes the full process of constructing the TUT_VOICE unit selection TTS synthesizer for Finnish. The work consisted of constructing a Finnish voice and implementing a synthesis engine. Voice construction included inventory design, recording of a speech inventory with a female voice, phonetic labeling of the speech, and feature extraction. The implementation included building a target unit sequence from the text to be synthesized, selecting the best sequence of speech units from the inventory, and creating the output speech waveform by concatenating the selected units. Synthesis quality met expectations: intelligibility was good, but quality varied from excellent to poor from sentence to sentence.
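
The step of selecting the best sequence of units can be illustrated with a minimal sketch. The abstract does not say how TUT_VOICE performs the search, so the following assumes a common formulation: each candidate unit has a target cost (mismatch with the desired unit) and each pair of adjacent units has a join cost (discontinuity at the concatenation point), and a Viterbi-style dynamic program finds the cheapest path. The function name `select_units` and the two cost callbacks are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target specification (hypothetical type)
U = TypeVar("U")  # candidate unit from the inventory (hypothetical type)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Return one candidate per target, minimizing target + join costs.

    Assumed formulation, not necessarily the one used in the thesis.
    """
    if len(targets) != len(candidates):
        raise ValueError("need one candidate list per target")
    if not targets:
        return []
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate")

    # best[i][j]: cost of the cheapest path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][j]: index of the predecessor on that cheapest path
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for u in candidates[i]:
            costs = [
                best[i - 1][k] + join_cost(prev, u)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row_cost.append(costs[k_min] + target_cost(targets[i], u))
            row_back.append(k_min)
        best.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path
```

In a toy setup where the join cost is zero for units that were adjacent in the recorded inventory, such a search tends to prefer longer contiguous stretches of recorded speech even when an isolated unit matches the target slightly better, which is one reason quality can vary between sentences.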