Speech synthesis refers to the artificial conversion of text into speech. It is needed in applications where speech is the most convenient modality of interaction with the user: for instance, hands-free user interfaces, voice-driven services, and applications targeted at blind users. Voice conversion, in turn, aims to make speech spoken by one speaker sound like that of a specific target speaker while preserving the content of the utterance. Beyond its uses in text-to-speech and dubbing, knowledge of how to separate speaker identity from speech content can also lead to improved results in speaker and speech recognition.

Corpus-based speech synthesis

At TUT, speech synthesis research is focused on corpus-based methods, i.e. methods that use real recorded speech data as a basis. The two most widely studied corpus-based synthesis approaches are unit selection and hidden Markov model (HMM) based synthesis. In concatenative unit selection, synthetic speech is formed by copying and pasting speech segments from a speech database. In HMM-based synthesis, we employ HMMs, traditionally used in speech recognition, to learn statistical models for the speech features (e.g. spectrum, pitch, and phone durations) of a speech database. The resulting models are then used in synthesis to generate artificial speech parameterizations.
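The unit-selection search can be sketched as a dynamic-programming problem: for each target position there are several candidate units in the database, and we pick the sequence minimising the sum of a target cost (how well a unit matches the desired specification) and a concatenation cost (how well consecutive units join). The sketch below is purely illustrative; the data and cost functions are hypothetical stand-ins for real acoustic features.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search over candidate units.
    targets: list of desired specifications, one per position.
    candidates: list of candidate-unit lists, one list per position."""
    n = len(targets)
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            # cheapest way to reach this candidate from any previous one
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Toy example: units are just numbers standing in for acoustic features.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 1.8], [1.9, 2.5], [3.1, 2.0]]
chosen = select_units(
    targets, candidates,
    target_cost=lambda t, c: abs(t - c),
    join_cost=lambda a, b: 0.1 * abs(a - b),
)
print(chosen)  # [0.9, 1.9, 3.1]
```

Real systems weight many sub-costs (spectral distance, pitch, duration, context) rather than a single scalar distance, but the search structure is the same.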

Hybrid-form synthesis combining unit selection and HMM-based synthesis

Both unit selection and HMM-based synthesis have their own benefits and challenges. The most recent approach, HMM-based unit selection, aims to combine the best of both: the smooth overall quality of HMM-based synthesis and the high segmental quality of unit selection. At TUT, we have made progress on HMM-based unit selection by employing multiform synthesis, in which poor-quality units are replaced using the underlying HMM-based approach.
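The multiform idea above can be sketched very simply: after unit selection, any unit whose cost exceeds a threshold is swapped for the HMM-generated parameters at that position. The data, threshold, and cost values below are hypothetical placeholders.

```python
def multiform(selected_units, unit_costs, hmm_generated, threshold):
    """Keep natural units where they match well; fall back to the
    statistical (HMM-generated) version where they do not."""
    return [
        unit if cost <= threshold else hmm
        for unit, cost, hmm in zip(selected_units, unit_costs, hmm_generated)
    ]

# Toy data: positions 1 and 3 have poorly matching natural units.
out = multiform(
    selected_units=["nat0", "nat1", "nat2", "nat3"],
    unit_costs=[0.2, 1.5, 0.3, 2.0],
    hmm_generated=["hmm0", "hmm1", "hmm2", "hmm3"],
    threshold=1.0,
)
print(out)  # ['nat0', 'hmm1', 'nat2', 'hmm3']
```

In practice the decision also has to keep joins between natural and generated segments smooth, which is where much of the research effort lies.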

Voice conversion

For a human, it is easy to distinguish speaker identity from the lexical content of speech: we can readily tell both who is speaking and what is being said. The aim of voice conversion is to make speech spoken by one speaker sound like that of a specific target speaker while preserving the content of the utterance. It can be used, for example, to create different synthesis voices for TTS systems or to generate multiple voices from a single speaker for dubbing purposes. In addition, knowledge of how to separate speaker identity from speech content can lead to improved results in speaker and speech recognition.
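A minimal way to illustrate voice conversion is to learn a mapping from time-aligned source-speaker frames to target-speaker frames. The sketch below fits a linear map y ≈ Ax + b by least squares on synthetic data; real systems use richer mappings (e.g. GMM- or neural-network-based) over actual spectral features, so everything here is an illustrative assumption.

```python
import numpy as np

def train_linear_mapping(src, tgt):
    """src, tgt: (n_frames, n_dims) time-aligned feature matrices."""
    # Append a bias column so the mapping includes an offset term.
    X = np.hstack([src, np.ones((len(src), 1))])
    W, *_ = np.linalg.lstsq(X, tgt, rcond=None)
    return W  # shape (n_dims + 1, n_dims)

def convert(src, W):
    """Apply the learned mapping to new source-speaker frames."""
    X = np.hstack([src, np.ones((len(src), 1))])
    return X @ W

# Toy example: the "target speaker" is the source scaled and shifted,
# so a linear map can recover the relation exactly.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
tgt = src * 2.0 + 0.5
W = train_linear_mapping(src, tgt)
converted = convert(src, W)
print(np.allclose(converted, tgt))  # True
```

The same train-then-convert structure carries over to more realistic models; only the mapping function changes.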