Singing Voice Recognition for Music Information Retrieval
This thesis proposes signal processing methods for analysis of singing voice audio signals, with the objectives of obtaining information about the identity and lyrics content of the singing. Two main topics are presented, singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is meant to be used for applications such as music classification, sorting and organizing music databases, music information retrieval, etc. For singer identification, the thesis introduces methods from general audio classification and specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along withmusical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification is centered around the degradation of classification performance in presence of instruments, and separation of the vocal line for improving performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signalto- noise (singing-to-accompaniment) ratios and the classification process was performed on the polyphonic mixture and on the vocal line separated from the polyphonic mixture. The method for classification including the step for separating the vocals is improving significantly the performance compared to classification of the polyphonic mixtures, but not close to the performance in classifying the monophonic singing itself. Nevertheless, the results show that classification of singing voices can be done robustly in polyphonic music when using source separation. In the problem of lyrics transcription, the thesis introduces the general speech recognition framework and various adjustments that can be done before applying themethods on singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two different ways of recognizing the phonemes in the audio: one is alignment, when the true transcription is known and the phonemes have to be located, other one is recognition, when both transcription and location of phonemes have to be found. The alignment is, obviously, a simplified form of the recognition task. Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. The word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for facilitating applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of the lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. When some key words in the query are recognized, the song can be reliably identified.