Text-independent speaker identification

Kuja-Halkola, Sami
Abstract

This thesis concerns the automatic recognition of a person from his or her voice. The main focus is on the text-independent speaker identification task. A few different feature extraction algorithms are presented and evaluated using the KING and POLYCOST speech corpora. The principal features used are the mel-frequency cepstral coefficients, but other feature sets are also considered, including the spectral slope and the fundamental frequency.
The work concentrates mostly on different classification algorithms used in speaker recognition. Most attention is given to Gaussian mixture models (GMMs) of speakers, which have been widely used in text-independent speaker recognition studies. Conventionally, the GMM parameters are trained with the well-known expectation-maximization (EM) algorithm. With this method, an identification accuracy of 95.8% is achieved for a population of 25 speakers with 60 seconds of training speech and a test sequence of 5 seconds.
A drawback of conventional EM training is that the number of mixture components, i.e. the model order, must be selected before the actual training procedure. Moreover, the number of components is usually the same for each speaker. The most important part of this thesis concerns simultaneous order selection and parameter training of GMMs. We evaluate a couple of recently proposed algorithms capable of selecting the order of each speaker model individually during a single training procedure. The algorithms provide a straightforward way of adjusting the number of components of each GMM to the acoustic characteristics and the amount of available training data for each speaker. The methods are based on integrating a model complexity criterion, such as the minimum description length, into the EM training process.
It is observed that when the amount of available training data varies between speakers, a relative reduction of 13% in error rate is obtained using an algorithm proposed by Figueiredo and Jain [Figueiredo02]. If the speech samples are recorded over a telephone line, a reduction of 11% in error rate is observed using the agglomerative EM algorithm [Figueiredo99].
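
As a rough illustration of the conventional baseline described above, the following sketch trains one diagonal-covariance GMM per speaker with plain EM and identifies a test sequence by the highest average log-likelihood. It is not the thesis implementation: the function names, the random initialization, the variance floor, and the synthetic 3-dimensional "features" are all assumptions made for the example. The `select_order` helper uses a brute-force search over candidate orders with an MDL-style penalty, which is a simpler stand-in for, and not the same as, the integrated algorithms of [Figueiredo02] and [Figueiredo99].

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    # Log density of each row of X (n, d) under each diagonal Gaussian (k, d) -> (n, k).
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(variances), axis=1)[None, :]
                   + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def train_gmm(X, k, n_iter=50, seed=0):
    # Plain EM for a k-component diagonal GMM (assumed initialization: random data points).
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    means = X[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
        resp = np.exp(log_p - logsumexp(log_p, axis=1)[:, None])
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = np.maximum((resp.T @ X ** 2) / nk[:, None] - means ** 2, 1e-6)
    return weights, means, variances

def avg_log_likelihood(X, gmm):
    # Average per-frame log-likelihood of X under the model.
    weights, means, variances = gmm
    log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
    return float(np.mean(logsumexp(log_p, axis=1)))

def identify(X, models):
    # Pick the speaker whose model gives the test frames the highest likelihood.
    return max(models, key=lambda name: avg_log_likelihood(X, models[name]))

def mdl(X, gmm):
    # MDL-style score: negative log-likelihood plus a parameter-count penalty.
    n, d = X.shape
    k = len(gmm[0])
    n_params = 2 * k * d + (k - 1)  # means, variances, free mixture weights
    return -n * avg_log_likelihood(X, gmm) + 0.5 * n_params * np.log(n)

def select_order(X, candidates):
    # Brute-force order selection: train each candidate order, keep the lowest score.
    fitted = [(k, train_gmm(X, k)) for k in candidates]
    return min(fitted, key=lambda item: mdl(X, item[1]))

# Synthetic stand-in for per-speaker feature vectors (not real speech data).
rng = np.random.default_rng(1)
train = {"spk_a": rng.normal(0.0, 1.0, size=(300, 3)),
         "spk_b": rng.normal(5.0, 1.0, size=(300, 3))}
models = {name: select_order(X, [1, 2, 3])[1] for name, X in train.items()}
print(identify(rng.normal(5.0, 1.0, size=(50, 3)), models))
```

Because each speaker's order is chosen separately here, the models can end up with different numbers of components, which is the property the abstract highlights; the cost of this brute-force variant is one full EM run per candidate order.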

Year:
2002