Multilabel Sound Event Classification with Neural Networks

Cakir, Emre

A real-life audio recording, collected for example on a busy street at rush hour, contains multiple sound events occurring simultaneously. These events may include traffic noise, rain, people talking, and so on. Humans are remarkably good at distinguishing the individual events, but as yet no machine can detect them with anything close to human accuracy. The polyphonic nature of environmental audio recordings makes it hard to detect single sound events when many events overlap. With the large audio databases and state-of-the-art machine learning methods of the digital age, this is bound to change. In this thesis, we use frequency-domain features to represent the audio input and multilabel deep neural networks (DNNs) to detect multiple, simultaneous sound events in real-life recordings. The features are extracted from the recordings in short time frames. DNNs are artificial neural networks (ANNs) with two or more hidden layers, and they are especially good at modeling highly nonlinear relations and at finding intermediate representations between system input and output. Real-life sound event detection requires exactly this kind of modeling. The features extracted from each frame serve as one training example, and the neural network is trained on these examples. For the evaluation, we focus on the performance of different DNN topologies in this task. A large number of hyperparameters define the structure and training of a DNN, such as the number of neurons in a layer, the number of hidden layers, and the learning rate. The effect of each of these parameters is investigated in detail. A detection accuracy of 66.5% is achieved, which outperforms the state-of-the-art method by a large margin.
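The multilabel setup described above can be illustrated with a minimal sketch. It is not the thesis implementation: the layer sizes, the sigmoid hidden activation, the 0.5 decision threshold, the random weights, and all function names are assumptions chosen for illustration. The point it shows is that a multilabel network has one independent sigmoid output per event class, so several events can be marked active in the same frame.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(frame_features, layers):
    """Propagate one frame's feature vector through all layers."""
    activations = frame_features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


def detect_events(probabilities, threshold=0.5):
    """Threshold each class independently; several classes may be active."""
    return [1 if p >= threshold else 0 for p in probabilities]


def binary_cross_entropy(probabilities, targets):
    """Mean per-class binary cross-entropy, a common multilabel loss."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probabilities, targets)) / len(targets)


# Hypothetical sizes: 40 features per frame, two hidden layers, 3 event classes.
rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_EVENTS = 40, 16, 3
layers = [init_layer(N_FEATURES, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_EVENTS, rng)]

frame = [rng.random() for _ in range(N_FEATURES)]  # stand-in for real features
probs = forward(frame, layers)
active = detect_events(probs)
loss = binary_cross_entropy(probs, [1, 0, 1])  # made-up target labels
```

Because each output is thresholded on its own rather than passed through a softmax, the class probabilities need not sum to one, which is what allows overlapping events to be detected in the same frame.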
