Harmonic/Percussive Separation On-line Demo

This page is an on-line demo of our recent research results on monaural harmonic/percussive source separation. A full presentation of the method and the results is given in our paper entitled

"Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery"

presented at the 16th IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). Get the BibTeX record here.


Code based on and . Don't hesitate to contact us!

Introduction

Our work is about separating the harmonic from the percussive instruments/components in a music mixture. That is, given a single-channel (i.e. monaural) musical mixture (i.e. a song), our method separates the percussive sounds from the harmonic sounds in the mixture. For example, in a band setup with guitar, bass, drums, percussion (e.g. conga), and vocals, our method separates the drums and the percussion from all the rest. Hence the name harmonic/percussive source separation (HPSS).

Not interested in the details and just want the demo or the results? Click here to go to the demonstration, or here to go to the results!


Proposed method

Our proposed method for HPSS is based on two of our previous works. One is the MaD TwinNet and the other is the Phase Unwrapping (PU) phase recovery algorithm.

For convenience, below you can find a brief introduction to the above-mentioned methods. Note that we offer code for both of our methods, and pre-trained weights where applicable, in order to help reproducibility.

So, feel free to use our methods, visit, star, and clone our GitHub repositories, and enjoy separating sources!

Below you can see an illustration of our proposed method for HPSS.

MaD TwinNet

MaD TwinNet is based on the Masker-Denoiser architecture, augmented with the Twin Network. Thus, "MaD" comes from "Masker-Denoiser" and "TwinNet" from the Twin Network. The role of MaD TwinNet in this work is to perform the separation of the percussive and the harmonic components. For a general presentation of the MaD TwinNet, you can check the corresponding paper and demo.

The Masker is the first component of MaD TwinNet and accepts as an input the magnitude spectrogram of the mixture. Then, the Masker predicts and applies a time-frequency mask to its input and outputs a first estimate of the magnitude spectrogram of the percussive components. This estimate of the percussive components is then given as an input to the Denoiser.
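As a minimal sketch of what "applying a time-frequency mask" means, the snippet below multiplies a magnitude spectrogram element-wise by a mask of the same shape. The function name `apply_mask` and the toy values are hypothetical and only for illustration; in MaD TwinNet the mask is predicted by the Masker network, which is not reproduced here.

```python
import numpy as np

def apply_mask(mixture_mag, mask):
    """Apply a time-frequency mask element-wise to a magnitude spectrogram."""
    if mixture_mag.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return mask * mixture_mag

# Toy example: 3 frequency bins x 2 time frames (values are made up).
mixture_mag = np.array([[1.0, 2.0],
                        [0.5, 4.0],
                        [3.0, 0.0]])
# Stand-in for the mask that the Masker would predict.
mask = np.array([[0.5, 0.0],
                 [1.0, 0.25],
                 [0.0, 1.0]])

percussive_first_estimate = apply_mask(mixture_mag, mask)
```

Each time-frequency bin of the mixture is scaled independently, so bins dominated by the target are kept and the others are attenuated.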

The Denoiser predicts and applies a time-frequency denoising filter to the estimated percussive components. This filter aims at removing interferences, artifacts, and (in general) any other noise introduced by the separation process from the Masker.

After the Denoiser applies the denoising filter, the cleaned estimate of the magnitude spectrogram of the percussive components can be used to estimate the harmonic components. This results in separated percussive and harmonic components.

The estimated harmonic components are then given as an input to the PU algorithm, which enhances them further by applying improved phase recovery techniques.

Illustration of the MaD TwinNet for HPSS
Illustration of the method for HPSS

Phase recovery

The most common approach when separating music signals using magnitude spectrograms is to reuse the phase of the mixture. This approach is equivalent to assuming that each time-frequency bin of the short-time Fourier transform (STFT) contains information from only one source. In a realistic scenario, such as the harmonic/percussive case, this assumption no longer holds, since the sources strongly overlap in time and frequency.
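The mixture-phase baseline described above can be sketched in a few lines: take the estimated magnitude of a source and attach the phase of the mixture STFT to it. The function name and the toy values below are assumptions made for illustration.

```python
import numpy as np

def combine_with_mixture_phase(source_mag, mixture_stft):
    """Build a complex STFT estimate from a source magnitude and the mixture phase."""
    mixture_phase = np.angle(mixture_stft)
    return source_mag * np.exp(1j * mixture_phase)

# Toy complex mixture STFT: 2 frequency bins x 2 time frames (made-up values).
mixture_stft = np.array([[1.0 + 1.0j, -2.0 + 1.0j],
                         [0.0 + 3.0j, 0.5 - 0.5j]])
# Stand-in for an estimated source magnitude spectrogram.
source_mag = np.array([[0.5, 1.0],
                       [2.0, 0.1]])

source_stft = combine_with_mixture_phase(source_mag, mixture_stft)
```

Every source estimated this way shares the same phase in every bin, which is exactly the limitation that a dedicated phase recovery step tries to overcome.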

The PU algorithm predicts the phase of the harmonic source by using a sinusoidal model. Then, starting from this initial estimate, an iterative procedure is applied to minimize the mixing error and yield the final source estimates. We applied the PU algorithm to the predictions of the harmonic components, in order to reduce the interference from the percussive sources.

The iterative process is illustrated in the image below, and more details can be found on the corresponding website.
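The two steps above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the released PU implementation: the per-bin normalized frequencies are simply passed in (how they are estimated is outside this sketch), the error is shared between sources with gains proportional to their squared magnitudes (a Wiener-like choice assumed here), and all function names and toy values are hypothetical.

```python
import numpy as np

def unwrap_phase_along_time(initial_phase, norm_freqs, hop, n_frames):
    """Sinusoidal-model phase: phase[f, t] = phase[f, t-1] + 2*pi*hop*norm_freqs[f].

    norm_freqs are frequencies in cycles per sample (assumed known here).
    """
    steps = 2.0 * np.pi * hop * norm_freqs[:, None] * np.arange(n_frames)[None, :]
    return initial_phase[:, None] + steps

def reduce_mixing_error(mixture_stft, magnitudes, init_phases, n_iter=50, eps=1e-12):
    """Iteratively adjust source phases so that the estimates sum to the mixture.

    magnitudes, init_phases: arrays of shape (n_sources, n_freq, n_frames).
    The magnitudes are kept fixed; only the phases change.
    """
    estimates = magnitudes * np.exp(1j * init_phases)
    # Assumed gains: share of each source in the total power of the bin.
    gains = magnitudes ** 2 / (np.sum(magnitudes ** 2, axis=0) + eps)
    for _ in range(n_iter):
        error = mixture_stft - estimates.sum(axis=0)       # mixing error per bin
        shifted = estimates + gains * error                # distribute the error
        estimates = magnitudes * shifted / (np.abs(shifted) + eps)  # restore magnitudes
    return estimates

# Toy example: 2 sources, 1 frequency bin, 1 frame (made-up values).
true_phases = np.array([0.3, 1.5]).reshape(2, 1, 1)
magnitudes = np.array([1.0, 2.0]).reshape(2, 1, 1)
mixture_stft = np.sum(magnitudes * np.exp(1j * true_phases), axis=0)

# Start from slightly wrong phases (stand-in for the sinusoidal-model prediction).
init_phases = true_phases + np.array([0.2, -0.2]).reshape(2, 1, 1)
initial_error = np.abs(mixture_stft - np.sum(magnitudes * np.exp(1j * init_phases), axis=0)).sum()

estimates = reduce_mixing_error(mixture_stft, magnitudes, init_phases)
final_error = np.abs(mixture_stft - estimates.sum(axis=0)).sum()
```

Note that if every source starts from the mixture phase, all estimates in a bin are collinear and this update leaves their phases unchanged, which is one reason a model-based initial phase is useful.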

The iterative scheme of PU-Iter when there are 2 complex numbers to be estimated.

Demonstration

Below you can listen to the output of our method! We have a set of songs, and for each one we offer the original mixture (i.e. the song), the original harmonic and percussive content, and the harmonic and percussive content as separated by our method and by the methods we compare against.

We provide the resulting audio from two different settings. These settings correspond to different sets of hyper-parameters for MaD TwinNet.

  • In Setting 1 we used the hyper-parameters as defined in the corresponding paper of MaD TwinNet.
  • In Setting 2 we altered them in order to match the settings where the PU algorithm performs better.

You can find more information on the exact hyper-parameters in our paper!

Note that we did not apply any kind of extra post-processing to the files. You will hear the actual, unprocessed output of our method.


Original mixture


Original harmonic content


Original percussive content

Song information
Artist Title Genre
Signe Jakobsen What Have You Done To Me Rock Singer-Songwriter


Predicted content - Setting 1
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Predicted content - Setting 2
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU


Original mixture


Original harmonic content


Original percussive content

Song information
Artist Title Genre
Fergessen Back From The Start Melodic Indie Rock


Predicted content - Setting 1
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Predicted content - Setting 2
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU


Original mixture


Original harmonic content


Original percussive content

Song information
Artist Title Genre
Sambasevam Shanmugam Kaathaadi Bollywood


Predicted content - Setting 1
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Predicted content - Setting 2
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU


Original mixture


Original harmonic content


Original percussive content

Song information
Artist Title Genre
James Elder & Mark M Thompson The English Actor Indie Pop


Predicted content - Setting 1
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Predicted content - Setting 2
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU


Original mixture


Original harmonic content


Original percussive content

Song information
Artist Title Genre
Leaf Come around Atmospheric Indie Pop


Predicted content - Setting 1
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Predicted content - Setting 2
Harmonic content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU

Percussive content

KAM


MaDTwinNet & mix phase


MaDTwinNet & PU


Data and objective results

This section describes the data our method learned from, the data it was tested on, and how well it performed from an objective perspective.

To benchmark our complete method (i.e. MaD TwinNet plus PU algorithm), we compared the obtained results against a typical method (kernel additive model, KAM) and against MaD TwinNet but using the phase of the mixture.

You can find information about the data in the "Dataset" section and information about the obtained objective results in the "Objective results" section.


Dataset

In order to train our method, we used the development subset of the Demixing Secrets Dataset (DSD), which consists of 50 mixtures with their corresponding sources, plus music stems from MedleyDB.

For testing our method, we used the testing subset of the DSD, consisting of 50 mixtures and their corresponding sources.


Objective results

We objectively evaluated our method using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The results can be seen in the table below.
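To give an intuition for what such a ratio measures, here is a simplified signal-to-distortion ratio in dB: the energy of the reference signal divided by the energy of the difference between reference and estimate. This is a deliberately reduced sketch and not the evaluation procedure used for the table; standard source separation toolkits additionally decompose the estimate into target, interference, and artifact terms to obtain SDR, SIR, and SAR. The function name and the toy signals are hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10*log10(||reference||^2 / ||reference - estimate||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distortion = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps))

# Toy example: the estimate is the reference scaled by 0.9,
# so the distortion has 1/100 of the reference energy.
reference = np.array([1.0, -1.0, 1.0, -1.0])
estimate = 0.9 * reference
sdr = simple_sdr_db(reference, estimate)
```

A higher value means the estimate is closer to the reference; in this toy case the energy ratio of 100 corresponds to about 20 dB.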

The objective evaluation results (in dB) of our method for Setting 1 and Setting 2, for the percussive components, the harmonic components, and on average.

| Setting | Method | Percussive SDR | Percussive SIR | Percussive SAR | Harmonic SDR | Harmonic SIR | Harmonic SAR | Average SDR | Average SIR | Average SAR |
|---|---|---|---|---|---|---|---|---|---|---|
| Setting 1 | KAM | 1.42 | 0.44 | 3.76 | 6.60 | 6.71 | 17.66 | 4.01 | 3.57 | 10.71 |
| Setting 1 | MaDTwinNet & mix phase | 3.35 | 4.65 | 6.10 | 8.62 | 14.22 | 10.75 | 5.99 | 9.44 | 8.43 |
| Setting 1 | MaDTwinNet & PU | 3.35 | 4.66 | 6.08 | 8.58 | 14.45 | 10.59 | 5.97 | 9.55 | 8.34 |
| Setting 2 | KAM | 0.98 | 5.03 | -1.17 | 6.35 | 6.58 | 18.51 | 3.66 | 5.80 | 8.67 |
| Setting 2 | MaDTwinNet & mix phase | 3.60 | 4.73 | 6.07 | 8.70 | 12.84 | 11.78 | 6.15 | 8.79 | 8.92 |
| Setting 2 | MaDTwinNet & PU | 3.59 | 4.76 | 6.00 | 8.69 | 13.11 | 11.57 | 6.14 | 8.94 | 8.78 |
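The "Average" columns appear to be the arithmetic mean of the percussive and harmonic values. The short check below, using the SDR values from the table, is consistent with that reading up to rounding; the variable names are ours and the 0.01 tolerance is an assumption that accounts for the two-decimal rounding of the published numbers.

```python
# (percussive SDR, harmonic SDR, reported average SDR), copied from the table.
sdr_rows = {
    "Setting 1, KAM": (1.42, 6.60, 4.01),
    "Setting 1, MaDTwinNet & mix phase": (3.35, 8.62, 5.99),
    "Setting 1, MaDTwinNet & PU": (3.35, 8.58, 5.97),
    "Setting 2, KAM": (0.98, 6.35, 3.66),
    "Setting 2, MaDTwinNet & mix phase": (3.60, 8.70, 6.15),
    "Setting 2, MaDTwinNet & PU": (3.59, 8.69, 6.14),
}

# Largest gap between the recomputed mean and the reported average.
max_gap = max(
    abs((percussive + harmonic) / 2.0 - reported)
    for percussive, harmonic, reported in sdr_rows.values()
)
consistent_with_mean = max_gap < 0.01
```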

Acknowledgements

We would like to kindly acknowledge all those who supported and helped us with this work.