Transfer Learning for Improving Singing Voice Detection in Polyphonic Instrumental Music

1. Some synthesized audio clip samples from the source task
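The synthesis procedure itself is described in the paper rather than here; purely as an illustration of how such source-task clips could be produced, the sketch below mixes a speech recording with an instrumental track at a chosen signal-to-noise ratio. The file names and the 0 dB SNR are placeholders, not the repository's actual settings.

```python
import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(speech, background, snr_db):
    """Scale `background` so the speech-to-background power ratio equals snr_db."""
    speech_power = np.mean(speech ** 2)
    bg_power = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(speech_power / (bg_power * 10 ** (snr_db / 10.0)))
    mixture = speech + gain * background
    return mixture / (np.max(np.abs(mixture)) + 1e-12)  # normalize to avoid clipping

sr = 22050
speech, _ = librosa.load("speech_clip.wav", sr=sr)       # hypothetical input paths
inst, _ = librosa.load("instrumental_clip.wav", sr=sr)
n = min(len(speech), len(inst))                          # trim to a common length
mix = mix_at_snr(speech[:n], inst[:n], snr_db=0.0)
sf.write("synthesized_clip.wav", mix, sr)
```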

2. Some polyphonic music samples from the target task

3. Patterns of different filters in Li. In each subgraph, the x-axis is time (T) and the y-axis is frequency (F).

For more details on the filter patterns learned in the source task, please visit: https://github.com/moses1994/singing-voice-detection/tree/master/3-filter-patterns-learned-in-source-task

For more details on the filter patterns learned in the target task, please visit: https://github.com/moses1994/singing-voice-detection/tree/master/3-filter-patterns-learned-in-target-task
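As a hint of how figures like these can be reproduced, here is a minimal sketch that loads a trained Keras checkpoint and plots each first-layer convolution kernel as a small time-frequency patch. The checkpoint path, layer name, and input layout are assumptions; the repository's own scripts may differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model

# Hypothetical checkpoint and layer name; adapt to the repository's files.
model = load_model("crnn_source_task.h5")
kernels = model.get_layer("conv2d_1").get_weights()[0]  # (kh, kw, in_ch, n_filters)

n_filters = kernels.shape[-1]
cols = 8
rows = int(np.ceil(n_filters / cols))
fig, axes = plt.subplots(rows, cols, figsize=(1.5 * cols, 1.5 * rows))
for i, ax in enumerate(np.atleast_1d(axes).flat):
    if i < n_filters:
        # Show each kernel as a patch: x-axis time (T), y-axis frequency (F),
        # assuming the input layout is (frequency, time, channel).
        ax.imshow(kernels[:, :, 0, i], origin="lower", aspect="auto", cmap="viridis")
    ax.axis("off")
fig.savefig("filter_patterns_source_task.png", dpi=150)
```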

4. Some internal representations of feature maps for audio clips

To gain deeper insight into the proposed method, we visualized the internal representations of the fixed and the fine-tuned convolutional layers for the same song clip in the target task. From top to bottom, each graph shows: 1) the log mel spectrogram of the audio clip; 2) the output of Li in Fine-tuning mode in the target task; 3) the output of Li in Fixed mode in the target task.

Due to space limitations, we show the internal representations of only one feature map, randomly selected from each layer's feature maps. In these figures, L1 learns more obvious basic local features of the input spectrogram than L2 and L3. The fine-tuned L1, whose representations of high-frequency harmonics are abundant, learns more high-frequency harmonic components than the fixed one. Overall, the fine-tuned and retrained L1 preserves the features of the input audio clip better and matches the target domain more closely.
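A minimal sketch of how such three-row comparisons can be generated, assuming Keras checkpoints for the fine-tuned and fixed models and librosa for the log mel spectrogram; the paths, layer name, and input layout are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
import librosa
from tensorflow.keras.models import load_model, Model

# Hypothetical paths and layer name; adapt to the actual repository files.
clip, sr = librosa.load("target_task_clip.wav", sr=22050)
logmel = librosa.power_to_db(librosa.feature.melspectrogram(y=clip, sr=sr, n_mels=80))

def layer_output(model_path, layer_name, x):
    """Return the activations of one convolutional layer for input x."""
    model = load_model(model_path)
    probe = Model(model.input, model.get_layer(layer_name).output)
    return probe.predict(x)

x = logmel.T[np.newaxis, ..., np.newaxis]  # (1, time, mels, 1), assumed input layout
fine = layer_output("crnn_finetuned.h5", "conv2d_1", x)[0, ..., 0]
fixed = layer_output("crnn_fixed.h5", "conv2d_1", x)[0, ..., 0]

fig, axes = plt.subplots(3, 1, figsize=(8, 9))
panels = [(logmel, "log mel spectrogram"),
          (fine.T, "output of L1, Fine-tuning mode"),
          (fixed.T, "output of L1, Fixed mode")]
for ax, (img, title) in zip(axes, panels):
    ax.imshow(img, origin="lower", aspect="auto")
    ax.set_title(title)
fig.tight_layout()
fig.savefig("feature_map_comparison.png", dpi=150)
```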

For more samples of the internal representations learned in the target task, please visit: https://github.com/moses1994/singing-voice-detection/tree/master/4-internal-representations-of-feature-maps

Polyphonic music clip 1

Polyphonic music clip 2

5. Final singing-voice detection results for some polyphonic songs

In these videos, the black vertical line in the middle marks the current playback position, the blue line shows the raw audio waveform, the yellow blocks mark singing-voice activity, and the gray blocks mark non-singing regions (see the plotting sketch after the song list below).

Song 1: Gaosuwo

Song 2: Haoxianghaoxiang

Song 3: Duolaimi

Song 4: Mote
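The videos themselves were rendered by the authors; as a rough static equivalent of the layout described above, the sketch below shades singing segments yellow and non-singing regions gray beneath the waveform. The segment boundaries here are invented for illustration, e.g. merged from the model's frame-level output.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

clip, sr = librosa.load("Mote.wav", sr=None)          # any of the four songs
duration = len(clip) / sr
t = np.arange(len(clip)) / sr

# Hypothetical (start, end) singing segments in seconds.
singing_segments = [(12.3, 45.8), (50.1, 92.7)]

fig, ax = plt.subplots(figsize=(12, 3))
ax.axvspan(0, duration, color="lightgray")            # non-singing background
for start, end in singing_segments:                   # singing-voice activity
    ax.axvspan(start, end, color="gold")
ax.plot(t, clip, color="blue", linewidth=0.3)         # raw audio waveform
ax.set_xlim(0, duration)
ax.set_xlabel("time (s)")
fig.savefig("detection_result.png", dpi=150)
```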

It is noteworthy that, after transfer learning, the CRNN model in the target task detected the singer's "thanks" to the concert audience at 4:39 in the fourth song, "Mote.wav", whereas the baseline system trained on the target task did not detect it. This may be because the speech data in the source task contains voice samples related to "thank you", and the CRNN model in the target task absorbed this knowledge through transfer learning.

6. The results of two different transfer modes and the frame-level detection results

Table 1: The results of two different transfer modes.

Table 2: The frame-level detection results.
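For reference, a minimal sketch of the two transfer modes compared in Table 1, assuming a Keras CRNN pre-trained on the source task: in Fixed mode the convolutional layers are frozen before target-task training, while in Fine-tuning mode they remain trainable. The checkpoint name and layer-name matching are assumptions.

```python
from tensorflow.keras.models import load_model

def prepare_transfer_model(mode):
    """Load the source-task CRNN and configure its conv layers per transfer mode."""
    model = load_model("crnn_source_task.h5")   # hypothetical checkpoint name
    for layer in model.layers:
        if "conv" in layer.name:
            # Fixed mode: conv layers keep their source-task weights unchanged.
            # Fine-tuning mode: conv layers are also updated on the target task.
            layer.trainable = (mode == "fine-tuning")
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

fixed_model = prepare_transfer_model("fixed")
finetuned_model = prepare_transfer_model("fine-tuning")
# Both variants are then trained on the target-task polyphonic data; only the
# fine-tuning variant keeps adapting its convolutional filters.
```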