SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMPONENT ANALYSIS

Po-Sen Huang, Scott Deeann Chen, Paris Smaragdis, Mark Hasegawa-Johnson

University of Illinois at Urbana-Champaign
Department of Electrical and Computer Engineering
405 North Mathews Avenue, Urbana, IL 61801 USA
{huang146, chen124, paris, jhasegaw}@illinois.edu



ABSTRACT

Separating singing voices from music accompaniment is an important task in many applications, such as music information retrieval and lyric recognition and alignment. Music accompaniment can be assumed to lie in a low-rank subspace because of its repetitive structure; singing voices, on the other hand, can be regarded as relatively sparse within songs. In this paper, based on this assumption, we propose using robust principal component analysis for separating singing voices from music accompaniment. Moreover, we examine the separation results obtained with a binary time-frequency masking method. Evaluations on the MIR-1K dataset show that this method achieves around 1∼1.4 dB higher GNSDR than two state-of-the-art approaches, without using prior training or requiring particular features.

Index Terms— Robust Principal Component Analysis, Music/Voice Separation, Time-Frequency Masking

[Figure 1 here] Fig. 1. Proposed framework: Signal → STFT → Robust PCA → low-rank matrix L and sparse matrix S → time-frequency masking → ISTFT → evaluation.

1. INTRODUCTION

A singing voice provides useful information about a song, as it conveys the singer, the lyrics, and the emotion of the song. Many applications use this information, for example lyric recognition [1] and alignment [2], singer identification [3], and music information retrieval [4]. However, these applications encounter problems when music accompaniment is present, since the accompaniment acts as noise or interference with respect to the singing voice. An automatic singing-voice separation system is used to attenuate or remove the music accompaniment.

The human auditory system has an extraordinary ability to separate singing voices from background music accompaniment. Although this task is effortless for humans, it is difficult for machines. In particular, when spatial cues acquired from two or more microphones are not available, monaural singing-voice separation becomes very challenging.

Previous work on singing-voice separation can be classified into two categories: (1) supervised systems, which usually first map signals onto a feature space, then detect singing-voice segments, and finally apply source separation techniques such as non-negative matrix factorization [5], adaptive Bayesian modeling [6], and pitch-based inference [7, 8]; and (2) unsupervised systems, which require no prior training or particular features, such as the source/filter model [9] and the autocorrelation-based method [10].

In this paper, we propose to model the accompaniment based on the idea that repetition is a core principle in music [11]; we can therefore assume that the music accompaniment lies in a low-rank subspace. The singing voice, on the other hand, has more variation and is relatively sparse within a song. Based on these assumptions, we propose to use Robust Principal Component Analysis (RPCA) [12], a matrix factorization algorithm that recovers underlying low-rank and sparse matrices.

The organization of this paper is as follows: Section 2 introduces RPCA. Section 3 discusses binary time-frequency masks for source separation. Section 4 presents experimental results on the MIR-1K dataset. We conclude the paper in Section 5.

(This research was supported in part by U.S. ARL and ARO under grant number W911NF-09-1-0383. The authors thank Chao-Ling Hsu [8] and Zafar Rafii [10] for detailed discussions of their experiments, and Dr. Yi Ma [12, 13] and Arvind Ganesh for discussions on robust principal component analysis.)




2. ROBUST PRINCIPAL COMPONENT ANALYSIS

Candès et al. [12] proposed RPCA, a convex program for recovering low-rank matrices when a fraction of their entries have been corrupted by errors, i.e., when the error matrix is sufficiently sparse. Their approach, Principal Component Pursuit, solves the following convex optimization problem:

    \min_{L,S} \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = M

where M, L, S ∈ R^{n1×n2}. \|\cdot\|_* and \|\cdot\|_1 denote the nuclear norm (sum of singular values) and the L1-norm (sum of absolute values of matrix entries), respectively, and λ > 0 trades off the rank of L against the sparsity of S. As suggested in [12], λ = 1/√max(n1, n2) is a good rule of thumb, which can then be adjusted slightly to obtain the best possible result. We explore λ_k = k/√max(n1, n2) for different values of k, thus testing different trade-offs between the rank of L and the sparsity of S.
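For concreteness, the rule-of-thumb parameter and its scaled variants are straightforward to compute; a minimal sketch, assuming a hypothetical 513 × 500 magnitude spectrogram (the dimensions are ours, not the paper's):

```python
import numpy as np

# Hypothetical spectrogram size: n1 frequency bins x n2 time frames.
n1, n2 = 513, 500

# Rule-of-thumb trade-off parameter from [12]: lambda = 1 / sqrt(max(n1, n2)).
lam = 1.0 / np.sqrt(max(n1, n2))          # ~0.0447 for n2 = 500

# Scaled variants lambda_k = k / sqrt(max(n1, n2)) explored in Section 4.3.
lam_k = {k: k / np.sqrt(max(n1, n2)) for k in (0.1, 0.5, 1.0, 1.5, 2.0, 2.5)}
```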
Since musical instruments can reproduce the same sounds each time they are played, and music in general has an underlying repeating structure, we can regard music as a low-rank signal. Singing voices, on the contrary, have more variation (higher rank) but are relatively sparse in the time and frequency domains. We can therefore regard singing voices as the components making up the sparse matrix. With RPCA, we expect the low-rank matrix L to contain the music accompaniment and the sparse matrix S to contain the vocal signals.

We perform the separation as follows. First, we compute the spectrogram of the music signal as a matrix M, calculated with the Short-Time Fourier Transform (STFT). Second, we use the inexact Augmented Lagrange Multiplier (ALM) method [13], an efficient algorithm for solving the RPCA problem, to solve L + S = |M|, given the magnitude of M as input. RPCA then yields the two output matrices L and S. In the example spectrograms of Figure 2, we can observe formant structures in the sparse matrix S, which indicate vocal activity, and musical notes in the low-rank matrix L.

[Figure 2 here] Fig. 2. Example RPCA results for yifen_2_01 at SNR = 5 for (a) the original matrix M, (b) the low-rank matrix L, and (c) the sparse matrix S.
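The authors' own implementation is linked at the end of this section. The following is a rough, self-contained sketch of the inexact ALM algorithm of [13] (singular-value thresholding for L, entrywise soft thresholding for S); the initialization and the parameters mu and rho follow common defaults for this algorithm and are assumptions here, not necessarily the paper's exact settings:

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft thresholding (proximal operator of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_inexact_alm(M, lam=None, tol=1e-7, max_iter=500):
    """Split M into low-rank L and sparse S: min ||L||_* + lam * ||S||_1."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))          # rule of thumb from [12]
    norm_fro = np.linalg.norm(M, 'fro')
    two_norm = np.linalg.norm(M, 2)               # largest singular value
    Y = M / max(two_norm, np.abs(M).max() / lam)  # dual variable, as in [13]
    mu = 1.25 / two_norm                          # penalty parameter
    rho = 1.5                                     # penalty growth factor
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-update: singular-value thresholding of (M - S + Y/mu).
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # S-update: entrywise soft thresholding.
        S = shrink(M - L + Y / mu, lam / mu)
        # Dual ascent on the constraint L + S = M, then increase mu.
        residual = M - L - S
        Y = Y + mu * residual
        mu = rho * mu
        if np.linalg.norm(residual, 'fro') < tol * norm_fro:
            break
    return L, S
```

Applied to the magnitude spectrogram |M| of a mixture, L is expected to capture the accompaniment and S the vocals.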
Note that in order to obtain waveforms of the estimated components, we record the phase of the original signal, P = phase(M), reattach it to the matrices L and S via L(m, n)e^{jP(m,n)} and S(m, n)e^{jP(m,n)} for m = 1, ..., n1 and n = 1, ..., n2, and compute the inverse STFT (ISTFT). The source code and sound examples are available at http://mickey.ifp.uiuc.edu/wiki/Software.
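A sketch of this resynthesis step, assuming SciPy's STFT conventions, the analysis parameters given later in Section 4.3, and the rpca_inexact_alm helper sketched above:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                        # MIR-1K sample rate
win, hop = 1024, 256              # analysis parameters from Section 4.3

# x: the mixture waveform (1-D numpy array), assumed already loaded.
_, _, M = stft(x, fs=fs, nperseg=win, noverlap=win - hop)

L, S = rpca_inexact_alm(np.abs(M))   # decompose the magnitude spectrogram

P = np.angle(M)                      # phase of the original signal
_, music = istft(L * np.exp(1j * P), fs=fs, nperseg=win, noverlap=win - hop)
_, voice = istft(S * np.exp(1j * P), fs=fs, nperseg=win, noverlap=win - hop)
```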
3. TIME-FREQUENCY MASKING

Given the low-rank matrix L and the sparse matrix S produced by the separation, we can further apply binary time-frequency masking to improve the separation results. We define the binary time-frequency mask M_b as follows:

    M_b(m,n) = \begin{cases} 1, & |S(m,n)| > \mathrm{gain} \cdot |L(m,n)| \\ 0, & \text{otherwise} \end{cases}    (1)

for all m = 1, ..., n1 and n = 1, ..., n2.

Once the time-frequency mask M_b is computed, it is applied to the original STFT matrix M to obtain the separated matrices X_singing and X_music, as shown in Equation (2):

    X_{\mathrm{singing}}(m,n) = M_b(m,n)\,M(m,n), \qquad X_{\mathrm{music}}(m,n) = (1 - M_b(m,n))\,M(m,n)    (2)

for all m = 1, ..., n1 and n = 1, ..., n2.

To examine the effectiveness of the binary mask, we also assign X_singing = S and X_music = L directly, as the case with no mask.
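Equations (1) and (2) translate directly into array operations; a minimal sketch (the function name is ours), operating on the complex STFT matrix M:

```python
import numpy as np

def binary_mask_separate(M, L, S, gain=1.0):
    """Apply the binary time-frequency mask of Eq. (1) to M, as in Eq. (2)."""
    Mb = np.abs(S) > gain * np.abs(L)    # Eq. (1): 1 where the vocals dominate
    X_singing = Mb * M                   # Eq. (2)
    X_music = (~Mb) * M
    return X_singing, X_music
```

In the no-mask case, the magnitudes S and L are instead combined with the mixture phase directly, as in Section 2.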



4. EXPERIMENTAL RESULTS

4.1. Dataset

We evaluate our system on the MIR-1K dataset¹. It contains 1000 song clips encoded at a sample rate of 16 kHz, with durations from 4 to 13 seconds. The clips were extracted from 110 Chinese karaoke pop songs performed by both male and female amateurs. The dataset includes manual annotations of the pitch contours, the lyrics, the indices and types of unvoiced frames, and the indices of the vocal and non-vocal frames.

¹ https://sites.google.com/site/unvoicedsoundseparation/mir-1k

4.2. Evaluation

Following the evaluation framework in [8, 10], we create three sets of mixtures using the 1000 clips of the MIR-1K dataset. For each clip, the singing voice and the music accompaniment are mixed at -5, 0, and 5 dB SNR. Zero dB indicates that the singing voice and the music are at the same energy level, negative values indicate that the energy of the music accompaniment is larger than that of the singing voice, and so on.
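Such mixtures can be created by scaling the accompaniment to hit the target SNR. A minimal sketch, assuming equal-length voice and accompaniment waveforms (the helper name is ours):

```python
import numpy as np

def mix_at_snr(voice, music, snr_db):
    """Scale the accompaniment so that 10*log10(P_voice / P_music) = snr_db."""
    p_voice = np.sum(voice.astype(float) ** 2)
    p_music = np.sum(music.astype(float) ** 2)
    # Required music power is p_voice / 10^(snr_db / 10).
    scale = np.sqrt(p_voice / (p_music * 10.0 ** (snr_db / 10.0)))
    return voice + scale * music

# The three evaluation conditions used in the paper:
# mixtures = {snr: mix_at_snr(voice, music, snr) for snr in (-5, 0, 5)}
```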
For the source separation evaluation, in addition to the Global Normalized Source-to-Distortion Ratio (GNSDR) used in [8, 10], we also report performance in terms of the Source-to-Interference Ratio (SIR), the Source-to-Artifacts Ratio (SAR), and the Source-to-Distortion Ratio (SDR), computed with the BSS-EVAL metrics [14]. The Normalized SDR (NSDR) is defined as

    \mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v)    (3)

where \hat{v} is the resynthesized singing voice, v is the original clean singing voice, and x is the mixture. The NSDR estimates the improvement in SDR from the unprocessed mixture x to the separated singing voice \hat{v}. The GNSDR is the mean of the NSDRs over all mixtures of each set, weighted by their lengths:

    \mathrm{GNSDR}(\hat{v}, v, x) = \frac{\sum_{n=1}^{N} w_n\,\mathrm{NSDR}(\hat{v}_n, v_n, x_n)}{\sum_{n=1}^{N} w_n}    (4)

where n is the index of a song, N is the total number of songs, and w_n is the length of the nth song. Higher values of SDR, SAR, SIR, and GNSDR represent better separation quality².

² The suppression of noise is reflected in SIR. The artifacts introduced by the separation process are reflected in SAR. The overall performance is reflected in SDR.
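Equations (3) and (4) can then be computed per clip and aggregated. A sketch assuming the mir_eval package's BSS-EVAL implementation for the SDR term (the paper used the BSS-EVAL toolbox [14]; mir_eval is a substitute here):

```python
import numpy as np
import mir_eval

def nsdr(voice_est, voice_ref, mixture):
    """Eq. (3): NSDR = SDR(estimate, reference) - SDR(mixture, reference)."""
    sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(
        voice_ref[np.newaxis, :], voice_est[np.newaxis, :])
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
        voice_ref[np.newaxis, :], mixture[np.newaxis, :])
    return sdr_est[0] - sdr_mix[0]

def gnsdr(nsdr_values, lengths):
    """Eq. (4): length-weighted mean of the per-clip NSDRs."""
    w = np.asarray(lengths, dtype=float)
    return float(np.sum(w * np.asarray(nsdr_values)) / np.sum(w))
```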
4.3. Experiments

In the separation process, the spectrogram of each mixture is computed with a window size of 1024 and a hop size of 256 (at Fs = 16,000 Hz). Three experiments were run: (1) an evaluation of the effect of λ_k in controlling low-rankness and sparsity, (2) an evaluation of the effect of the gain factor with a binary mask, and (3) a comparison of our results with the previous literature [8, 10] in terms of GNSDR.

(1) The effect of λ_k. The value λ_k = k/√max(n1, n2) trades off the rank of L against the sparsity of S: the matrix S is sparser for higher λ_k, and vice versa. Intuitively, for the source separation problem, if the matrix S is sparser, it contains less interference; however, the deletion of original signal components may also introduce artifacts. On the other hand, if S is less sparse, the signal contains fewer artifacts, but there is more interference from the other sources present in S. Experimental results (Figure 3) show these trends both for the no-mask case and for the binary-mask case (gain = 1) at different SNR values.

[Figure 3 here] Fig. 3. Comparison between the cases using (a) no mask and (b) a binary mask at SNR = {-5, 0, 5}, k (of λ_k) = {0.1, 0.5, 1, 1.5, 2, 2.5}, and gain = 1.

(2) The effect of the gain factor with a binary mask. The gain factor adjusts the energy balance between the sparse matrix and the low-rank matrix. As shown in Figure 4, we examine the gain factors {0.1, 0.5, 1, 1.5, 2} at λ_1 with SNR = {-5, 0, 5}. Similar to the effect of λ_k, a higher gain factor yields a lower-power sparse matrix S; hence there is less interference but more artifacts at high gain, and vice versa.

[Figure 4 here] Fig. 4. Comparison between various gain factors using a binary mask with λ_1.
(3) Comparison with previous systems. The observations above show that moderate values of λ_k and the gain factor balance the separation results in terms of SDR, SAR, and SIR. We therefore choose λ_1 (also suggested in [12]) and gain = 1 to compare with the previous literature on singing-voice separation in terms of GNSDR on the MIR-1K dataset [8, 10].

Hsu and Jang [8] performed singing-voice separation with a pitch-based inference method on the MIR-1K dataset. Their method combines the singing-voice separation method of [7], the separation of unvoiced singing-voice frames, and a spectral subtraction method. Rafii and Pardo [10] proposed a singing-voice separation method that extracts the repeating musical structure estimated by the autocorrelation function.



As shown in Figure 5, both our approach with a binary mask and our approach without a mask achieve higher GNSDR than the previous approaches [8, 10], with the no-mask approach providing the best overall results. These methods are also compared against an ideal case in which an ideal binary mask gives the upper-bound performance of the singing-voice separation task for algorithms based on binary masking.

[Figure 5 here] Fig. 5. Comparison between Hsu [8], Rafii [10], the binary mask, the no mask, and the ideal binary mask cases in terms of GNSDR.

5. CONCLUSION AND FUTURE WORK

In this paper, we proposed an unsupervised approach that applies robust principal component analysis to singing-voice separation from music accompaniment in monaural recordings. We also examined the parameter λ_k in RPCA and the gain factor of the binary mask in detail. Without using prior training or requiring particular features, we achieve around 1∼1.4 dB higher GNSDR than two state-of-the-art approaches by taking into account the low rank of music accompaniment and the sparsity of singing voices.

There are several ways in which this work could be extended, for example, (1) investigating dynamic parameter selection methods according to different contexts, or (2) expanding the current work to speech noise reduction, since in many situations the noise spectrogram is relatively low-rank while the speech spectrogram is relatively sparse.

6. REFERENCES

[1] C.-K. Wang, R.-Y. Lyu, and Y.-C. Chiang, "An automatic singing transcription system with multilingual singing lyric recognizer and robust melody tracker," in Proc. Interspeech, 2003, pp. 1197–1200.

[2] H. Fujihara, M. Goto, J. Ogata, K. Komatani, T. Ogata, and H. G. Okuno, "Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals," in Proc. ISM, Dec. 2006, pp. 257–264.

[3] A. Berenzweig, D. P. W. Ellis, and S. Lawrence, "Using voice segments to improve artist classification of music," in Proc. AES 22nd International Conference: Virtual, Synthetic, and Entertainment Audio, 2002.

[4] H. Fujihara and M. Goto, "A music information retrieval system based on singing voice timbre," in Proc. ISMIR, 2007, pp. 467–470.

[5] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proc. ISMIR, 2005, pp. 337–344.

[6] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1564–1578, July 2007.

[7] Y. Li and D. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1475–1487, May 2007.

[8] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, Feb. 2010.

[9] J.-L. Durrieu, G. Richard, B. David, and C. Févotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 564–575, March 2010.

[10] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in Proc. ICASSP, May 2011, pp. 221–224.

[11] H. Schenker, Harmony, University of Chicago Press, 1954.

[12] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, pp. 11:1–11:37, June 2011.

[13] Z. Lin, M. Chen, L. Wu, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, UIUC, Nov. 2009.

[14] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.



