Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model

Yao Yao
MSDS 7335 402
Machine Learning Comparison Project
Introduction
Contemporary music is created by layering and mixing various vocals over different instrumentals
[Figure 1]. Separation can be messy and is mostly supervised because unsupervised results can be trivial,
where the original isolated sounds could be commonplace. Music theory suggests that a collective
synchronization is necessary for different components to form melodies and create the sensation of an
ordered harmony, which allows the sense of hearing to separate disordered noise from musical
sophistication [2].
Figure 1: Pro Tools can be used to mix and layer different sound components, such as bass and lead
vocals, into a full contemporary song [3]. The bottom "LeadVoxENG" track shows uneven sound manipulations.
Sounds are overlaid on top of each other on a 3D scale (time, frequency, and amplitude) and can be
visualized by a spectrogram, but separation by percussion, harmonics, or various unsupervised spectral
signatures is arbitrary and could be recognized as borderline noise instead of a collective syncopated
harmony. There are no adequate validation tracks for arbitrary unsupervised separation. The goal is to
have collective harmony in the separation, where the structural sound integrity also needs to be intact
after separation for the sound texture to sound as full as the original. Machine learning can be applied
to separate vocals from instrumentals, which has validation tracks and a real-life application because
DJs can use various a cappellas on top of various instrumentals to create blend tapes [4].
Audio separation techniques existed prior to machine learning and could be enhanced by it, such as the
clustering repeating period technique and the trained Hidden Markov Model. Techniques can be validated
with official versions of the separated audio, where single CDs provide the main mix and the separated
instrumental and a cappella versions for their respective validation. Listening to high-accuracy results
can be enjoyable, and the best techniques can be applied to audio where official versions of the
separated instrumental and a cappella are unavailable.
Dataset
Promotional CD and vinyl distributions of single songs, where the main mix is accompanied by the
instrumental and a cappella versions, can be used for the do-it-yourself (DIY) separation and the
validation of the results. In this real-life example, different separation techniques are applied to a
song released in 1998 by Aaliyah called "Are You That Somebody" with a song length of 4 minutes and
28 seconds [5]. The compressed .mp3 file has a bit rate of 128 kbit/s, where the 3.7 MB file is converted
into .wav to be machine readable at an uncompressed size of 23.1 MB. The outputs of the separated DIY
instrumental and a cappella are uncompressed .wav files of about 23 MB each for the Python validation
process and are then converted to .mp3 at a size of about 4 MB. The validation .mp3s are from the same
single CD, where the official instrumental and a cappella are available at 3.4 MB and 3.5 MB, respectively.
The Multimedia Information Retrieval dataset (MIR-1K) has 1000 pre-labeled clips of a cappella,
instrumental, and both for the 70% training, 20% testing, and 10% validation process [6]. The .wav files
(516 MB total) have a stereo sample rate of 16 kHz, where male and female vocals last 4 to 13 seconds,
and are used to create a supervised Hidden Markov Model with their respective spectral patterns.
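
The 70/20/10 partition described above can be sketched as follows. This is a minimal illustration, not the procedure used in the project: the file names, the `split_dataset` helper, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_dataset(items, train=0.7, test=0.2, seed=0):
    """Shuffle the items and split them into train, test, and validation lists.

    The validation share is whatever remains after the train and test shares.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

# Hypothetical clip names standing in for the 1000 labeled .wav files.
clips = [f"clip_{i:04d}.wav" for i in range(1000)]
train_set, test_set, val_set = split_dataset(clips)
print(len(train_set), len(test_set), len(val_set))
```

Shuffling before splitting keeps male and female clips from clustering in one partition if the files happen to be ordered by singer.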

Figure 2: A 35-second sample of the spectrogram at 1:28 of the song. dB and Hz structure can maintain
similar magnitude across separated audio. A threshold allowance
Upon plotting the spectrograms of the full song, official instrumental, and official a cappella from the
CD files, we see that horizontal lines are more likely instrumentals and curved lines are more likely
vocals [Figure 2]. The dB and Hz structure can maintain similar magnitude across the separated audio, so
the separation technique is not a simple subtraction where vocals are carved out. The sound integrity
does not diminish after separation, and a threshold allowance called 'masking' is needed to gauge how
much of the sound structure is allowed for each of the separated spectrograms to keep its structural
sound integrity intact.
Simple subtraction separation techniques can lead to undesired results, where overcompensating for one
output can deteriorate the quality of the other, resulting in grainy, robotic, or underwater-sounding
vocals or instrumentals [Figure 3].
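
The contrast between subtraction and masking can be illustrated with a toy magnitude spectrogram. This is a sketch under stated assumptions, not the code used for the figures: the function names, the random toy data, and the ratio-style soft mask are illustrative choices that the paper does not specify.

```python
import numpy as np

def subtract_separation(mix_mag, inst_estimate):
    """Naive approach: subtract the instrumental estimate and clip negatives to zero."""
    vocals = np.maximum(mix_mag - inst_estimate, 0.0)
    return inst_estimate, vocals

def soft_mask_separation(mix_mag, inst_estimate, eps=1e-12):
    """Masking approach: share each bin's magnitude between the two outputs."""
    inst_estimate = np.minimum(inst_estimate, mix_mag)   # estimate cannot exceed the mix
    mask = inst_estimate / (mix_mag + eps)                # values between 0 and 1
    return mask * mix_mag, (1.0 - mask) * mix_mag

rng = np.random.default_rng(0)
mix = rng.random((4, 6)) + 0.5                    # toy spectrogram (frequency x time)
estimate = 0.6 * mix + 0.3 * rng.random((4, 6))   # imperfect instrumental estimate

inst_a, voc_a = subtract_separation(mix, estimate)
inst_b, voc_b = soft_mask_separation(mix, estimate)
```

With the mask, the two outputs always add back up to the mixture, so energy taken from one output is given to the other instead of being lost or duplicated. With subtraction, an overestimated instrumental clips the vocal bins to zero while the instrumental keeps more energy than the mix contained.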

Figure 3: A 35-second sample of the spectrogram at 1:28 of the song after simple subtraction. Consequences
of overcompensating for one output can deteriorate the other in quality.
Method
For the clustering repeating period procedure, the original spectrogram is used to create a median
repeating segment per time period (p), where the period is optimized by the instrumental pattern that
repeats [Figure 4] [7]. The separated instrumental is created from how far from the median repeating
spectrogram the signal is allowed to repeat along the original, while the separated a cappella is created
from the outliers beyond a threshold from the median repeating DIY instrumental. Thresholds are needed
because it is not a pure subtraction separation, where the values used to separate the instrumental and
to separate the a cappella could be different [6].

Figure 4: DIY instrumental: how far from the median repeating spectrogram the signal is allowed to repeat
along the original. DIY a cappella: outliers outside of a threshold from the median repeating DIY instrumental.
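
The median repeating segment idea can be sketched on a toy spectrogram as follows. This is a simplified illustration in the spirit of the repeating pattern technique cited in [7], not the project's implementation: the `repeating_separation` function, the single `inst_margin` knob, and the toy data are assumptions, and the separate instrumental and a cappella thresholds described in the text are reduced to one parameter here.

```python
import numpy as np

def repeating_separation(mag, period, inst_margin=1.0, eps=1e-12):
    """Split a magnitude spectrogram (frequency x time) into repeating and
    non-repeating parts using the median over segments of length `period` frames.

    inst_margin is a hypothetical scale on the median model, standing in for the
    manually tuned thresholds.
    """
    n_freq, n_time = mag.shape
    n_seg = int(np.ceil(n_time / period))
    # Pad with NaN so a partial last segment is ignored by nanmedian.
    padded = np.full((n_freq, n_seg * period), np.nan)
    padded[:, :n_time] = mag
    segments = padded.reshape(n_freq, n_seg, period)
    median_segment = np.nanmedian(segments, axis=1)       # shape (frequency, period)
    # Tile the median segment back along the full song length.
    model = np.tile(median_segment, (1, n_seg))[:, :n_time]
    # The repeating part can never exceed what is present in the mixture.
    repeating = np.minimum(mag, inst_margin * model)
    mask = repeating / (mag + eps)
    return mask * mag, (1.0 - mask) * mag

rng = np.random.default_rng(1)
pattern = rng.random((8, 10))            # one period of toy "instrumental"
backing = np.tile(pattern, (1, 5))       # the pattern repeats 5 times (50 frames)
vocal = np.zeros_like(backing)
vocal[2, 12:18] = 2.0                    # a short non-repeating event
inst, voc = repeating_separation(backing + vocal, period=10)
```

Because the toy vocal appears in only one of the five segments, the median ignores it and the repeating model recovers the backing pattern. A chorus that repeats as often as the instrumental would not be ignored in this way.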
For the supervised Markov model procedure, the labeled spectrogram dictionary of vocals and music is
permuted by overlaying sounds together and by sequencing sounds together [Figure 5] [8]. The
probability densities are calculated for the transitional likelihood of both sound overlays and
sequencing and are matched with the highest-probability pattern alignment with the song spectrogram [9].
The instrumental and a cappella are separated from the full song by subtraction with the matched sound
overlay, where the results are dependent on the quality and diversity of the dataset.
Figure 5: A Hidden Markov Model is used to find the transition probabilities of certain sequencing and
certain overlays of labeled sound from a labeled dictionary.
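
Finding the highest-probability state alignment can be illustrated with the standard Viterbi algorithm on a two-state toy model. This is a sketch only: the two states, the two observation symbols, and all probabilities below are made-up assumptions, whereas the project's model is trained on a labeled spectrogram dictionary.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    """Most likely hidden-state path for a discrete-observation HMM (log domain)."""
    n_states = log_start.shape[0]
    n_obs = len(observations)
    score = np.empty((n_obs, n_states))
    backptr = np.zeros((n_obs, n_states), dtype=int)
    score[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, n_obs):
        # candidate[i, j]: best score ending in state i at t-1, then moving to j.
        candidate = score[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(candidate, axis=0)
        score[t] = (candidate[backptr[t], np.arange(n_states)]
                    + log_emit[:, observations[t]])
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical model: state 0 = instrumental only, state 1 = vocals over instrumental.
# Symbol 0 = frame dominated by horizontal lines, symbol 1 = frame with curved lines.
log_start = np.log(np.array([0.6, 0.4]))
log_trans = np.log(np.array([[0.9, 0.1],
                             [0.1, 0.9]]))
log_emit = np.log(np.array([[0.8, 0.2],
                            [0.2, 0.8]]))

frames = [0, 0, 0, 1, 1, 1, 1, 0, 0]
states = viterbi(log_start, log_trans, log_emit, frames)
```

The sticky transition probabilities mean that a single stray frame does not flip the decoded state, while a sustained run of vocal-like frames does.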

For the validation procedure, the normalized distance between spectrograms is applied for the
Mel-Frequency Cepstral Coefficients (MFCC) comparison [Figure 6] [10]. The comparison procedure first
takes the Fourier transform of the audio and maps the frequencies onto the mel (perceived pitch) scale.
Then a discrete cosine transform is applied to find the path and cost of comparison. The normalized
difference is taken to find the absolute difference of how much distance the path deviates from a perfect
45° diagonal [10]. Both path and cost are plotted to show how the transformed data compare on time vs.
time axes. After the transformations, data integrity is kept but the units become abstract, where the
normalized distance between the full song and itself is 0.
Figure 6a and 6b: The MFCC comparison baseline test shows the scalar distance between the song and
instrumental is 143.34 and that for the song and a cappella is 146.31.
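
A path-and-cost comparison of this kind can be sketched with dynamic time warping over two sequences of feature frames. This assumes that the path and cost come from a dynamic-time-warping style alignment, which the text does not name. The `dtw_distance` function, the normalization by the combined frame count, and the random stand-in features are illustrative, so the numbers it produces are not comparable to the distances reported in this paper. In practice the frames would be MFCC vectors from an audio library.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping cost between two feature sequences (frames x coefficients),
    normalized by the combined number of frames."""
    n, m = len(a), len(b)
    # Euclidean distance between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

rng = np.random.default_rng(0)
song = rng.normal(size=(40, 13))                          # stand-in for 40 frames of 13 MFCCs
other = song + rng.normal(scale=0.5, size=song.shape)     # a perturbed version

same = dtw_distance(song, song)      # identical inputs align on the diagonal at zero cost
diff = dtw_distance(song, other)
```

Comparing a sequence with itself gives a cost of exactly zero because the optimal path is the diagonal, which matches the property stated above that the distance between the full song and itself is 0.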
As a baseline test, the scalar distance between the song and instrumental is 143.34 and that for the song
and a cappella is 146.31 [Figure 6]. The baseline numbers suggest that the distance between good DIY
separations and the actual versions, respectively, should range from around 145 down to 0. In the case of
the bad example, the distance from the real instrumental is 135 and the distance from the real a cappella
is 285 for their respective DIY separations [Figure 3].
Results
For the clustering repeating period method, the recurrence matrix is plotted using a diagonal
redundancy similar to the MFCC transformations for the k-means clustering algorithm to recognize
similar structural components of the periodic repetition of the song [Figure 7]. The k-means clusters are
then overlaid on top of the full song spectrogram, where the median spectrogram is optimized at a
period that repeats every 6.9334 seconds [1].

Figure 7a and 7b: Repetition visualized by the recurrence matrix can be clustered by k-means.
Increase clusters to find optimal repetition via spectrogram overlay.
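
How a recurrence matrix exposes a repeating period can be sketched as follows. Note the substitution: instead of the k-means clustering step described above, this toy example simply averages each off-diagonal of the matrix and picks the shortest lag with the highest mean similarity. The function names, the cosine similarity measure, and the toy features are assumptions.

```python
import numpy as np

def recurrence_matrix(features):
    """Cosine similarity between every pair of frames (frames x coefficients)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def estimate_period(sim, min_lag=1, tol=1e-9):
    """Return the shortest lag (in frames) whose diagonal has the highest mean similarity."""
    n = sim.shape[0]
    lags = np.arange(min_lag, n // 2 + 1)
    scores = np.array([np.mean(np.diagonal(sim, offset=int(lag))) for lag in lags])
    # Multiples of the true period score equally well, so take the first near-maximum.
    best = np.flatnonzero(scores >= scores.max() - tol)[0]
    return int(lags[best])

rng = np.random.default_rng(2)
pattern = rng.random((7, 12))            # 7 frames of toy features
features = np.tile(pattern, (6, 1))      # the pattern repeats 6 times (42 frames)
sim = recurrence_matrix(features)
period_frames = estimate_period(sim)
```

A period in frames converts to seconds by multiplying by the hop length and dividing by the sample rate of the analysis, both of which depend on how the spectrogram was computed.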
After optimizing parameters by ear, the normalized distance between the separated instrumental and actual
is 98.08, where the instrumental margin is 1 and the a cappella margin is 0.5 [Figure 8]. The normalized
distance between the separated a cappella and actual is 101.20, where the instrumental margin is 1 and the
a cappella margin is 0.6 [Figure 9]. The fullness of the instrumental spectrum is preserved because of the
median repeating period and the manually adjusted masking parameters. The a cappella separation has
gaps in the spectrum despite the median threshold, perhaps because the median period also captured
the redundant chorus that repeats over the song as well.
Figure 8: Clustering Repeating Period technique: normalized distance between separated instrumental
and actual is 98.08, where the instrumental margin is 1 and the a cappella margin is 0.5.

Figure 9: Clustering Repeating Period technique: normalized distance between separated a cappella and
actual is 101.20, where the instrumental margin is 1 and the a cappella margin is 0.6.
For the supervised Markov model, there is a time-intensive approach to label, train, and iterate through
the labeled dataset, and the separation is dependent on the quality and diversity of the labeled dataset.
The normalized distance between the separated instrumental and actual is 100.6 [Figure 10] and the
normalized distance between the separated a cappella and actual is 111.6 [Figure 11]. The gaps in
separation for both instrumental and a cappella are the result of misinterpreted matched probabilities
that could be alleviated with sounds that match the actual song.

Figure 10: Hidden Markov Model: normalized distance between separated instrumental and actual is
100.6.
Figure 11: Hidden Markov Model: normalized distance between separated a cappella and actual is 111.6.
Conclusion
For the clustering repeating period with optimization, the advantages include that the algorithm is fast
and the dataset is self-contained in the song file, whereas the disadvantages include that the sample
limitations could result in an insufficient median and that it has to be manually optimized by ear. The
distance between the separated instrumental and actual is 98.08 and that for the separated a cappella
and actual is 101.20 for the clustering repeating period method.
For the supervised Hidden Markov Model with sound overlay and sequencing, the advantages include
that the separation could be very precise depending on the dataset, whereas the disadvantages include
odd separation artifacts from probabilistic matching and that the procedure is very time intensive with
labeling, training, and iteration of the labeled dataset onto the song. The distance between the separated
instrumental and actual is 100.6 and that for the separated a cappella and actual is 111.6 for the
supervised Hidden Markov Model.
From this comparison, clustering repeating period with optimization is the better machine learning
method because of less time spent, more accurate results against the validation tracks, and the
generality of the algorithm, which can be applied to unique songs without having to label more training
datasets.

Citations
1. Y. Yao. "Audio Separation via Clustering Repeating Period vs. Hidden Markov Model," GitHub,
2018. [Online]. Available: https://github.com/yaowser/audio-separation [Accessed 6-Jun-2018]
2. G. Elert. "Music & Noise," The Physics Hypertextbook, 2018. [Online]. Available:
https://physics.info/music/ [Accessed 6-Jun-2018]
3. "Pro Tools 12 Professional Annual Subscription," Amazon, 2018. [Online]. Available:
https://www.amazon.com/Pro-Tools-Professional-Annual-Subscription/dp/B00V540NKW
[Accessed 6-Jun-2018]
4. M. Weiss. "Tips for Mixing Vocals to an Instrumental," Pro Audio Files, 2012. [Online]. Available:
https://theproaudiofiles.com/tips-for-mixing-vocals-to-a-two-track-instrumental/
[Accessed 6-Jun-2018]
5. "Aaliyah – Are You That Somebody?" Discogs, 2008. [Online]. Available:
https://www.discogs.com/Aaliyah-Are-You-That-Somebody/release/346060
[Accessed 6-Jun-2018]
6. C. Hsu. "MIR-1K Dataset," Google Sites, 2011. [Online]. Available:
https://sites.google.com/site/unvoicedsoundseparation/mir-1k [Accessed 6-Jun-2018]
7. Z. Rafii. "REpeating Pattern Extraction Technique," Zafar Rafii, 2018. [Online]. Available:
http://zafarrafii.com/#REPET [Accessed 6-Jun-2018]
8. J. Han. "Audio Imputation," Northwestern University, 2012. [Online]. Available:
http://www.cs.northwestern.edu/~jha222/imputation [Accessed 6-Jun-2018]
9. A. Lloyd. "Hidden Markov Models in Practice," Slide Player, 2015. [Online]. Available:
http://slideplayer.com/slide/4757315/ [Accessed 6-Jun-2018]
10. T. Tourani. "Comparing Audio Files Python," GitHub, 2015. [Online]. Available:
https://github.com/d4r3topk/comparing-audio-files-python [Accessed 6-Jun-2018]
