Straight

535 views

Published on

VOCODER

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
535
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Straight

  1. 1. 9-10 August, 2002 Computational Audition 1 Fixed Point Representations forFixed Point Representations for Very High-Quality Speech andVery High-Quality Speech and Sound Modification SystemSound Modification System Hideki KawaharaHideki Kawahara Wakayama University, JapanWakayama University, Japan
  2. 2. 9-10 August, 2002 Computational Audition 2 SummarySummary nn Functional (computational after Marr)Functional (computational after Marr) approach is important and productive.approach is important and productive. nn Fixed points provide feature values asFixed points provide feature values as well as their reliability indices.well as their reliability indices. –– UsingUsing within channelwithin channel informationinformation nn Fixed point concept may provide clue toFixed point concept may provide clue to integrate Fourier based concept andintegrate Fourier based concept and wavelet-wavelet-MellinMellin transform based concept.transform based concept. Reference systemReference system ““LocalLocal”” center of gravitycenter of gravity
  3. 3. 9-10 August, 2002 Computational Audition 3 original STRAIGHT: demoSTRAIGHT: demo
  4. 4. 9-10 August, 2002 Computational Audition 4 STRAIGHT demo: morphingSTRAIGHT demo: morphing neutral angry interpolationextrapolation extrapolation Word: /hai/ (“Yes” in Japanese)
  5. 5. 9-10 August, 2002 Computational Audition 5 BackgroundBackground nn ““Auditory BrainAuditory Brain”” Project by CRESTProject by CREST –– Short term goal: speech processing systemsShort term goal: speech processing systems based on functional models of auditory functions.based on functional models of auditory functions. –– STRAIGHT: a very high-quality speechSTRAIGHT: a very high-quality speech manipulation systemmanipulation system –– Fixed point based algorithmsFixed point based algorithms (alternative way of dimensional reduction of(alternative way of dimensional reduction of auditory representations)auditory representations) –– Long term goal:Long term goal: ““computationalcomputational”” theorirdtheorird ofof auditionaudition nn Frustrations in ways how auditory modelsFrustrations in ways how auditory models are used in ASR and how speech processingare used in ASR and how speech processing systems are evaluated.systems are evaluated.
  6. 6. 9-10 August, 2002 Computational Audition 6 ““Auditory BrainAuditory Brain”” ProjectProject nn To develop a very high quality speechTo develop a very high quality speech and/or sound manipulation system basedand/or sound manipulation system based on perceptually relevant parameters andon perceptually relevant parameters and it does not preserve phase/waveformit does not preserve phase/waveform information.information. –– ?? Are distance based quality measures relevant???? Are distance based quality measures relevant?? –– ?? Why does periodic sound sounds smoother and?? Why does periodic sound sounds smoother and richer (in Auditory Fovea)??richer (in Auditory Fovea)?? –– ?? Is it relevant to test highly nonlinear speech?? Is it relevant to test highly nonlinear speech perception using elementary sounds ??perception using elementary sounds ??
  7. 7. 9-10 August, 2002 Computational Audition 7 Why high quality?Why high quality? nn Ecological approach for investigatingEcological approach for investigating highly nonlinear system, Humanhighly nonlinear system, Human Not necessarily be predictable from elementary test signals Necessary to use ecologically valid stimuli Naturalness
  8. 8. 9-10 August, 2002 Computational Audition 8 Hans Moravec: Robot, 2000, Oxford xbox iMac PlayStation-3 Key issue: compatibilityKey issue: compatibility Background figure is removed. Please visit Hans Moravec’s page for the original figure. Faster than exponential growth in computing power (Chapter 3: Power and Presence, Page 60) http://www.frc.ri.cmu.edu/~hpm/book98/
  9. 9. 9-10 August, 2002 Computational Audition 9 ““Auditory BrainAuditory Brain”” ProjectProject nn Computational theories of speech/auditoryComputational theories of speech/auditory perceptionperception –– ecological constraints on evolutionecological constraints on evolution –– It cannot be ad hoc.It cannot be ad hoc. »» When there is an elegant and reasonable algorithmWhen there is an elegant and reasonable algorithm and it does not violate ecological (biological andand it does not violate ecological (biological and environmental) constraints, there is no reason toenvironmental) constraints, there is no reason to deny that the algorithm shares the commondeny that the algorithm shares the common underlying principles with our auditory system.underlying principles with our auditory system.
  10. 10. 9-10 August, 2002 Computational Audition 10 ““Auditory BrainAuditory Brain”” ProjectProject nn Computational theories of speech/auditoryComputational theories of speech/auditory perceptionperception –– Periodicity: time-frequency sampling gridPeriodicity: time-frequency sampling grid –– Periodicity: stable reference point for wavelet-Periodicity: stable reference point for wavelet- MellinMellin transformtransform –– Log-linear frequency axisLog-linear frequency axis »» Wavelet-Wavelet-MellinMellin transform: shape and sizetransform: shape and size –– Why two ears? ICAWhy two ears? ICA –– Long term correlation (structure)Long term correlation (structure) »» ASR, musicASR, music
  11. 11. 9-10 August, 2002 Computational Audition 11 STRAIGHT a core technologySTRAIGHT a core technology nn Conceptually simple architectureConceptually simple architecture –– Channel VOCODERChannel VOCODER –– Source filter modelSource filter model nn Graded parameters (Graded parameters (vsvs binary decision)binary decision) –– Sensitivity analysisSensitivity analysis –– MorphingMorphing nn Reliability / TransparencyReliability / Transparency –– No post-processingNo post-processing –– Weakly constrained modelWeakly constrained model
  12. 12. 9-10 August, 2002 Computational Audition 12 Structure of STRAIGHT STRAIGHT: architectureSTRAIGHT: architecture
  13. 13. 9-10 August, 2002 Computational Audition 13 STRAIGHT a core technologySTRAIGHT a core technology nn Conceptually simple architectureConceptually simple architecture –– Channel VOCODERChannel VOCODER –– Source filter modelSource filter model nn Graded parameters (Graded parameters (vsvs binary decision)binary decision) –– Sensitivity analysisSensitivity analysis –– MorphingMorphing nn Reliability / TransparencyReliability / Transparency –– No post-processingNo post-processing –– Weakly constrained modelWeakly constrained model
  14. 14. 9-10 August, 2002 Computational Audition 14 Structure of STRAIGHT Spectral envelope estimation STRAIGHT: structureSTRAIGHT: structure
  15. 15. 9-10 August, 2002 Computational Audition 15 Weakly constrained spectralWeakly constrained spectral envelope estimationenvelope estimation waveform Time window Interferences in the time domain Interferences in the frequency domain
  16. 16. 9-10 August, 2002 Computational Audition 16 Weakly constrained spectralWeakly constrained spectral envelope estimationenvelope estimation Reduction of edge discontinuity Reduction of periodicity interference smoothing by spline basisComposite window
  17. 17. 9-10 August, 2002 Computational Audition 17 Time-frequency smoothing ( Time-frequency smoothing (current implementation)current implementation) F0 synchronous Gaussian window complimentary time window reduced interference spectrum F0 synchronous Gaussian window complimentary time window
  18. 18. 9-10 August, 2002 Computational Audition 18 Compensation of over-Compensation of over- smoothingsmoothing
  19. 19. 9-10 August, 2002 Computational Audition 19 Compensation of over-smoothingCompensation of over-smoothing
  20. 20. 9-10 August, 2002 Computational Audition 20 Weakly constrained spectralWeakly constrained spectral envelope estimationenvelope estimation
  21. 21. 9-10 August, 2002 Computational Audition 21 Weakly constrained spectralWeakly constrained spectral envelope estimationenvelope estimation
  22. 22. 9-10 August, 2002 Computational Audition 22 Fixed point based algorithmsFixed point based algorithms nn Fixed points in the frequency domain: → Fixed points in the frequency domain: → F0 extractionF0 extraction nn Fixed points in the time domain: → Fixed points in the time domain: → Excitation extractionExcitation extraction
  23. 23. 9-10 August, 2002 Computational Audition 23 Fixed point of mappingFixed point of mapping fixed point y x y=f(x) * Instantaneous frequency of a filter output around a sinusoidal component * Energy centroid of a windowed signal around an event Examples
  24. 24. 9-10 August, 2002 Computational Audition 24 Averaging and fixed pointAveraging and fixed point nn Prominent componentProminent component Window locations Average of windowed value Fixed point background
  25. 25. 9-10 August, 2002 Computational Audition 25 Averaging and fixed pointAveraging and fixed point nn Prominent componentProminent component Window locations Average of windowed value Fixed point background Parameters (position,slope,[level])
  26. 26. 9-10 August, 2002 Computational Audition 26 Structure of STRAIGHT F0 estimation STRAIGHT: structureSTRAIGHT: structure
  27. 27. 9-10 August, 2002 Computational Audition 27 window selection for reliablewindow selection for reliable representation of mappingrepresentation of mapping Refinement of Fo synchronous windows
  28. 28. 9-10 August, 2002 Computational Audition 28 Window with harmonicWindow with harmonic cancellationcancellation
  29. 29. 9-10 August, 2002 Computational Audition 29 Fixed-point-based sinusoidalFixed-point-based sinusoidal components extractioncomponents extraction
  30. 30. 9-10 August, 2002 Computational Audition 30 Fixed-point-based sinusoidalFixed-point-based sinusoidal frequency and C/N estimationfrequency and C/N estimation C/N information enablesC/N information enables optimum F0 estimation based onoptimum F0 estimation based on multiple harmonic componentsmultiple harmonic components
  31. 31. 9-10 August, 2002 Computational Audition 31 Approximate estimation of C/NApproximate estimation of C/N
  32. 32. 9-10 August, 2002 Computational Audition 32 Reliable built-in mechanismReliable built-in mechanism for fundamental component selectionfor fundamental component selection linearlinear filter arrangement log-linearlog-linear filter arrangement mapping filter output
  33. 33. 9-10 August, 2002 Computational Audition 33 Fixed points on C/N mapFixed points on C/N map Fundamental component
  34. 34. 9-10 August, 2002 Computational Audition 34 F0 evaluation based on EGGF0 evaluation based on EGG gross error W/O:0.72% with:0.32% female
  35. 35. 9-10 August, 2002 Computational Audition 35 Graded sourceGraded source InformationInformation nn Fixed point basedFixed point based FoFo extractionextraction (with C/N map)(with C/N map) F0 trajectoriesF0 trajectories (resolution: 1/F0)(resolution: 1/F0) C/N for each fixed point Graded aperiodicity information Is also extracted
  36. 36. 9-10 August, 2002 Computational Audition 36 Fixed points in the time domainFixed points in the time domain nn How to define auditory temporal eventsHow to define auditory temporal events –– Localized energyLocalized energy centroidcentroid Alternative representation
  37. 37. 9-10 August, 2002 Computational Audition 37 Fixed points in the time domainFixed points in the time domain Squared whitened signal Energy centroid Gaussian window Amount of energy concentration Speech waveform
  38. 38. 9-10 August, 2002 Computational Audition 38 waveform Energy centrold Window center Fixed points Fixed point based event detectionFixed point based event detection
  39. 39. 9-10 August, 2002 Computational Audition 39 Mean time duration Definition of event in the time domainDefinition of event in the time domain Event location Windowed whitened signal
  40. 40. 9-10 August, 2002 Computational Audition 40 Windowed event location andWindowed event location and the original event locationthe original event location Gaussian window Approximation of envelope Windowed location Original location Window location
  41. 41. 9-10 August, 2002 Computational Audition 41 Slope at fixed point duration Window parameter Duration can be estimated fromDuration can be estimated from the geometrical parameter atthe geometrical parameter at the fixed pointthe fixed point
  42. 42. 9-10 August, 2002 Computational Audition 42 Equivalence between the time domainEquivalence between the time domain definition and the frequency domaindefinition and the frequency domain definitiondefinition waveform Time domain definition Frequency domain definition Group delay
  43. 43. 9-10 August, 2002 Computational Audition 43 Inverse problem:Inverse problem: Where is the excitation?Where is the excitation? Minimum phase response Event as the energy centroid Excitation (impulse) compensation
  44. 44. 9-10 August, 2002 Computational Audition 44 Equivalence in definitionsEquivalence in definitions nn Frequency domain definition of the event locationFrequency domain definition of the event location nn Assuming causalityAssuming causality Group delay
  45. 45. 9-10 August, 2002 Computational Audition 45 Group delay of a minimum phaseGroup delay of a minimum phase responseresponse を介した計算Cepstrum
  46. 46. 9-10 August, 2002 Computational Audition 46 Compensation based onCompensation based on minimum phase group delayminimum phase group delay Observed group delay Causal group delay Compensated event location Compensated event duration
  47. 47. 9-10 August, 2002 Computational Audition 47 example Observed group delay Minimum phase group delay Compensated group delay
  48. 48. 9-10 August, 2002 Computational Audition 48 Excitation estimation based on fixedExcitation estimation based on fixed point based event detectionpoint based event detection Event based concentration Excitation based concentration Energy centroid Compensated group delay excitation Vocal fold closure
  49. 49. 9-10 August, 2002 Computational Audition 49 Excitation extraction accuracyExcitation extraction accuracy Standard deviation
  50. 50. 9-10 August, 2002 Computational Audition 50 Estimated excitation Speech waveform Multiple resolution display ofMultiple resolution display of events (fixed points)events (fixed points)
  51. 51. 9-10 August, 2002 Computational Audition 51 Multiple resolution display ofMultiple resolution display of events (fixed points)events (fixed points) demo Fixed points due to one excitation aligns on a straight line
  52. 52. 9-10 August, 2002 Computational Audition 52 Phase map of wavelet transformPhase map of wavelet transform
  53. 53. 9-10 August, 2002 Computational Audition 53 Instantaneous frequency basedInstantaneous frequency based fixed pointsfixed points
  54. 54. 9-10 August, 2002 Computational Audition 54 Instantaneous frequency basedInstantaneous frequency based fixed pointsfixed points
  55. 55. 9-10 August, 2002 Computational Audition 55 Group delay based fixed pointsGroup delay based fixed points
  56. 56. 9-10 August, 2002 Computational Audition 56 Group delay based fixed pointsGroup delay based fixed points
  57. 57. 9-10 August, 2002 Computational Audition 57 Structure of STRAIGHT Source attribute control STRAIGHT: structureSTRAIGHT: structure
  58. 58. 9-10 August, 2002 Computational Audition 58 Group delay manipulated mixed-mode excitation source Group delay manipulated mixed-mode excitation source group delay asymmetry impulse response ..provides continuous coverage from pulse train to random noise
  59. 59. 9-10 August, 2002 Computational Audition 59 SummarySummary nn Functional (computational after Marr)Functional (computational after Marr) approach is important and productive.approach is important and productive. nn Fixed points provide feature values asFixed points provide feature values as well as their reliability indices.well as their reliability indices. –– UsingUsing within channelwithin channel informationinformation nn Fixed point concept may provide clue toFixed point concept may provide clue to integrate Fourier based concept andintegrate Fourier based concept and wavelet-wavelet-MellinMellin transform based concept.transform based concept.
  60. 60. 9-10 August, 2002 Computational Audition 60 ColleaguesColleagues nn Haruhiro KatayoseHaruhiro Katayose, Toshio, Toshio IrinoIrino,, TakanobuTakanobu NishiuraNishiura (Wakayama(Wakayama UnivUniv.).) nn MinoruMinoru TsuzakiTsuzaki, Hideki, Hideki IwasawaIwasawa (ATR)(ATR) nn ParhamParham ZolfaghariZolfaghari (NTT)(NTT) nn Kiyohiro ShikanoKiyohiro Shikano, Hiroshi, Hiroshi SaruwatariSaruwatari (NAIST)(NAIST) nn Fumitada ItakuraFumitada Itakura, Kazuya Takeda, Shoji, Kazuya Takeda, Shoji kajitakajita, Hideki, Hideki BannoBanno (CIAIR, Nagoya(CIAIR, Nagoya UnivUniv.).) nn MasatoMasato AkagiAkagi, Masashi, Masashi UnokiUnoki (JAIST)(JAIST) nn Seiichi Nakagawa (Seiichi Nakagawa (ToyohashiToyohashi Inst. Tech)Inst. Tech) nn ShigekiShigeki SagayamaSagayama, Nobuaki, Nobuaki MinematsuMinematsu ((UnivUniv. Tokyo). Tokyo) nn DianeDiane KewleyKewley-Port (Indiana-Port (Indiana UnivUniv. USA). USA) nn Osamu Fujimura (Ohio stateOsamu Fujimura (Ohio state UnivUniv. USA). USA) nn Alain deAlain de CheveignCheveignéé (IRCAM, France)(IRCAM, France) nn Roy D. Patterson (CNBH, UK)Roy D. Patterson (CNBH, UK)
  61. 61. 9-10 August, 2002 Computational Audition 61 ReferencesReferences nn Hideki Kawahara,Hideki Kawahara, IkuyoIkuyo Masuda-Masuda-KatsuseKatsuse and Alain deand Alain de CheveigneCheveigne: Restructuring: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and anspeech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of ainstantaneous-frequency-based F0 extraction: Possible role of a reptitivereptitive structure in sounds, Speech Communication, 27, pp.187-207 (1999).structure in sounds, Speech Communication, 27, pp.187-207 (1999). nn Hideki Kawahara,Hideki Kawahara, Haruhiro KatayoseHaruhiro Katayose, Alain de, Alain de CheveigneCheveigne, Roy D. Patterson:, Roy D. Patterson: Fixed Point Analysis of Frequency to Instantaneous Frequency Mapping forFixed Point Analysis of Frequency to Instantaneous Frequency Mapping for Accurate Estimation of F0 and Periodicity , Proc. EUROSPEECH'99, Volume 6,Accurate Estimation of F0 and Periodicity , Proc. EUROSPEECH'99, Volume 6, Page 2781-2784 (1999).Page 2781-2784 (1999). nn Hideki Kawahara, YoshinoriHideki Kawahara, Yoshinori AtakeAtake and Parhamand Parham ZolfaghariZolfaghari: Accurate vocal event: Accurate vocal event detection method based on a fixed-point to weighted average group delay,detection method based on a fixed-point to weighted average group delay, ICSLP-2000, Beijing, pp.664-667 2000.ICSLP-2000, Beijing, pp.664-667 2000. nn H. Kawahara and PH. Kawahara and P ZolfaghariZolfaghari: Systematic F0 glitches around vowel nasal: Systematic F0 glitches around vowel nasal transitions, EUROSPEECH'2001, pp.2459-2462, 2001.transitions, EUROSPEECH'2001, pp.2459-2462, 2001. nn H. Kawahara, JoH. Kawahara, Jo EstillEstill and O. Fujimura:and O. Fujimura: AperiodicityAperiodicity extraction and control usingextraction and control using mixed mode excitation and group delay manipulation for a high quality speechmixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT, MAVEBA 2001,analysis, modification and synthesis system STRAIGHT, MAVEBA 2001, Sept.13-15,Sept.13-15, FirentzeFirentze Italy, 2001.Italy, 2001. nn H. Kawahara and H.H. Kawahara and H. KatayoseKatayose: Scat generation research program based on: Scat generation research program based on STRAIGHT, a high-quality speech analysis, modification and synthesis system,STRAIGHT, a high-quality speech analysis, modification and synthesis system, J. IPSJ, 43, 2, pp.208-218 2002. (in Japanese)J. IPSJ, 43, 2, pp.208-218 2002. (in Japanese)
  62. 62. 9-10 August, 2002 Computational Audition 62 For computationalFor computational ““AuditionAudition”” seed#1 seed#2 F0 trajectory andF0 trajectory and frequency axis modificationfrequency axis modification Parts preparationParts preparation Mixing and level adjustmentMixing and level adjustment
  63. 63. 9-10 August, 2002 Computational Audition 63 Nonlinear time warping basedNonlinear time warping based on phase of the F0 componenton phase of the F0 component (FM pulse train)(FM pulse train) without time warpingwithout time warping with time warpingwith time warping
  64. 64. 9-10 August, 2002 Computational Audition 64 Nonlinear time warping basedNonlinear time warping based on phase of the F0 componenton phase of the F0 component (vowel sequence /(vowel sequence /aiueoaiueo/)/) without time warpingwithout time warping with time warpingwith time warping

×