Speaker Segmentation (2006)

A presentation about a Speaker Segmentation Project developed at INESC Porto back in 2006

Transcript

  • 1. Real-time Automatic Speaker Segmentation. Luís Gustavo Martins, UTM – INESC Porto ([email_address], http://www.inescporto.pt/~lmartins). LabMeetings, INESC Porto, March 16, 2006
  • 2. Notice
    • This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit
    • http://creativecommons.org/licenses/by-sa/2.5/pt/
    • or send a letter to
    • Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
  • 3. Summary
    • Summary
      • System Overview
      • Audio Analysis front-end
      • Speaker Coarse Segmentation
      • Speaker Change Validation
      • Speaker Model Update
      • Experimental Results
      • Achievements
      • Conclusions
  • 4. Scope
    • Objective
      • Development of a Real-time, Automatic Speaker Segmentation module
        • Already designed with future development in mind:
          • Speaker Tracking
          • Speaker Identification
    • Challenges
      • No prior knowledge of the number or identities of the speakers
      • On-line and Real-time operation
        • Audio data is not available beforehand
        • Must only use small amounts of arriving speaker data
        • Iterative and computationally intensive methods are infeasible
  • 5. System Overview
  • 6. Audio Analysis front-end
  • 7. Audio Analysis front-end
    • Front-end Processing
      • 8kHz, 16 bit, pre-emphasized, mono speech streams
      • 25ms analysis frames with no overlap
      • Speech segments with 2.075 secs and 1.4 secs overlap
        • Consecutive sub-segments with 1.375 secs each
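For concreteness, a minimal sketch (not project code; all constants come from the bullets above) working out the frame counts implied by this geometry:

```cpp
// Front-end timing sketch: 25 ms frames, 2.075 s segments, 1.375 s sub-segments.
#include <cstdio>

int main() {
    const double frame_s      = 0.025;  // analysis frame length, no overlap
    const double segment_s    = 2.075;  // speech segment length
    const double subsegment_s = 1.375;  // consecutive sub-segment length

    // 2.075 / 0.025 = 83 frames per segment; 1.375 / 0.025 = 55 frames per
    // sub-segment -- the same 55-data-point figure the later slides rely on.
    std::printf("frames per segment:     %.0f\n", segment_s / frame_s);
    std::printf("frames per sub-segment: %.0f\n", subsegment_s / frame_s);
    return 0;
}
```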
  • 8. Audio Analysis front-end
    • Feature Extraction (1)
      • Speaker Modeling
        • 10th-order LPC / LSP
          • Source / Filter approach
        • Other possible features…
          • MFCC
          • Pitch
    [Diagram: source-filter model of speech production (SOURCE → FILTER)]
  • 9. Audio Analysis front-end
    • LPC Modeling (1) [Rabiner93, Campbell97]
      • Linear Predictive Coding
        • Order p
    Autocorrelation method: the predictor coefficients satisfy the Yule-Walker equations, $\sum_{k=1}^{p} a_k R(|i-k|) = R(i)$ for $i = 1, \dots, p$; the autocorrelation matrix is Toeplitz, so the system is solved efficiently with Durbin's recursive algorithm.
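The slide's derivation did not survive the transcript; as a hedged illustration of the autocorrelation method it names (the function `lpc` and its signature are mine, not the project's C++ routines), a compact Levinson-Durbin recursion:

```cpp
#include <vector>

// Autocorrelation method: estimate order-p LPC coefficients a_1..a_p for a
// windowed speech frame x, so that x[i] ~= sum_k a_k * x[i-k].
// The Toeplitz structure of the Yule-Walker system enables Durbin's recursion.
std::vector<double> lpc(const std::vector<double>& x, int p) {
    const int n = static_cast<int>(x.size());
    std::vector<double> R(p + 1, 0.0);              // autocorrelation R(0..p)
    for (int lag = 0; lag <= p; ++lag)
        for (int i = lag; i < n; ++i)
            R[lag] += x[i] * x[i - lag];

    std::vector<double> a(p + 1, 0.0);              // a[0] unused (== 1)
    double err = R[0];                              // prediction error E_0
    for (int m = 1; m <= p; ++m) {
        double k = R[m];                            // reflection coefficient
        for (int j = 1; j < m; ++j) k -= a[j] * R[m - j];
        k /= err;
        std::vector<double> prev(a);                // snapshot of a^(m-1)
        a[m] = k;
        for (int j = 1; j < m; ++j) a[j] = prev[j] - k * prev[m - j];
        err *= (1.0 - k * k);                       // E_m = E_{m-1} (1 - k^2)
    }
    return a;                                       // A(z) = 1 - sum a_k z^-k
}
```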
  • 10. Audio Analysis front-end
    • LPC Modeling (2)
    Whitening filter → pitch. [Figure: LPC spectrum overlaid on the FFT spectrum]
  • 11. Audio Analysis front-end
    • LSP Modeling [Campbell97]
      • Line Spectral Pairs
        • More robust to quantization, hence commonly used in speech coding
      • Derived from the LPC a_k coefficients
        • Zeros of A(z) mapped to the unit circle in the Z-Domain
        • Use of a pair of (p+1)-order polynomials
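The standard construction behind these bullets, restated in LaTeX for completeness (the slide's own formulas are not in the transcript):

```latex
% Sum and difference polynomials of order p+1, built from A(z):
P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \qquad
Q(z) = A(z) - z^{-(p+1)} A(z^{-1})
% The zeros of P(z) and Q(z) are interleaved on the unit circle;
% their angles are the LSP frequencies.
```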
  • 12. Audio Analysis front-end
    • Speaker Modeling
      • Speaker information is mostly contained in the voiced part of the speech signal…
        • Can you identify who's speaking?
      • LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
        • Unvoiced/Silence data degrades speaker model accuracy!
          • → Select only voiced data for processing…
    [Figure: speech waveform with unvoiced and voiced speech frames marked]
  • 13. Audio Analysis front-end
    • Voiced / Unvoiced / Silence (V/U/S) detection
      • Feature Extraction (2)
        • Short Time Energy (STE)
          • → silence detection
        • Zero Crossing Rate (ZCR)
          • → voiced / unvoiced detection
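A minimal sketch of these two features over a single 25 ms frame (illustrative code, not the project's implementation; the function names are mine):

```cpp
#include <cstddef>
#include <vector>

// Short Time Energy: mean squared amplitude of the frame.
// Silence frames have very low STE.
double short_time_energy(const std::vector<double>& frame) {
    double energy = 0.0;
    for (double s : frame) energy += s * s;
    return energy / static_cast<double>(frame.size());
}

// Zero Crossing Rate: fraction of adjacent sample pairs that change sign.
// Unvoiced (noise-like) speech has a high ZCR, voiced speech a low one.
double zero_crossing_rate(const std::vector<double>& frame) {
    std::size_t crossings = 0;
    for (std::size_t i = 1; i < frame.size(); ++i)
        if ((frame[i - 1] >= 0.0) != (frame[i] >= 0.0)) ++crossings;
    return static_cast<double>(crossings) / static_cast<double>(frame.size() - 1);
}
```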
  • 14. Audio Analysis front-end
    • V/U/S speech classes modeled by 2-D Gaussian Distributions
      • Simple and fast → real-time operation
      • Dataset:
        • ~4 minutes of manually annotated speech signals
        • 2 male and 2 female Portuguese speakers
    [Scatter plot: frames in the (ZCR, STE) plane, showing the voiced, unvoiced and silence clusters]
  • 15. Audio Analysis front-end
    • Manual Annotation of V/U/S segments in a speech signal
  • 16. Audio Analysis front-end
    • V/U/S Speech dataset
      • Voiced / Unvoiced / Silence stratification in manually segmented audio files
    (Total time per speaker = 60 secs)

    Speaker              Voiced         Unvoiced       Silence        V/(V+U)   U/(V+U)
    Portuguese Male 1    37 s (62%)     12 s (20%)     10 s (17%)     76%       24%
    Portuguese Male 2    30 s (50.0%)   14 s (23.3%)   13 s (21.6%)   68%       32%
    Portuguese Female 1  32 s (53.3%)   17 s (28.3%)   10 s (17%)     65.3%     32.7%
    Portuguese Female 2  30 s (50.0%)   19 s (31.6%)   10 s (17%)     61.2%     38.7%
  • 17. Audio Analysis front-end
    • Automatic Classification of V/U/S speech frames:
      • 10-fold Cross-Validation
        • Confusion matrix:
      • Some voiced frames are being discarded as unvoiced …
        • Waste of relevant and scarce data…
      • A few unvoiced and silence frames are being misclassified as voiced
        • Contamination of the data to be analyzed
    Confusion matrix (rows: true class; columns: classified as):

                 voiced    unvoiced   silence
    voiced       92.32%    4.17%      0.41%
    unvoiced     6.8%      62.28%     34.66%
    silence      0.88%     33.55%     64.92%

    Total correct classifications = 81.615% +/- 1.13912% (total error = 18.385%);
    a theoretical random classifier would reach 33.33%. Voiced frames classified
    as unvoiced are the "waste"; unvoiced and silence frames classified as voiced
    are the "contamination".
  • 18. Audio Analysis front-end
    • Voiced / Unvoiced / Silence (V/U/S) detection
      • Advantages
        • Only quasi-stationary parts of the speech signal are used
          • Include most of the speaker information in a speech signal
          • Avoids model degradation in LPC/LSP
        • Potentially more robust to different speakers/languages
          • Different languages may have distinct V/U/S stratification
          • Speakers talk differently (e.g. more pauses → more silence frames)
      • Drawbacks
        • May leave too few data points per speech sub-segment
          • Poor estimation of the covariance matrices
            • number of data points (i.e. voiced frames) >= d(d+1)/2
            • d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
            • → nr. of data points per sub-segment >= 55 frames
            • Not always guaranteed!!
        • → use dynamically sized windows
          • Does this really work??
  • 19. Speaker Coarse Segmentation
  • 20. Speaker Coarse Segmentation
    • Divergence Shape
      • Only uses LSP features
        • Assumes Gaussian Distribution
      • Calculated between consecutive sub-segments
    [Figure: speech stream with 4 speech segments] [Campbell97] [Lu2002]
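The formula itself did not survive the transcript; the divergence shape as defined in [Campbell97], for sub-segments modeled by Gaussians with covariances Σx and Σy:

```latex
% Divergence shape: the mean-independent part of the divergence between
% two Gaussians, computed from the LSP covariance matrices only.
D_s = \frac{1}{2}\,\operatorname{tr}\!\left[ (\Sigma_x - \Sigma_y)\,(\Sigma_y^{-1} - \Sigma_x^{-1}) \right]
```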
  • 21. Speaker Coarse Segmentation
    • Dynamic Threshold [Lu2002]
    • Speaker change whenever:
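The slide's equation is an image in the original; a sketch of the rule described in [Lu2002], where the threshold tracks the recent average of the distance values (α and the number N of previous values are the tuning parameters discussed later):

```latex
% Dynamic threshold from the N most recent divergence values:
Th_i = \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} D(i - n)
% Flag a coarse speaker change at sub-segment i when D(i) is a
% local maximum and D(i) > Th_i.
```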
  • 22. Speaker Coarse Segmentation
    • Coarse Segmentation performance
      • Presents a high False Alarm Rate (FAR = Type I errors)
    • Possible solution:
      • Use a Speaker Change Validation strategy
        • Should allow decreasing the FAR…
        • … but must also avoid increasing the Miss Detection Rate (MDR = Type II errors)
  • 23. Speaker Change Validation
  • 24. Speaker Change Validation
    • Bayesian Information Criterion (BIC) (1)
      • Hypothesis 0:
        • A single model θz for the pooled speaker data in segments X and Y
    L0 high → same speaker in segments X and Y; L0 low → different speakers. [Diagram: segments X and Y pooled into Z]
  • 25. Speaker Change Validation
    • Bayesian Information Criterion (BIC) (2)
      • Hypothesis 1:
        • Separate models θx, θy for the speakers in segments X and Y, respectively
    L1 low → same speaker in segments X and Y; L1 high → different speakers. [Diagram: separate models for X and Y]
  • 26. Speaker Change Validation
    • Bayesian Information Criterion (BIC) (3)
      • Log Likelihood Ratio (LLR)
      • However, this is not a fair comparison…
        • The models do not have the same number of parameters!
          • More complex models always fit the data better
            • They should be penalized when compared with simpler models
            • → ΔK = difference in the nr. of parameters between the two hypotheses
     → Need to define a Threshold…
    • No Threshold needed! Or is it!?
  • 27. Speaker Change Validation
    • Bayesian Information Criterion (BIC) (4)
      • Using Gaussian models for θx, θy and θz:
      • Validate Speaker Change Point when:
    • Threshold free!
    … but λ must be set…
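The equation is an image in the original deck; what the bullets describe is the standard ΔBIC test for full-covariance Gaussian models (Chen and Gopalakrishnan's formulation), restated here:

```latex
% N = N_x + N_y frames, d-dimensional features, full covariances:
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma_z|
                   - \frac{N_x}{2}\log|\Sigma_x|
                   - \frac{N_y}{2}\log|\Sigma_y|
                   - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
% Validate the speaker change point when \Delta BIC > 0: no explicit
% threshold, but the penalty weight \lambda must still be set.
```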
  • 28. Speaker Change Validation
    • Bayesian Information Criterion (BIC) (5)
      • BIC needs large amounts of data for good accuracy!
        • Each speech segment only contains 55 data points … too few!
      • Solution:
        • Speaker Model Update…
  • 29. Speaker Model Update
  • 30. Speaker Model Update
    • “Quasi-GMM” speaker modeling [Lu2002]
      • Approximation to GMM (Gaussian Mixture Models)
        • using segmental clustering of Gaussian Models instead of EM
          • Gaussian models incrementally updated with new arriving speaker data
      • Less accurate than a full GMM…
        • … but feasible for real-time operation
  • 31. Speaker Model Update
    • “Quasi-GMM” speaker modeling [Lu2002]
      • Segmental Clustering
        • Start with a single Gaussian component (~GMM1)
        • DO:
          • Update the current component as new speaker data arrives
        • WHILE:
          • the dissimilarity between the model before and after the update remains sufficiently small
        • Otherwise, create a new Gaussian component (GMMn+1)
          • Up to a maximum of 32 components (GMM32)
        • Mixture weight (w_m):
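The weight formula is an image in the original; a natural reading (an assumption on my part, consistent with segmental clustering) is that each component is weighted by the share of frames it was estimated from:

```latex
% n_m = number of frames used to estimate component m, out of M components:
w_m = \frac{n_m}{\sum_{k=1}^{M} n_k}
```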
  • 32. Speaker Model Update
    • “Quasi-GMM” speaker modeling [Lu2002]
      • Gaussian Model on-line updating
    • → μ-dependent terms are discarded [Lu2002]
      • Increases robustness to changes in noise and background sound
      • Similar in effect to Cepstral Mean Subtraction (CMS)
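The update equations are images in the original; a minimal sketch of what an incremental, mean-free covariance update can look like (an assumption, not necessarily [Lu2002]'s exact formulation), with N frames already in the model and M newly arrived frames:

```latex
% Frame-count-weighted pooling of second moments:
\Sigma \;\leftarrow\; \frac{N\,\Sigma + M\,\Sigma_{\mathrm{new}}}{N + M}
```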
  • 33. Speaker Change Validation
  • 34. Speaker Change Validation
    • BIC and Quasi-GMM Speaker Models
      • Validate Speaker Change Point when:
  • 35. Complete System
  • 36. Complete System
  • 37. Experimental Results
    • Speaker Datasets:
      • INESC Porto dataset:
        • Sources:
          • MPEG-7 Content Set CD1 [MPEG.N2467]
          • broadcast news from assorted sources
          • male, female, various languages
        • 43 minutes of speaker audio
          • 16 bit @ 22.05kHz PCM, single-channel
        • Ground Truth
          • 181 speaker changes
          • Manually annotated
          • Speaker segment durations
            • Maximum ~= 120 secs
            • Minimum = 2.25 secs
            • Mean = 19.81 secs
            • Std.Dev. = 27.08 secs
  • 38. Experimental Results
    • Speaker Datasets:
      • TIMIT/AUTH dataset:
        • Sources:
          • TIMIT database
            • 630 English speakers
            • 6300 sentences
        • 56 minutes of speaker audio
          • 16 bit @ 22.05kHz PCM, single-channel
        • Ground Truth
          • 983 speaker changes
          • Manually annotated
          • Speaker segment durations
            • Maximum ~= 12 secs
            • Minimum = 1.139 secs
            • Mean = 3.28 secs
            • Std.Dev. = 1.52 secs
  • 39. Experimental Results
    • Efficiency Measures
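The definitions are images in the original deck; these measures usually carry the following definitions in the speaker-change-detection literature (stated here as an assumption), with GT ground-truth change points, FA false alarms and MD missed detections:

```latex
\mathrm{FAR} = \frac{FA}{GT + FA}, \qquad \mathrm{MDR} = \frac{MD}{GT}
```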
  • 40. Experimental Results
    • System’s Parameters fine-tuning
      • Parameters
        • Dynamic Threshold: α and nr. of previous frames
        • BIC: λ
        • qGMM: mixture creation thresholds
        • Detection Tolerance Interval: set to [-1;+1] secs.
      • Tune the system towards a higher FAR and a lower MDR
        • Missed speaker changes cannot be recovered by subsequent processing
        • False speaker changes will hopefully be discarded by subsequent processing
          • Speaker Tracking module (future work)
            • Merge adjacent segments identified as belonging to the same speaker
  • 41. Experimental Results
    • Dynamic Threshold and BIC parameters ( α and λ )
      • Best results found for α = 0.8 and λ = 0.6
  • 42. Experimental Results
    • INESC Porto dataset evaluation (1)
    INESC System ver.1:
      • Features: LSP
      • Voiced Filter disabled
      • On-line processing (real-time)
      • Uses BIC
    INESC System ver.2:
      • Features: LSP
      • Voiced Filter enabled
      • On-line processing (real-time)
      • Uses BIC
  • 43. Experimental Results
    • TIMIT/AUTH dataset evaluation (1)
    INESC System ver.1:
      • Features: LSP
      • Voiced Filter disabled
      • On-line processing (real-time)
      • Uses BIC
    INESC System ver.2:
      • Features: LSP
      • Voiced Filter enabled
      • On-line processing (real-time)
      • Uses BIC
  • 44. Experimental Results
    • INESC Porto dataset evaluation (2)
    INESC System ver.2:
      • Features: LSP
      • Voiced Filter disabled
      • On-line processing (real-time)
      • Uses BIC
    AUTH System 1:
      • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
      • Multiple-pass (non-real-time)
      • Uses BIC
  • 45. Experimental Results
    • TIMIT/AUTH dataset evaluation (2)
    AUTH System 1:
      • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
      • Multiple-pass (non-real-time)
      • Uses BIC
    INESC System ver.2:
      • Features: LSP
      • Voiced Filter disabled
      • On-line processing (real-time)
      • Uses BIC
  • 46. Experimental Results
    • TIMIT/AUTH dataset evaluation (3)
    AUTH System 2:
      • Features: DFT Mag, STE, AudioWaveformEnvelope, AudioSpectrumCentroid, MFCC
      • Fast system (real-time?)
      • Uses BIC
    INESC System ver.2:
      • Features: LSP
      • Voiced Filter disabled
      • On-line processing (real-time)
      • Uses BIC
  • 47. Experimental Results
    • Time Shifts on the detected Speaker Change Points
      • Detection tolerance interval = [-1, 1] secs
    [Histogram: time shifts of the detected change points, INESC System ver.1]
  • 48. Achievements
    • Software
      • C++ routines
        • Numerical routines
          • Matrix Determinant
          • Polynomial Roots
          • Levinson-Durbin
        • LPC (adapted from Marsyas)
        • LSP
        • Divergence and Bhattacharyya Shape metrics
        • BIC
        • Quasi-GMM modeling class
      • Automatic Speaker Segmentation prototype application
        • As a Library (DLL)
          • Integrated into “4VDO - Annotator”
        • As a stand-alone application
    • Reports
      • VISNET deliverables
        • D29, D30, D31, D40, D41
    • Publications (co-author)
      • “Speaker Change Detection using BIC: A comparison on two datasets”
        • Accepted at ISCCSP 2006
      • “Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches”
        • Submitted to ICME 2006
    DEMO
  • 49. Conclusions
    • Open Issues…
      • Voiced detection procedure
        • Results should improve…
      • Parameter fine-tuning
        • Dynamic Threshold
        • BIC parameter
        • Quasi-GMM Model
    • Further Work
      • Audio Features
        • Evaluate other features for speaker segmentation, tracking and identification
          • Pitch
          • MFCC
      • Speaker Tracking
        • Clustering of speaker segments
        • Evaluation
          • Ground Truth  Needs manual annotation work
      • Speaker Identification
        • Speaker Model Training
        • Evaluation
          • Ground Truth  Needs manual annotation work
  • 50. Contributors
    • INESC Porto
      • Rui Costa
      • Jaime Cardoso
      • Luís Filipe Teixeira
      • Sílvio Macedo
    • VISNET
        • Aristotle University of Thessaloniki (AUTH), Greece
          • Margarita Kotti
          • Emmanouil Benetos
          • Constantine Kotropoulos
  • 51. Thank you!
    • Questions?
    • [email_address]