Speaker Segmentation (2006)
A presentation about a Speaker Segmentation Project developed at INESC Porto back in 2006

Speaker Segmentation (2006): Presentation Transcript

  • Real-time Automatic Speaker Segmentation. Luís Gustavo Martins (UTM – INESC Porto), [email_address], http://www.inescporto.pt/~lmartins . LabMeetings, INESC Porto, March 16, 2006
  • Notice
    • This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit
    • http://creativecommons.org/licenses/by-sa/2.5/pt/
    • or send a letter to
    • Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
  • Summary
    • Summary
      • System Overview
      • Audio Analysis front-end
      • Speaker Coarse Segmentation
      • Speaker Change Validation
      • Speaker Model Update
      • Experimental Results
      • Achievements
      • Conclusions
  • Scope
    • Objective
      • Development of a Real-time, Automatic Speaker Segmentation module
        • Already designed with future development in mind:
          • Speaker Tracking
          • Speaker Identification
    • Challenges
      • No pre-knowledge about the number and identities of speakers
      • On-line and Real-time operation
        • Audio data is not available beforehand
        • Must only use small amounts of arriving speaker data
        • Iterative and computationally intensive methods are therefore infeasible
  • System Overview
  • Audio Analysis front-end
  • Audio Analysis front-end
    • Front-end Processing
      • 8 kHz, 16-bit, pre-emphasized, mono speech streams
      • 25 ms analysis frames with no overlap
      • Speech segments of 2.075 s with 1.4 s overlap
        • Consecutive sub-segments of 1.375 s each
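    As a worked check of these numbers (assuming the 25 ms frames tile the segments exactly): a 2.075 s segment holds 2.075 / 0.025 = 83 frames, each 1.375 s sub-segment holds 1.375 / 0.025 = 55 frames (the per-sub-segment frame budget referred to later), and with 1.4 s of overlap consecutive segments advance by 2.075 - 1.4 = 0.675 s, i.e. 27 frames.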
  • Audio Analysis front-end
    • Feature Extraction (1)
      • Speaker Modeling
        • 10th-order LPC / LSP
          • Source / Filter approach
        • Other possible features…
          • MFCC
          • Pitch
    [Figure: source / filter model of speech production]
  • Audio Analysis front-end
    • LPC Modeling (1) [Rabiner93, Campbell97]
      • Linear Predictive Coding
        • Order p
    Autocorrelation method: Toeplitz autocorrelation matrix, Yule-Walker equations, solved with Durbin’s recursive algorithm
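    A minimal C++ sketch of that pipeline (my own illustration, not the routine adapted from Marsyas mentioned in the Achievements slide): autocorrelation of one analysis frame, then the Yule-Walker system solved with the Levinson-Durbin recursion.

      #include <cstddef>
      #include <vector>

      // Returns A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p for one (windowed,
      // non-silent) 25 ms frame, using the autocorrelation method.
      std::vector<double> lpcAutocorrelation(const std::vector<double>& frame, int p)
      {
          // Autocorrelation r[0..p] of the frame.
          std::vector<double> r(p + 1, 0.0);
          for (int k = 0; k <= p; ++k)
              for (std::size_t n = k; n < frame.size(); ++n)
                  r[k] += frame[n] * frame[n - k];

          // Levinson-Durbin recursion over the Toeplitz system.
          std::vector<double> a(p + 1, 0.0);
          a[0] = 1.0;
          double err = r[0];                    // prediction error power
          for (int i = 1; i <= p; ++i) {
              double acc = r[i];
              for (int j = 1; j < i; ++j)
                  acc += a[j] * r[i - j];
              const double k_i = -acc / err;    // reflection coefficient
              std::vector<double> prev(a);      // copy needed for the symmetric update
              for (int j = 1; j < i; ++j)
                  a[j] = prev[j] + k_i * prev[i - j];
              a[i] = k_i;
              err *= (1.0 - k_i * k_i);
          }
          return a;
      }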
  • Audio Analysis front-end
    • LPC Modeling (2)
    [Figure: FFT spectrum vs. smooth LPC spectrum; whitening-filter residual → pitch]
  • Audio Analysis front-end
    • LSP Modeling [Campbell97]
      • Line Spectral Pairs
        • More robust to quantization, which is why they are commonly used in speech coding
      • Derived from the LPC a_k coefficients (the standard construction is sketched at the end of this slide)
        • Zeros of A(z) mapped to the unit circle in the Z-Domain
        • Use of a pair of (p+1)-order polynomials
    • Speaker Modeling
      • Speaker information is mostly contained in the voiced part of the speech signal…
        • Can you identify Who’s speaking?
      • LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
        • Unvoiced/Silence data degrades speaker model accuracy!
          • → Select only voiced data for processing…
    [Figure: speech waveform with unvoiced and voiced frames marked]
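    For reference, the textbook construction behind the LSP bullets above (standard definition, not copied from the slides): the LPC inverse filter A(z) is split into a symmetric and an antisymmetric (p+1)-order polynomial whose roots all lie on the unit circle and interlace; the line spectral frequencies are the angles of those roots (hence the polynomial-roots routine listed under Achievements).

      P(z) = A(z) + z^{-(p+1)} A(z^{-1})
      Q(z) = A(z) - z^{-(p+1)} A(z^{-1})
      \{\omega_i\} : P(e^{j\omega_i}) = 0 \ \text{or}\ Q(e^{j\omega_i}) = 0, \quad 0 < \omega_1 < \dots < \omega_p < \pi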
  • Audio Analysis front-end
    • Voiced / Unvoiced / Silence (V/U/S) detection
      • Feature Extraction (2)
        • Short Time Energy (STE)
          • → silence detection
        • Zero Crossing Rate (ZCR)
          • → voiced / unvoiced detection
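    A minimal C++ sketch of these two features over one 25 ms frame (an illustration under my own conventions, not the project's code):

      #include <cstddef>
      #include <vector>

      struct FrameFeatures { double ste; double zcr; };

      FrameFeatures steZcr(const std::vector<double>& frame)
      {
          FrameFeatures f{0.0, 0.0};
          if (frame.empty()) return f;

          // Short Time Energy: mean squared amplitude (low for silence).
          for (double x : frame) f.ste += x * x;
          f.ste /= static_cast<double>(frame.size());

          if (frame.size() < 2) return f;

          // Zero Crossing Rate: fraction of sign changes between consecutive
          // samples (high for noise-like unvoiced speech, low for voiced speech).
          std::size_t crossings = 0;
          for (std::size_t n = 1; n < frame.size(); ++n)
              if ((frame[n] >= 0.0) != (frame[n - 1] >= 0.0)) ++crossings;
          f.zcr = static_cast<double>(crossings) / static_cast<double>(frame.size() - 1);

          return f;
      }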
  • Audio Analysis front-end
    • V/U/S speech classes
      • Each class modeled by a 2-D Gaussian distribution over (ZCR, STE)
        • Simple and fast → real-time operation
      • Dataset:
        • ~4 minutes of manually annotated speech signals
        • 2 male and 2 female Portuguese speakers
    [Figure: 2-D (ZCR, STE) feature space with the voiced, unvoiced and silence Gaussian models]
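    A hedged C++ sketch of the decision rule implied above: each class is a 2-D Gaussian over (ZCR, STE) and a frame is assigned to the class with the highest log-likelihood. The class means and covariances would come from the annotated dataset; nothing here uses the actual fitted values.

      #include <array>
      #include <cmath>

      struct Gaussian2D {
          double mu[2];       // mean of (zcr, ste)
          double cov[2][2];   // covariance matrix
      };

      double logLikelihood(const Gaussian2D& g, double zcr, double ste)
      {
          const double PI = 3.14159265358979323846;
          const double dx = zcr - g.mu[0], dy = ste - g.mu[1];
          const double det = g.cov[0][0] * g.cov[1][1] - g.cov[0][1] * g.cov[1][0];
          // Mahalanobis distance via the inverse of the 2x2 covariance.
          const double ix = ( g.cov[1][1] * dx - g.cov[0][1] * dy) / det;
          const double iy = (-g.cov[1][0] * dx + g.cov[0][0] * dy) / det;
          const double mahal = dx * ix + dy * iy;
          return -0.5 * (mahal + std::log(det) + 2.0 * std::log(2.0 * PI));
      }

      // classes[0..2] = voiced, unvoiced, silence models fitted on annotated speech.
      int classifyVUS(const std::array<Gaussian2D, 3>& classes, double zcr, double ste)
      {
          int best = 0;
          double bestLL = logLikelihood(classes[0], zcr, ste);
          for (int c = 1; c < 3; ++c) {
              const double ll = logLikelihood(classes[c], zcr, ste);
              if (ll > bestLL) { bestLL = ll; best = c; }
          }
          return best;   // 0 = voiced, 1 = unvoiced, 2 = silence
      }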
  • Audio Analysis front-end
    • Manual Annotation of V/U/S segments in a speech signal
  • Audio Analysis front-end
    • V/U/S Speech dataset
      • Voiced / Unvoiced / Silence stratification in manually segmented audio files
    Speaker                Voiced         Unvoiced        Silence        V/(V+U)   U/(V+U)
    Portuguese Male 1      37 s (62%)     12 s (20%)      10 s (17%)     76%       24%
    Portuguese Male 2      30 s (50.0%)   14 s (23.3%)    13 s (21.6%)   68%       32%
    Portuguese Female 1    32 s (53.3%)   17 s (28.3%)    10 s (17%)     65.3%     34.7%
    Portuguese Female 2    30 s (50.0%)   19 s (31.6%)    10 s (17%)     61.2%     38.7%
    (Total time per speaker = 60 s)
  • Audio Analysis front-end
    • Automatic Classification of V/U/S speech frames:
      • 10-fold Cross-Validation
        • Confusion matrix:
      • Some voiced frames are being discarded as unvoiced …
        • Waste of relevant and scarce data…
      • A few unvoiced and silence frames are being misclassified as voiced
        • Contamination of the data to be analyzed
    Confusion matrix (columns: true class; rows: classified as):

                    voiced     unvoiced    silence
    voiced          92.32 %     4.17 %      0.41 %
    unvoiced         6.80 %    62.28 %     34.66 %
    silence          0.88 %    33.55 %     64.92 %

    voiced → unvoiced: waste of scarce voiced data
    unvoiced / silence → voiced: contamination of the analyzed data
    Total correct classifications = 81.615 % ± 1.139 %
    Total classification error = 18.385 %
    (a theoretical random classifier scores 33.33 % correct)
  • Audio Analysis front-end
    • Voiced / Unvoiced / Silence (V/U/S) detection
      • Advantages
        • Only quasi-stationary parts of the speech signal are used
          • Include most of the speaker information in a speech signal
          • Avoids model degradation in LPC/LSP
        • Potentially more robust to different speakers/languages
          • Different languages may have distinct V/U/S stratification
          • Speakers talk differently (e.g. more pauses → more silence frames)
      • Drawbacks
        • May leave too few data points per speech sub-segment
          • Poorly estimated covariance matrices
            • number of data points (i.e. voiced frames) >= d(d+1)/2
            • d=dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
            • nr. data points / sub-segment >= 55 frames
            • Not always guaranteed!!
        • → use of dynamically sized windows
          • Does this really work??
  • Speaker Coarse Segmentation
  • Speaker Coarse Segmentation
    • Divergence Shape
      • Only uses LSP features
        • Assumes Gaussian Distribution
      • Calculated between consecutive sub-segments
    [Figure: speech stream divided into 4 speech segments] [Campbell97] [Lu2002]
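    The formula image did not survive the transcript; the divergence shape as defined in [Campbell97], presumably what is computed here between the LSP Gaussians of consecutive sub-segments, is

      D_s(X, Y) = \frac{1}{2}\,\mathrm{tr}\!\left[ (C_X - C_Y)\,(C_Y^{-1} - C_X^{-1}) \right]

    i.e. the full Gaussian divergence with the mean-dependent term dropped, which matches the later remark that μ-dependent terms are discarded for robustness to background conditions.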
  • Speaker Coarse Segmentation
    • Dynamic Threshold [Lu2002]
    • Speaker change whenever:
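    The threshold expression itself is missing from the transcript; in [Lu2002] it is, roughly, a scaled running average of the N previous inter-segment distances, so a reconstruction under that assumption is

      Th_i = \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} D(i - n)

    with a coarse speaker-change candidate declared at sub-segment boundary i when D(i) > Th_i (typically also requiring D(i) to be a local peak).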
  • Speaker Coarse Segmentation
    • Coarse Segmentation performance
      • Presents high False Alarm Rate (FAR = Type I errors)
    • Possible solution:
      • Use a Speaker Validation Strategy
        • Should allow decreasing FAR…
        • … but should also avoid an increase in Missed Detections (MDR = Type II errors)
  • Speaker Change Validation
  • Speaker Change Validation
    • Bayesian Information Criterion (BIC) (1)
      • Hypothesis 0:
        • A single model θ_z for the speaker data in segments X and Y
    [Figure: segments X and Y merged into Z; likelihood L_0 of the single model for the same-speaker vs. different-speaker cases]
  • Speaker Change Validation
    • Bayesian Information Criterion (BIC) (2)
      • Hypothesis 1:
        • Separate models θ_x, θ_y for the speakers in segments X and Y, respectively
    [Figure: segments X and Y with separate models; likelihood L_1 for the same-speaker vs. different-speaker cases]
  • Speaker Change Validation
    • Bayesian Information Criterion (BIC) (3)
      • Log Likelihood Ratio (LLR)
      • However, this is not a fair comparison…
        • The models do not have the same number of parameters!
          • More complex models always fit the data better
            • They should be penalized when compared with simpler models
            • → ΔK = difference in the number of parameters between the two hypotheses
      → Need to define a Threshold…
    • No Threshold needed! Or is it!?
  • Speaker Change Validation
    • Bayesian Information Criterion (BIC) (4)
      • Using Gaussian models for θ_x, θ_y and θ_z:
      • Validate Speaker Change Point when:
    • Threshold Free!
    … but λ must be set…
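    The criterion image is lost from the transcript; the standard full-covariance Gaussian form of this penalized comparison (the ΔBIC of Chen & Gopalakrishnan, which the λ above suggests is what is used here) is

      \Delta BIC = \frac{N_z}{2}\log|\Sigma_z| - \frac{N_x}{2}\log|\Sigma_x| - \frac{N_y}{2}\log|\Sigma_y| - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N_z

    where N_x and N_y are the numbers of feature vectors in X and Y, N_z = N_x + N_y, and d is the feature dimension; the change point between X and Y is validated when ΔBIC > 0, so the only tunable quantity is the penalty weight λ.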
  • Speaker Change Validation
    • Bayesian Information Criterion (BIC) (5)
      • BIC needs large amounts of data for good accuracy!
        • Each speech segment only contains 55 data points … too few!
      • Solution:
        • Speaker Model Update…
  • Speaker Model Update
  • Speaker Model Update
    • “ Quasi-GMM” speaker modeling [Lu2002]
      • Approximation to GMM (Gaussian Mixture Models)
        • using segmental clustering of Gaussian Models instead of EM
          • Gaussian models incrementally updated with new arriving speaker data
      • less accurate than GMM…
        • … but feasible for real-time operation
  • Speaker Model Update
    • “ Quasi-GMM” speaker modeling [Lu2002]
      • Segmental Clustering (see the sketch after this slide)
        • Start with one Gaussian mixture (~GMM1)
        • DO:
          • Update the current mixture as new speaker data is received
          • WHILE:
            • the dissimilarity between the mixture model before and after the update remains sufficiently small
        • Otherwise, create a new Gaussian mixture (GMMn+1)
          • Up to a maximum of 32 mixtures (GMM32)
        • Mixture Weight ( w m ):
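    A rough C++ sketch of the bookkeeping this implies (my reading of the segmental clustering above, not Lu2002's exact update equations): the current Gaussian component is updated with each new voiced sub-segment, and when the model moves too much it is frozen and a new component is started, up to 32. For brevity the sketch keeps only running means and uses the mean shift as a stand-in dissimilarity; the system itself compares full Gaussian models. The weight w_m is assumed to be n_m / total frames.

      #include <cmath>
      #include <cstddef>
      #include <vector>

      struct GaussComp {
          long n = 0;                       // frames absorbed by this component
          std::vector<double> sum;          // running sum of LSP vectors

          std::vector<double> mean() const {
              std::vector<double> m(sum.size(), 0.0);
              for (std::size_t d = 0; d < sum.size(); ++d) m[d] = n ? sum[d] / n : 0.0;
              return m;
          }
          void add(const std::vector<double>& x) {
              if (sum.empty()) sum.assign(x.size(), 0.0);
              for (std::size_t d = 0; d < x.size(); ++d) sum[d] += x[d];
              ++n;
          }
      };

      struct QuasiGMM {
          std::vector<GaussComp> comps;     // at most 32 components
          long totalFrames = 0;
          double weight(std::size_t m) const {          // w_m = n_m / N (assumed)
              return totalFrames ? double(comps[m].n) / double(totalFrames) : 0.0;
          }
      };

      void addSubSegment(QuasiGMM& model,
                         const std::vector<std::vector<double>>& lspFrames,
                         double maxShift, std::size_t maxComps = 32)
      {
          if (model.comps.empty()) model.comps.emplace_back();
          GaussComp& cur = model.comps.back();

          const std::vector<double> before = cur.mean();
          for (const auto& x : lspFrames) cur.add(x);   // incremental update
          model.totalFrames += static_cast<long>(lspFrames.size());

          // Euclidean shift of the mean as a stand-in dissimilarity measure.
          double shift = 0.0;
          const std::vector<double> after = cur.mean();
          for (std::size_t d = 0; d < after.size() && d < before.size(); ++d)
              shift += (after[d] - before[d]) * (after[d] - before[d]);

          if (std::sqrt(shift) > maxShift && model.comps.size() < maxComps)
              model.comps.emplace_back();   // freeze the current component, start a new one
      }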
  • Speaker Model Update
    • “ Quasi-GMM” speaker modeling [Lu2002]
      • Gaussian Model on-line updating
    • → μ-dependent terms are discarded [Lu2002]
      • Increases robustness to changes in noise and background sound
      • ~ Cepstral Mean Subtraction (CMS)
  • Speaker Change Validation
  • Speaker Change Validation
    • BIC and Quasi-GMM Speaker Models
      • Validate Speaker Change Point when:
  • Complete System
  • Complete System
  • Experimental Results
    • Speaker Datasets:
      • INESC Porto dataset:
        • Sources:
          • MPEG-7 Content Set CD1 [MPEG.N2467]
          • broadcast news from assorted sources
          • male, female, various languages
        • 43 minutes of speaker audio
          • 16 bit @ 22.05kHz PCM, single-channel
        • Ground Truth
          • 181 speaker changes
          • Manually annotated
          • Speaker segments durations
            • Maximum ~= 120 secs
            • Minimum = 2.25 secs
            • Mean = 19.81 secs
            • Std.Dev. = 27.08 secs
  • Experimental Results
    • Speaker Datasets:
      • TIMIT/AUTH dataset:
        • Sources:
          • TIMIT database
            • 630 English speakers
            • 6300 sentences
        • 56 minutes of speaker audio
          • 16 bit @ 22.05kHz PCM, single-channel
        • Ground Truth
          • 983 speaker changes
          • Manually annotated
          • Speaker segments durations
            • Maximum ~= 12 secs
            • Minimum = 1.139 secs
            • Mean = 3.28 secs
            • Std.Dev. = 1.52 secs
  • Experimental Results
    • Efficiency Measures
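    The measures table did not survive the transcript; a common set of definitions in this line of work (an assumption here, not read off the slide) counts GT actual speaker changes, FA false alarms, MD missed detections and CFC correctly found changes, and reports

      \mathrm{FAR} = \frac{FA}{FA + GT}, \qquad \mathrm{MDR} = \frac{MD}{GT}

    together with precision PRC = CFC/(CFC+FA), recall RCL = CFC/(CFC+MD) and F1 = 2·PRC·RCL/(PRC+RCL).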
  • Experimental Results
    • Fine-tuning of the system’s parameters
      • Parameters
        • Dynamic Threshold: α and the number of previous frames
        • BIC: λ
        • qGMM: mixture creation thresholds
        • Detection tolerance interval: set to [-1, +1] s
      • Tune the system towards higher FAR and lower MDR
        • Missed speaker changes cannot be recovered by subsequent processing
        • False speaker changes will hopefully be discarded by subsequent processing
          • Speaker Tracking module (future work)
            • Merge adjacent segments identified as belonging to the same speaker
  • Experimental Results
    • Dynamic Threshold and BIC parameters ( α and λ )
      • Best Results found for: α = 0.8 λ = 0.6
  • Experimental Results
    • INESC Porto dataset evaluation (1)
    INESC System ver.1
    • Features:
      • LSP
    • Voiced Filter disabled
    • On-line processing (realtime)
    • Uses BIC
    INESC System ver.2
    • Features:
      • LSP
    • Voiced Filter enabled
    • On-line processing (realtime)
    • Uses BIC
  • Experimental Results
    • TIMIT/AUTH dataset evaluation (1)
    INESC System ver.1
    • Features:
      • LSP
    • Voiced Filter disabled
    • On-line processing (realtime)
    • Uses BIC
    INESC System ver.2
    • Features:
      • LSP
    • Voiced Filter enabled
    • On-line processing (realtime)
    • Uses BIC
  • Experimental Results
    • INESC Porto dataset evaluation (2)
    INESC System ver.2
    • Features:
      • LSP
    • Voiced Filter disabled
    • On-line processing (realtime)
    • Uses BIC
    AUTH System 1
    • Features:
      • AudioSpectrumCentroid
      • AudioWaveformEnvelope
    • Multiple-pass (non-realtime)
    • Uses BIC
  • Experimental Results
    • TIMIT/AUTH dataset evaluation (2)
    AUTH System 1
    • Features:
      • AudioSpectrumCentroid
      • AudioWaveformEnvelope
    • Multiple-pass (non-realtime)
    • Uses BIC
    INESC System ver.2
    • Features:
      • LSP
    • Voiced Filter disabled
    • On-line processing (realtime)
    • Uses BIC
  • Experimental Results
    • TIMIT/AUTH dataset evaluation (3)
    AUTH System 2
    • Features:
      • DFT Mag
      • STE
      • AudioWaveformEnvelope
      • AudioSpectrumCentroid
      • MFCC
    • Fast system (realtime?)
    • Uses BIC
    INESC System ver.2
    • Features:
      • LSP
    • Voiced Filter disabled
    • On-line processing (realtime)
    • Uses BIC
  • Experimental Results
    • Time Shifts on the detected Speaker Change Points
      • Detection tolerance interval = [-1, 1] secs
    [Figure: time shifts of the detected speaker change points, INESC System ver.1]
  • Achievements
    • Software
      • C++ routines
        • Numerical routines
          • Matrix Determinant
          • Polynomial Roots
          • Levinson-Durbin
        • LPC (adapted from Marsyas)
        • LSP
        • Divergence and Bhattacharyya Shape metrics
        • BIC
        • Quasi-GMM modeling class
      • Automatic Speaker Segment prototype application
        • As a Library (DLL)
          • Integrated into “4VDO - Annotator”
        • As a stand-alone application
    • Reports
      • VISNET deliverables
        • D29, D30, D31, D40, D41
    • Publications (co-author)
      • “Speaker Change Detection using BIC: A comparison on two datasets”
        • Accepted at ISCCSP 2006
      • “Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches”
        • Submitted to ICME 2006
    DEMO
  • Conclusions
    • Open Issues…
      • Voiced detection procedure
        • Results should improve…
      • Parameter fine-tuning
        • Dynamic Threshold
        • BIC parameter
        • Quasi-GMM Model
    • Further Work
      • Audio Features
        • Evaluate other features for speaker segmentation, tracking and identification
          • Pitch
          • MFCC
      • Speaker Tracking
        • Clustering of speaker segments
        • Evaluation
          • Ground Truth → needs manual annotation work
      • Speaker Identification
        • Speaker Model Training
        • Evaluation
          • Ground Truth → needs manual annotation work
  • Contributors
    • INESC Porto
      • Rui Costa
      • Jaime Cardoso
      • Luís Filipe Teixeira
      • Sílvio Macedo
    • VISNET
      • Aristotle University of Thessaloniki (AUTH), Greece
        • Margarita Kotti
        • Emmanouil Benetos
        • Constantine Kotropoulos
  • Thank you!
    • Questions?
    • [email_address]