Speaker Segmentation (2006)
A presentation about a Speaker Segmentation Project developed at INESC Porto back in 2006

1. Real-time Automatic Speaker Segmentation
   Luís Gustavo Martins
   UTM – INESC Porto
   [email_address]
   http://www.inescporto.pt/~lmartins
   LabMeetings, March 16, 2006, INESC Porto
2. Notice
   - This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
3. Summary
   - System Overview
   - Audio Analysis front-end
   - Speaker Coarse Segmentation
   - Speaker Change Validation
   - Speaker Model Update
   - Experimental Results
   - Achievements
   - Conclusions
4. Scope
   - Objective
     - Development of a real-time, automatic Speaker Segmentation module
       - Designed with future development already in mind:
         - Speaker Tracking
         - Speaker Identification
   - Challenges
     - No prior knowledge about the number or identities of the speakers
     - On-line, real-time operation
       - Audio data is not available beforehand
       - Must use only small amounts of arriving speaker data
       - Iterative and computationally intensive methods are unfeasible
5. System Overview
6. Audio Analysis front-end
7. Audio Analysis front-end
   - Front-end Processing
     - 8 kHz, 16-bit, pre-emphasized, mono speech streams
     - 25 ms analysis frames with no overlap
     - Speech segments of 2.075 s with 1.4 s overlap
       - Consecutive sub-segments of 1.375 s each
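A minimal sketch of this framing scheme in plain NumPy; the function names and array layout are illustrative assumptions, not the project's C++ code:

```python
import numpy as np

def split_frames(signal, sr=8000, frame_ms=25):
    """Split a speech stream into non-overlapping 25 ms analysis
    frames (200 samples per frame at 8 kHz), as described above."""
    n = int(sr * frame_ms / 1000)            # 200 samples per frame
    usable = len(signal) - len(signal) % n   # drop the ragged tail
    return signal[:usable].reshape(-1, n)

def split_segments(signal, sr=8000, seg_s=2.075, overlap_s=1.4):
    """Slice the stream into 2.075 s speech segments with 1.4 s
    overlap, i.e. a hop of 0.675 s."""
    seg, hop = int(seg_s * sr), int((seg_s - overlap_s) * sr)
    return [signal[i:i + seg] for i in range(0, len(signal) - seg + 1, hop)]
```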
8. Audio Analysis front-end
   - Feature Extraction (1)
     - Speaker Modeling
       - 10th-order LPC / LSP
         - Source/Filter approach
       - Other possible features:
         - MFCC
         - Pitch
         - …
   [Diagram: source-filter model of speech production]
9. Audio Analysis front-end
   - LPC Modeling (1) [Rabiner93, Campbell97]
     - Linear Predictive Coding, order p
     - Autocorrelation method: Yule-Walker equations with a Toeplitz autocorrelation matrix, solved by Durbin's recursive algorithm
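A sketch of the textbook algorithm named on this slide (autocorrelation method plus Levinson-Durbin recursion), assuming plain NumPy; `lpc_autocorr` is a hypothetical name, not the project's original routine:

```python
import numpy as np

def lpc_autocorr(frame, p=10):
    """Order-p LPC by the autocorrelation method; assumes a
    windowed frame with nonzero energy."""
    # Autocorrelation lags 0..p (the Toeplitz system's first column)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                  # prediction-error update
    return a, err   # A(z) = 1 + a1*z^-1 + ... + ap*z^-p
```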
10. Audio Analysis front-end
    - LPC Modeling (2)
    [Figure: FFT spectrum vs. LPC spectrum; whitening filter output and pitch]
11. Audio Analysis front-end
    - LSP Modeling [Campbell97]
      - Line Spectral Pairs
        - More robust to quantization, hence commonly used in speech coding
      - Derived from the LPC a_k coefficients
        - Zeros of A(z) mapped to the unit circle in the Z-domain
        - Uses a pair of (p+1)-order polynomials
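A sketch of the LPC-to-LSP construction described above, under the standard definition of the sum/difference polynomials (root-finding via `np.roots` is an implementation assumption):

```python
import numpy as np

def lpc_to_lsp(a):
    """Map LPC coefficients (a[0] == 1, order p) to Line Spectral
    Pair frequencies via the (p+1)-order polynomials P and Q,
    whose zeros lie on the unit circle."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]   # P(z) = A(z) + z^-(p+1) A(1/z)
    Q = a_ext - a_ext[::-1]   # Q(z) = A(z) - z^-(p+1) A(1/z)
    freqs = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one angle of each conjugate pair; drop the fixed
        # roots at z = 1 and z = -1
        freqs.extend(w for w in ang if 0.0 < w < np.pi)
    return np.sort(np.array(freqs))   # p interlaced frequencies in (0, pi)
```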
12. Audio Analysis front-end
    - Speaker Modeling
      - Speaker information is mostly contained in the voiced part of the speech signal…
        - Can you identify who's speaking?
      - LPC/LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
        - Unvoiced/silence data degrades speaker model accuracy!
          - → Select only voiced data for processing…
    [Audio examples: unvoiced speech frames vs. voiced speech frames]
13. Audio Analysis front-end
    - Voiced / Unvoiced / Silence (V/U/S) detection
      - Feature Extraction (2)
        - Short Time Energy (STE) → silence detection
        - Zero Crossing Rate (ZCR) → voiced/unvoiced detection
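Both features are one-liners per frame; a minimal sketch (function names are assumptions):

```python
import numpy as np

def short_time_energy(frame):
    # Mean squared amplitude; low STE flags silence frames
    return float(np.mean(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs with a sign change;
    # high ZCR suggests unvoiced (noise-like) speech, low ZCR voiced
    s = np.sign(frame)
    return float(np.mean(s[:-1] != s[1:]))
```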
14. Audio Analysis front-end
    - V/U/S speech classes modeled by Gaussian distributions
      - Modeled by 2-D Gaussian distributions (over STE and ZCR)
        - Simple and fast → real-time operation
      - Dataset:
        - ~4 minutes of manually annotated speech signals
        - 2 male and 2 female Portuguese speakers
    [Figure: STE vs. ZCR scatter with voiced, unvoiced and silence clusters]
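A minimal sketch of such a classifier: one full-covariance 2-D Gaussian per class, fitted to annotated (STE, ZCR) pairs, with a maximum-likelihood decision. The class name and the equal-priors assumption are mine, not from the slides:

```python
import numpy as np

class GaussianVUS:
    """Per-class 2-D Gaussian classifier over (STE, ZCR) features."""

    def fit(self, X, y):
        # X: (n, 2) feature matrix; y: class labels ('V', 'U', 'S')
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            cov = np.cov(Xc, rowvar=False)
            self.params[c] = (Xc.mean(axis=0),
                              np.linalg.inv(cov),
                              np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        # Smallest Mahalanobis distance plus log-determinant term
        # (maximum likelihood under equal class priors)
        scores = []
        for c in self.classes:
            mu, icov, logdet = self.params[c]
            d = X - mu
            scores.append(np.einsum('ij,jk,ik->i', d, icov, d) + logdet)
        return self.classes[np.argmin(scores, axis=0)]
```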
15. Audio Analysis front-end
    - Manual annotation of V/U/S segments in a speech signal
16. Audio Analysis front-end
    - V/U/S Speech dataset
      - Voiced / Unvoiced / Silence stratification in the manually segmented audio files (60 s per speaker):

      Speaker               Voiced         Unvoiced       Silence        V/(V+U)   U/(V+U)
      Portuguese Male 1     37 s (62%)     12 s (20%)     10 s (17%)     76%       24%
      Portuguese Male 2     30 s (50.0%)   14 s (23.3%)   13 s (21.6%)   68%       32%
      Portuguese Female 1   32 s (53.3%)   17 s (28.3%)   10 s (17%)     65.3%     34.7%
      Portuguese Female 2   30 s (50.0%)   19 s (31.6%)   10 s (17%)     61.2%     38.7%
17. Audio Analysis front-end
    - Automatic classification of V/U/S speech frames:
      - 10-fold cross-validation
        - Confusion matrix (columns: true class; rows: classified as):

                        Voiced    Unvoiced   Silence
          Voiced        92.32%     4.17%      0.41%
          Unvoiced       6.80%    62.28%     34.66%
          Silence        0.88%    33.55%     64.92%

        - Total correct classifications = 81.615 ± 1.139% (total error = 18.385%)
        - (A theoretical random classifier would score 33.33% correct)
      - Some voiced frames are being discarded as unvoiced → waste of relevant and scarce data
      - A few unvoiced and silence frames are misclassified as voiced → contamination of the data to be analyzed
18. Audio Analysis front-end
    - Voiced / Unvoiced / Silence (V/U/S) detection
      - Advantages
        - Only the quasi-stationary parts of the speech signal are used
          - They carry most of the speaker information in a speech signal
          - Avoids model degradation in LPC/LSP
        - Potentially more robust across different speakers and languages
          - Different languages may have distinct V/U/S stratification
          - Speakers talk differently (e.g. more paused → more silence frames)
      - Drawbacks
        - May leave few data points per speech sub-segment
          - Ill-estimation of the covariance matrices
            - Requires number of data points (i.e. voiced frames) >= d(d+1)/2
            - d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
            - → at least 55 voiced frames per sub-segment
            - Not always guaranteed!
          - → use of dynamically sized windows
            - Does this really work?
19. Speaker Coarse Segmentation
20. Speaker Coarse Segmentation
    - Divergence Shape [Campbell97] [Lu2002]
      - Uses only the LSP features
        - Assumes a Gaussian distribution
      - Calculated between consecutive sub-segments
    [Figure: speech stream with 4 speech segments]
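The divergence-shape distance between two Gaussian models uses only their covariances (the mean term of the full divergence is dropped), following the form given in [Campbell97]; a sketch:

```python
import numpy as np

def divergence_shape(cov1, cov2):
    """Divergence shape between two Gaussian models of the LSP
    features: 0.5 * tr[(C1 - C2)(inv(C2) - inv(C1))]."""
    diff = cov1 - cov2
    return 0.5 * np.trace(diff @ (np.linalg.inv(cov2) - np.linalg.inv(cov1)))
```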
21. Speaker Coarse Segmentation
    - Dynamic Threshold [Lu2002]
    - A speaker change is flagged whenever the inter-segment distance exceeds the dynamic threshold (formula shown on slide); a sketch of the rule follows below.
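The slide's formula is rendered as an image; the sketch below follows the common form of the rule in [Lu2002], with the threshold computed from the average of the N previous distances. The default α = 0.8 is taken from slide 41; `n_prev` is an assumption:

```python
import numpy as np

def is_speaker_change(distances, alpha=0.8, n_prev=10):
    """Flag a change when the newest inter-segment divergence-shape
    distance exceeds alpha times the mean of the N previous ones."""
    if len(distances) < n_prev + 1:
        return False
    current = distances[-1]
    threshold = alpha * np.mean(distances[-n_prev - 1:-1])
    return current > threshold
```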
22. Speaker Coarse Segmentation
    - Coarse segmentation performance
      - Presents a high False Alarm Rate (FAR = Type I errors)
    - Possible solution:
      - Use a speaker change validation strategy
        - Should allow decreasing the FAR…
        - … but should also avoid an increase in Missed Detections (MDR = Type II errors)
23. Speaker Change Validation
24. Speaker Change Validation
    - Bayesian Information Criterion (BIC) (1)
      - Hypothesis 0: a single model θz for the speaker data in segments X and Y
        - High likelihood L0 → same speaker in segments X and Y
        - Low likelihood L0 → different speakers in segments X and Y
    [Diagram: segments X and Y pooled into segment Z]
25. Speaker Change Validation
    - Bayesian Information Criterion (BIC) (2)
      - Hypothesis 1: separate models θx, θy for the speakers in segments X and Y, respectively
        - Low likelihood L1 → same speaker in segments X and Y
        - High likelihood L1 → different speakers in segments X and Y
    [Diagram: segments X and Y modeled separately]
26. Speaker Change Validation
    - Bayesian Information Criterion (BIC) (3)
      - Log Likelihood Ratio (LLR)
      - However, this is not a fair comparison…
        - The models do not have the same number of parameters!
          - More complex models always fit the data better
            - They should be penalized when compared against simpler models
            - ΔK = difference in the number of parameters between the two hypotheses
      - With the plain LLR → need to define a threshold…
      - With the BIC penalty → no threshold needed! Or is it!?
27. Speaker Change Validation
    - Bayesian Information Criterion (BIC) (4)
      - Using Gaussian models for θx, θy and θz (formulas on slide)
      - Validate a speaker change point when the BIC criterion is met
    - Threshold free!
      - … but λ must be set…
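The validation formula on the slide is an image; this sketch follows the standard full-covariance Gaussian ΔBIC, where a positive value supports a speaker change. The default λ = 0.6 is taken from slide 41:

```python
import numpy as np

def delta_bic(X, Y, lam=0.6):
    """BIC test for a change point between segments X and Y
    (rows are LSP feature vectors). Positive => validate change."""
    Z = np.vstack([X, Y])
    nx, ny, n = len(X), len(Y), len(X) + len(Y)
    d = X.shape[1]
    logdet = lambda S: np.linalg.slogdet(np.cov(S, rowvar=False))[1]
    # Log-likelihood gain of two Gaussian models over one
    gain = 0.5 * (n * logdet(Z) - nx * logdet(X) - ny * logdet(Y))
    # Complexity penalty: the extra mean + covariance parameters
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty
```

A coarse change point from slide 21 would thus be validated when `delta_bic(X, Y) > 0`.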
28. Speaker Change Validation
    - Bayesian Information Criterion (BIC) (5)
      - BIC needs large amounts of data for good accuracy!
        - Each speech segment only contains 55 data points… too few!
      - Solution:
        - Speaker Model Update…
29. Speaker Model Update
30. Speaker Model Update
    - "Quasi-GMM" speaker modeling [Lu2002]
      - An approximation to GMMs (Gaussian Mixture Models)
        - Uses segmental clustering of Gaussian models instead of EM
          - Gaussian models are incrementally updated with newly arriving speaker data
      - Less accurate than a true GMM…
        - … but feasible for real-time operation
31. Speaker Model Update
    - "Quasi-GMM" speaker modeling [Lu2002]
      - Segmental clustering (see the sketch after slide 32):
        - Start with one Gaussian component (~GMM1)
        - DO:
          - Update the component as speaker data is received
        - WHILE:
          - the dissimilarity between the model before and after the update is sufficiently small
        - Create a new Gaussian component (GMMn+1)
          - Up to a maximum of 32 components (GMM32)
        - Mixture weight (w_m): formula on slide
32. Speaker Model Update
    - "Quasi-GMM" speaker modeling [Lu2002]
      - Gaussian model on-line updating (update formulas on slide)
    - μ-dependent terms are discarded [Lu2002]
      - Increases robustness to changes in noise and background sound
      - ~ Cepstral Mean Subtraction (CMS)
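A minimal sketch tying slides 31 and 32 together: components grown by segmental clustering of arriving voiced frames, with only covariances kept (the μ-dependent terms are dropped, per this slide). The dissimilarity measure, the split threshold and the pooled update formula here are illustrative assumptions, not the exact expressions from [Lu2002]:

```python
import numpy as np

class QuasiGMM:
    """Covariance-only quasi-GMM speaker model, grown incrementally."""

    def __init__(self, max_mix=32, split_thresh=1.0):
        self.mixtures = []          # list of (frame_count, covariance)
        self.max_mix = max_mix
        self.split_thresh = split_thresh

    def update(self, frames):
        n, cov = len(frames), np.cov(frames, rowvar=False)
        if not self.mixtures:
            self.mixtures.append((n, cov))
            return
        n0, cov0 = self.mixtures[-1]
        cov1 = (n0 * cov0 + n * cov) / (n0 + n)   # pooled on-line update
        # Divergence shape between the model before and after the update
        d = 0.5 * np.trace((cov1 - cov0)
                           @ (np.linalg.inv(cov0) - np.linalg.inv(cov1)))
        if d < self.split_thresh or len(self.mixtures) >= self.max_mix:
            self.mixtures[-1] = (n0 + n, cov1)    # keep refining this one
        else:
            self.mixtures.append((n, cov))        # start a new component

    def weights(self):
        # Component weights w_m proportional to the data each has seen
        counts = np.array([m[0] for m in self.mixtures], dtype=float)
        return counts / counts.sum()
```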
33. Speaker Change Validation
34. Speaker Change Validation
    - BIC and quasi-GMM speaker models
      - Validate a speaker change point when the criterion on the slide is met
35. Complete System
36. Complete System
37. Experimental Results
    - Speaker datasets:
      - INESC Porto dataset:
        - Sources:
          - MPEG-7 Content Set CD1 [MPEG.N2467]
          - Broadcast news from assorted sources
          - Male and female speakers, various languages
        - 43 minutes of speaker audio
          - 16-bit @ 22.05 kHz PCM, single-channel
        - Ground truth:
          - 181 speaker changes, manually annotated
          - Speaker segment durations:
            - Maximum ≈ 120 s
            - Minimum = 2.25 s
            - Mean = 19.81 s
            - Std. dev. = 27.08 s
38. Experimental Results
    - Speaker datasets:
      - TIMIT/AUTH dataset:
        - Sources:
          - TIMIT database
            - 630 English speakers
            - 6300 sentences
        - 56 minutes of speaker audio
          - 16-bit @ 22.05 kHz PCM, single-channel
        - Ground truth:
          - 983 speaker changes, manually annotated
          - Speaker segment durations:
            - Maximum ≈ 12 s
            - Minimum = 1.139 s
            - Mean = 3.28 s
            - Std. dev. = 1.52 s
39. Experimental Results
    - Efficiency measures (definitions shown on slide)
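The measures on this slide are rendered as an image. As an assumption, the sketch below uses the definitions these acronyms usually carry in the speaker-change-detection literature; if the deck defined them differently, the formulas would change accordingly:

```python
def segmentation_scores(n_true, n_missed, n_false_alarms):
    """Common efficiency measures for change detection, stated here
    as assumed definitions: MDR = missed / ground-truth changes,
    FAR = false alarms / (correct detections + false alarms)."""
    n_correct = n_true - n_missed
    return {
        "MDR": n_missed / n_true,
        "FAR": n_false_alarms / (n_correct + n_false_alarms),
        "precision": n_correct / (n_correct + n_false_alarms),
        "recall": n_correct / n_true,
    }
```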
40. Experimental Results
    - Fine-tuning the system's parameters
      - Parameters
        - Dynamic threshold: α and number of previous frames
        - BIC: λ
        - qGMM: mixture-creation thresholds
        - Detection tolerance interval: set to [-1; +1] s
      - Tune the system towards higher FAR and lower MDR
        - Missed speaker changes cannot be recovered by subsequent processing
        - False speaker changes will hopefully be discarded by subsequent processing
          - Speaker Tracking module (future work)
            - Merges adjacent segments identified as belonging to the same speaker
41. Experimental Results
    - Dynamic threshold and BIC parameters (α and λ)
      - Best results found for α = 0.8, λ = 0.6
42. Experimental Results
    - INESC Porto dataset evaluation (1): INESC System ver.1 vs. INESC System ver.2
      - INESC System ver.1:
        - Features: LSP
        - Voiced filter disabled
        - On-line processing (real-time)
        - Uses BIC
      - INESC System ver.2:
        - Features: LSP
        - Voiced filter enabled
        - On-line processing (real-time)
        - Uses BIC
    [Results shown on slide]
43. Experimental Results
    - TIMIT/AUTH dataset evaluation (1): INESC System ver.1 vs. INESC System ver.2
      - INESC System ver.1:
        - Features: LSP
        - Voiced filter disabled
        - On-line processing (real-time)
        - Uses BIC
      - INESC System ver.2:
        - Features: LSP
        - Voiced filter enabled
        - On-line processing (real-time)
        - Uses BIC
    [Results shown on slide]
44. Experimental Results
    - INESC Porto dataset evaluation (2): INESC System ver.2 vs. AUTH System 1
      - INESC System ver.2:
        - Features: LSP
        - Voiced filter disabled
        - On-line processing (real-time)
        - Uses BIC
      - AUTH System 1:
        - Features: AudioSpectrumCentroid, AudioWaveformEnvelope
        - Multiple-pass (non-real-time)
        - Uses BIC
    [Results shown on slide]
45. Experimental Results
    - TIMIT/AUTH dataset evaluation (2): AUTH System 1 vs. INESC System ver.2
      - AUTH System 1:
        - Features: AudioSpectrumCentroid, AudioWaveformEnvelope
        - Multiple-pass (non-real-time)
        - Uses BIC
      - INESC System ver.2:
        - Features: LSP
        - Voiced filter disabled
        - On-line processing (real-time)
        - Uses BIC
    [Results shown on slide]
46. Experimental Results
    - TIMIT/AUTH dataset evaluation (3): AUTH System 2 vs. INESC System ver.2
      - AUTH System 2:
        - Features: DFT magnitude, STE, AudioWaveformEnvelope, AudioSpectrumCentroid, MFCC
        - Fast system (real-time?)
        - Uses BIC
      - INESC System ver.2:
        - Features: LSP
        - Voiced filter disabled
        - On-line processing (real-time)
        - Uses BIC
    [Results shown on slide]
47. Experimental Results
    - Time shifts of the detected speaker change points
      - Detection tolerance interval = [-1, 1] s
    [Histogram of time shifts for INESC System ver.1]
48. Achievements
    - Software
      - C++ routines
        - Numerical routines: matrix determinant, polynomial roots, Levinson-Durbin
        - LPC (adapted from Marsyas)
        - LSP
        - Divergence and Bhattacharyya Shape metrics
        - BIC
        - Quasi-GMM modeling class
      - Automatic Speaker Segmentation prototype application
        - As a library (DLL), integrated into "4VDO - Annotator"
        - As a stand-alone application
    - Reports
      - VISNET deliverables: D29, D30, D31, D40, D41
    - Publications (co-author)
      - "Speaker Change Detection using BIC: A comparison on two datasets" (accepted at ISCCSP 2006)
      - "Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches" (submitted to ICME 2006)
    [DEMO]
49. Conclusions
    - Open issues…
      - Voiced detection procedure
        - Results should improve…
      - Parameter fine-tuning
        - Dynamic threshold
        - BIC parameter
        - Quasi-GMM model
    - Further work
      - Audio features
        - Evaluate other features for speaker segmentation, tracking and identification: pitch, MFCC, …
      - Speaker tracking
        - Clustering of speaker segments
        - Evaluation: ground truth requires manual annotation work
      - Speaker identification
        - Speaker model training
        - Evaluation: ground truth requires manual annotation work
50. Contributors
    - INESC Porto
      - Rui Costa
      - Jaime Cardoso
      - Luís Filipe Teixeira
      - Sílvio Macedo
    - VISNET
      - Aristotle University of Thessaloniki (AUTH), Greece
        - Margarita Kotti
        - Emmanouil Benetos
        - Constantine Kotropoulos
51. Thank you!
    - Questions?
    - [email_address]
