Continuous Speech Keyword Spotting        In   by Jesse Sampermans (502400)
Overview: 1. Hypothesis; 2. Historic Overview; 3. Human Speech Organ; 4. Phonetics & Speech Perception; 5. Telephone Speech Coding & Compression; 6. Speech Enhancement; 7. Speech Recognition Engine; 8. Speech Analytics; 9. Conclusion
1. Hypothesis: “Is it possible, with today’s known technology, to automatically trigger a recording device with a random word in a sentence over a telephone line?”
2. Historic Overview • Early Days (1700 - 1900): First artificial speech synthesizer - late 1700’s: Russian professor Christian Kratzenstein - Resonant tube attached to pipe organ. Enhancement - mid 1800’s: Charles Wheatstone - Replace tubes with leather resonators
2. Historic Overview: Wheatstone Resonator
2. Historic Overview • Early Days (1700 - 1900): 1881: Graphophone, Alexander Graham Bell - Dictation purposes
2. Historic Overview • Early Days (1700 - 1900): 1939 World's Fair: VODER, Homer Dudley - Based on Wheatstone Resonator - Electrical & Mechanical Parts
2. Historic Overview • First Speech Recognizers (1950 - 1980): Vs. - Digit Recognition System based on speech formants - 10 syllable recognizer - Dynamic Time Warping
2. Historic Overview • First Speech Recognizers (1950 - 1980): Commercialization 1960’s: Office Automation (Voice typewriter, Trained databases) vs. Telecom Automation (Keyword Spotting, Large Audience)
2. Historic Overview • Modern evolutions (1980 - ...) - Hidden Markov Models - CMU “Sphinx” = commercial success - DARPA (Defense Advanced Research Projects Agency) investments - Battle Management
3. Human Speech Organ - Lungs: pump air - Larynx (Vocal Folds) - Articulators (Tongue, Lips, ...)
4. Phonetics & Speech Perception • Phonetics - Phoneme: smallest part of human speech - Originated in India around 2,500 years ago - IPA (International Phonetic Alphabet) - 44 phonemes in American English
4. Phonetics & Speech Perception • Speech Perception - Acoustic Cues: Voice Onset Time: Unaspirated plosives (near 0 ms) ...
4. Phonetics & Speech Perception • Speech Perception - Variations in speech: Phonetic environment can alter the sound...
5. Telephone Speech Coding & Compression • Early days: Analog - Speech converted to control voltage in the phone - Pass...
5. Telephone Speech Coding & Compression • Now: Mobile Phones - GSM: Speech - UMTS: Data - Frequency content of 3100 Hz ...
6. Speech Enhancement • Pre-Filtering - Frequency based - Filter banks - Commonly known as an equalizer ...
6. Speech Enhancement • Noise-Filtering - Spectral Subtraction - Simple and effective - Uses the amplitud...
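The slide text on spectral subtraction is cut off, so the following is only a minimal sketch of the general idea, not the presenter's method. It assumes magnitude spectra have already been computed per frame (for example with an FFT elsewhere); the helper names `average_spectrum` and `spectral_subtract` and the `floor` parameter are hypothetical.

```python
def average_spectrum(frames):
    """Average several magnitude spectra (one list per frame) bin by bin."""
    count = len(frames)
    return [sum(bin_values) / count for bin_values in zip(*frames)]


def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude from a noisy magnitude spectrum.

    Bins that would become negative are clamped to `floor`.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor) for s, n in zip(noisy_mag, noise_mag)]


# Assumption: these frames were taken from a speech-free stretch of the call.
noise_frames = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
noise_estimate = average_spectrum(noise_frames)

noisy_frame = [5.0, 1.0, 4.0]
cleaned = spectral_subtract(noisy_frame, noise_estimate)
```

The noise estimate is taken from frames assumed to contain no speech. Clamping at a floor is one common way to avoid negative magnitudes.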
6. Speech Enhancement • Spectral Restoration - Fixes dropouts in the signal. - Works on a small scale - Adds filtered full b...
7. Speech Recognition Engine • Dynamic Time Warping (DTW) - Mostly used in the early days - Fast & simple but not accurate ...
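As a rough illustration of Dynamic Time Warping, the sketch below aligns two one-dimensional feature sequences with the classic dynamic-programming recurrence. It assumes each frame is a single number and uses absolute difference as the local distance; a real recognizer would typically compare feature vectors instead. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


template = [1.0, 2.0, 3.0, 2.0, 1.0]
# The same contour spoken at half speed: every frame appears twice.
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 1.0, 3.0, 1.0, 3.0]

same_word_cost = dtw_distance(template, slow)
different_word_cost = dtw_distance(template, other)
```

The time-stretched copy aligns with zero cost, while the unrelated contour does not, which is the property that made DTW attractive for matching isolated-word templates.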
7. Speech Recognition Engine • Statistically Based Speech Recognition: Hidden Markov Models - Heart and soul of statisticall...
7. Speech Recognition Engine • Statistically Based Speech Recognition: Acoustic Model - Gathers statistical information fo...
7. Speech Recognition Engine • Statistically Based Speech Recognition: Language Model - Tries to predict the next word - ...
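To make "tries to predict the next word" concrete, here is a minimal sketch of a bigram language model built from raw counts. The slides do not say which kind of language model is meant, so the bigram choice, the toy corpus, and the names `train_bigrams` and `predict_next` are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy training corpus (hypothetical).
corpus = [
    "start the recording",
    "start the recording now",
    "stop the call",
]
model = train_bigrams(corpus)
guess = predict_next(model, "the")
```

A production model would normally smooth these counts and use longer contexts. The sketch only shows the counting and lookup step.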
8. Speech Analytics - Separate engine - Analyze gender, age, identity and topic discussed • Audio Mining - Analyzes audio a...
8. Speech Analytics • Keyword Spotting: 2 kinds: Isolated word - clearly enforced breaks - non-spontaneous ...
8. Speech Analytics • Keyword Spotting: 2 methods - filler method (garbage-method): entire string of speech is analyzed ...
9. Conclusion: “Is it possible, with today’s known technology, to automatically trigger a recording device with a rand...
9. Conclusion - Keyword spotting algorithm based on a statistically based SRE - Appropriate acoustic model - ISIP Switchboard...
Q&A
Jesse Sampermans Research Project Presentation
Published in: Education
  1. 1. Continuous Speech Keyword Spotting In by Jesse Sampermans (502400)
  12. 12. Overview: 1. Hypothesis; 2. Historic Overview; 3. Human Speech Organ; 4. Phonetics & Speech Perception; 5. Telephone Speech Coding & Compression; 6. Speech Enhancement; 7. Speech Recognition Engine; 8. Speech Analytics; 9. Conclusion
  15. 15. 1. Hypothesis: “Is it possible, with today’s known technology, to automatically trigger a recording device with a random word in a sentence over a telephone line?”
  26. 26. 2. Historic Overview • Early Days (1700 - 1900): First artificial speech synthesizer - late 1700’s: Russian professor Christian Kratzenstein - Resonant tube attached to pipe organ. Enhancement - mid 1800’s: Charles Wheatstone - Replace tubes with leather resonators
  27. 27. 2. Historic Overview: Wheatstone Resonator
  31. 31. 2. Historic Overview • Early Days (1700 - 1900): 1881: Graphophone, Alexander Graham Bell - Dictation purposes
  36. 36. 2. Historic Overview • Early Days (1700 - 1900): 1939 World's Fair: VODER, Homer Dudley - Based on Wheatstone Resonator - Electrical & Mechanical Parts
  41. 41. 2. Historic Overview • First Speech Recognizers (1950 - 1980): Vs. - Digit Recognition System based on speech formants - 10 syllable recognizer - Dynamic Time Warping
  50. 50. 2. Historic Overview • First Speech Recognizers (1950 - 1980): Commercialization 1960’s: Office Automation (Voice typewriter, Trained databases) vs. Telecom Automation (Keyword Spotting, Large Audience)
  56. 56. 2. Historic Overview • Modern evolutions (1980 - ...) - Hidden Markov Models - CMU “Sphinx” = commercial success - DARPA (Defense Advanced Research Projects Agency) investments - Battle Management
  64. 64. 3. Human Speech Organ - Lungs: pump air - Larynx (Vocal Folds) - Articulators (Tongue, Lips, ...)
  74. 74. 4. Phonetics & Speech Perception
      • Phonetics
        - Phoneme: smallest distinctive unit of human speech
        - Phonetic study originated in India roughly 2,500 years ago (around 500 BC)
        - IPA (International Phonetic Alphabet)
        - About 44 phonemes in American English
  87. 87. 4. Phonetics & Speech Perception
      • Speech Perception
        - Acoustic Cues
          Voice Onset Time:
            Unaspirated plosives (near 0 ms)
            Aspirated plosives (> 30 ms)
            Voiced plosives (< 0 ms)
        - Speech Segmentation
          Identifying boundaries between words (lexical) or phonemes (phonetic)
          [k] in “kit” and “caught” and [i] in “kit” and “kick”
        - Categorical Perception
          Identifying words from different speakers
          Categorizes phonemes in the brain
          Only native speakers
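The voice onset time ranges on the slide above can be read as a simple decision rule. The sketch below is only an illustration: the function name `classify_plosive` is hypothetical, and treating everything from 0 ms up to the 30 ms threshold as "unaspirated" is an assumption, since the slide only says "near 0 ms".

```python
def classify_plosive(vot_ms, aspirated_threshold_ms=30.0):
    """Classify a plosive from its voice onset time (VOT) in milliseconds.

    Assumption: any VOT between 0 ms and the threshold counts as
    "unaspirated"; the slide itself only gives "near 0 ms".
    """
    if vot_ms < 0:
        # Voicing starts before the release of the stop.
        return "voiced"
    if vot_ms > aspirated_threshold_ms:
        # Long lag between release and voicing.
        return "aspirated"
    return "unaspirated"


for vot in (-80.0, 5.0, 60.0):
    print(vot, "->", classify_plosive(vot))
```

Real VOT boundaries differ per language and speaker, which is one reason categorical perception is tied to native speakers.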
  97. 97. 4. Phonetics & Speech Perception
      • Speech Perception
        - Variations in speech
          Phonetic environment can alter the sound of a phoneme
            [o] in “Bob” and [u] in “vulture”
          Speed of speech
            Fast → shorter vowels, less pronounced stops, bad articulation
          Speaker identity
            - Gender and age differences
            - Vocal cord size and hormone levels
            - Place of birth
  107. 107. 5. Telephone Speech Coding & Compression
      • Early days: Analog
        - Speech converted to control voltage in the phone
        - Passed through copper lines → crosstalk
      • 1980s - present day: Digital
        - Main advantages: longer distance / greater speed / less carrier noise
        - Use of optical fiber lines → no crosstalk
  118. 118. 5. Telephone Speech Coding & Compression
      • Now: Mobile Phones
        - GSM: Speech
        - UMTS: Data
        - Frequency content of about 3100 Hz (roughly 300 - 3400 Hz)
        - Compressed full-rate (13 kbit/s) or half-rate (5.6 kbit/s) with 8 kHz sampling rate
      • Technique: Linear Predictive Coding (LPC)
        - Formants (resonances of the vocal tract) are estimated and removed from the speech
        - What is left = residue (a simple excitation signal) → cheap to encode and transmit
        - Formants are synthesized again in the receiver's cellphone
        - Of great interest for speech recognition
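The LPC idea on the slide above (remove the formants, keep the residue, re-synthesize at the receiver) can be sketched in a few lines. This is a minimal illustration and not the actual GSM codec: the test signal, the model order of 2 and all function names are assumptions made for the example.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[0..order-1] with x[n] ~ sum(a[k] * x[n-1-k])."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, a):
    """Inverse filtering: subtract the prediction, leaving the residue."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]


def lpc_synthesize(residual, a):
    """Receiver side: put the resonances back onto the residue."""
    y = []
    for n, e in enumerate(residual):
        y.append(e + sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0))
    return y


def demo_signal(n=200):
    """Hypothetical 'voiced' signal: a pulse train through one resonance."""
    x = []
    for i in range(n):
        pulse = 1.0 if i % 40 == 0 else 0.0
        prev1 = x[i - 1] if i >= 1 else 0.0
        prev2 = x[i - 2] if i >= 2 else 0.0
        x.append(1.3 * prev1 - 0.8 * prev2 + pulse)
    return x


signal = demo_signal()
coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
residue = lpc_residual(signal, coeffs)
rebuilt = lpc_synthesize(residue, coeffs)
print("coefficients:", coeffs)
```

In this sketch the residue carries less energy than the original signal, which is what makes it cheaper to transmit; a real codec additionally quantizes both the coefficients and the residue.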
  129. 129. 6. Speech Enhancement
      • Pre-Filtering
        - Frequency based
        - Filter banks
        - Commonly known as an equalizer
        - Used adaptively to suppress unwanted frequencies
        - Boost low-end lost due to telephone coding
        - Improve audibility
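A frequency-based pre-filter of the kind listed above can be sketched as a per-band gain applied in the frequency domain. The example below is an assumption-laden illustration: it uses a naive DFT instead of a real filter bank, and the band edges, gains and the name `equalize` are all hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short demo frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def equalize(x, sample_rate, band_gains):
    """Scale frequency bands; band_gains is a list of (low_hz, high_hz, gain).

    Frequencies outside every listed band keep a gain of 1.
    """
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bin frequency, mirrored for the negative-frequency half.
        freq = min(k, n - k) * sample_rate / n
        for low, high, gain in band_gains:
            if low <= freq < high:
                spectrum[k] *= gain
                break
    return idft(spectrum)


# Hypothetical usage: boost the low end and suppress a band around 2 kHz.
rate, length = 8000, 80
low_tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(length)]
high_tone = [math.sin(2 * math.pi * 2000 * t / rate) for t in range(length)]
mixed = [a + b for a, b in zip(low_tone, high_tone)]
filtered = equalize(mixed, rate, [(0, 300, 2.0), (1500, 2500, 0.0)])
```

An adaptive version would recompute `band_gains` per frame from the incoming signal instead of fixing them up front.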
  143. 143. 6. Speech Enhancement
      • Noise-Filtering
        - Spectral Subtraction
          - Simple and effective
          - Uses the amplitude of the noise
          - “Underwater” effect if overused
        - Wiener Filtering
          - Developed in the 1940s by Norbert Wiener
          - Uses Fourier transform to detect noise
          - Stationary (non-adaptive)
          - Uses deconvolution to remove noise
        - Signal Subspace approach
          - Represents noise and original signal in “layers”
          - Assigns vectors to high and low amplitudes
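Spectral subtraction, the first technique above, can be sketched as: estimate the noise amplitude per frequency bin, subtract it from the noisy amplitude, and keep the noisy phase. The code below is a minimal single-frame illustration; the steady hum used as "noise", the spectral floor of 2 % and the function names are assumptions for the example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short demo frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(frame, noise_magnitude, over_subtraction=1.0, floor=0.02):
    """Subtract an estimated noise amplitude from each bin of one frame.

    The floor keeps a small fraction of the original amplitude; raising
    over_subtraction removes more noise but risks audible artifacts.
    """
    cleaned = []
    for k, value in enumerate(dft(frame)):
        magnitude = abs(value)
        reduced = max(magnitude - over_subtraction * noise_magnitude[k],
                      floor * magnitude)
        cleaned.append(cmath.rect(reduced, cmath.phase(value)))
    return idft(cleaned)


# Hypothetical demo: a tone ("speech") plus a weaker steady hum ("noise").
N = 64
speech = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
hum = [0.1 * math.sin(2 * math.pi * 20 * t / N) for t in range(N)]
noisy = [s + h for s, h in zip(speech, hum)]
noise_magnitude = [abs(v) for v in dft(hum)]  # estimate from a noise-only frame
cleaned = spectral_subtract(noisy, noise_magnitude)
```

Here the noise estimate comes from the same hum that was added, which is the ideal case; with real recordings the estimate is taken from speech pauses and is only approximate.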
  151. 151. 6. Speech Enhancement
      • Spectral Restoration
        - Fixes dropouts in the signal
        - Works on a small scale
        - Adds filtered full-band noise in the gaps
        - Listener perceives the signal as whole
        - Bad results with speech recognition engines (SREs)
        - Most SREs can fill the gap in a different way
  165. 165. 7. Speech Recognition Engine
      • Dynamic Time Warping (DTW)
        - Mostly used in the early days
        - Fast & simple but not accurate with complex speech
        - Measures similarity between sequences that differ in timing and speed
        - e.g. a video is played twice, once fast and once slow. A DTW-based algorithm will see that it is the same video
        - Compares speech to a speech database
        - Needs training most of the time
        - Does not use phonemes
        - Uses interval-based vectors
        - Vector taken at the wrong time = bad representation
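The "same video played fast and slow" example above maps directly onto the classic DTW recurrence. The sketch below assumes one number per interval as the feature (a real engine would compare feature vectors) and uses the absolute difference as local cost; both are simplifications for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same contour, spoken slowly
fast = [0, 1, 2, 1, 0]                   # same contour, spoken quickly
other = [2, 2, 0, 0, 2]                  # a different contour

print(dtw_distance(slow, fast))   # 0.0: identical apart from speed
print(dtw_distance(slow, other))
```

Matching against a speech database then amounts to computing this distance against every stored template and picking the smallest.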
  174. 174. 7. Speech Recognition Engine
      • Statistically Based Speech Recognition
        Hidden Markov Models
        - Heart and soul of statistically based SREs
        - Allows use by people with different accents / dialects
        - Markov Model: “predict” the future by knowing the current state
        - Hidden Markov Model: the current state is hidden and is “predicted” from what is observed and from what can follow (the “future”)
        - Future = grammar file
        - Statistically rules out possibilities as the word progresses
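How an HMM "rules out possibilities as the word progresses" can be illustrated with Viterbi decoding on a toy model of the word “kit”. Everything in the sketch is an assumption made for the example: the three states, the two acoustic labels ("burst", "vowel") and all probabilities are invented, not taken from a trained acoustic model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for a list of observations."""
    best = [{s: start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0)
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev, score = max(
                ((p, best[-1][p] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score * emit_p[s].get(obs, 0.0)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))


# Toy left-to-right model for "kit" (all numbers are made up).
states = ["k", "i", "t"]
start_p = {"k": 1.0}
trans_p = {"k": {"k": 0.5, "i": 0.5},
           "i": {"i": 0.5, "t": 0.5},
           "t": {"t": 1.0}}
emit_p = {"k": {"burst": 0.9, "vowel": 0.1},
          "i": {"burst": 0.1, "vowel": 0.9},
          "t": {"burst": 0.9, "vowel": 0.1}}

path = viterbi(["burst", "vowel", "vowel", "burst"],
               states, start_p, trans_p, emit_p)
print(path)  # ['k', 'i', 'i', 't']
```

Both plosives emit the same "burst" label, yet the first is decoded as “k” and the last as “t”, because the transition structure excludes the other readings.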
  180. 180. 7. Speech Recognition Engine
      • Statistically Based Speech Recognition
        Acoustic Model
        - Gathers statistical information for the HMM
        - Does this by analyzing a speech corpus (read or continuous)
        - Different corpora exist (by language, gender, frequency range)
        - ISIP Switchboard corpus: 240 h of speech, 500 talkers, telephone quality
  186. 186. 7. Speech Recognition Engine
      • Statistically Based Speech Recognition
        Language Model
        - Tries to predict the next word
        - Uses a grammar file
        - E.g. “Phone Steve Young; Phone Young; Phone Steve; Phone Young Steve”
        - Multiple grammar files can be combined to predict entire sentences
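The grammar-file example above can be turned into a small next-word predictor. The sketch assumes the grammar is just a list of allowed phrases; real grammar formats are richer, and the names `GRAMMAR` and `allowed_next_words` as well as the `<end>` marker are hypothetical.

```python
# The four phrases from the slide, treated as the whole grammar.
GRAMMAR = [
    "phone steve young",
    "phone young",
    "phone steve",
    "phone young steve",
]


def allowed_next_words(prefix, grammar=GRAMMAR):
    """Words the grammar allows after the words recognized so far.

    "<end>" means the phrase may be complete at this point.
    """
    words = prefix.lower().split()
    candidates = set()
    for phrase in grammar:
        tokens = phrase.split()
        if tokens[:len(words)] == words:
            if len(tokens) > len(words):
                candidates.add(tokens[len(words)])
            else:
                candidates.add("<end>")
    return candidates


print(allowed_next_words("Phone"))        # steve or young
print(allowed_next_words("Phone Steve"))  # young, or the phrase ends
```

An empty result means the recognized prefix is outside the grammar, so the recognizer can discard that hypothesis.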
  198. 198. 8. Speech Analytics
      - Separate engine
      - Analyze gender, age, identity and topic discussed
      • Audio Mining
        - Analyzes the audio as soon as the signal comes in
        - Useful with background noise
        - Matches source to a speech database
          e.g.: Emotion detection with customer services
                Music recognition software (“Shazam”, “Soundhound”)
  208. 208. 8. Speech Analytics
      • Keyword Spotting
        2 kinds:
        Isolated word
          - clearly enforced breaks
          - non-spontaneous
          - user knows they are talking to an SRE
        Unconstrained spotting
          - continuous speech KWS
          - difficult due to speech segmentation
  215. 215. 8. Speech Analytics
      • Keyword Spotting
        2 methods
        - filler method (garbage-method): entire string of speech is analyzed
          excess words too (= garbage)
        - sliding model: interval-based analyzing
          uses Hidden Markov Models & grammar file
          resource intensive
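The sliding, interval-based method above can be sketched as a window that moves over the incoming feature stream and is scored against a keyword model at every position. Note the substitution: the slide mentions Hidden Markov Models and a grammar file, while this sketch scores each window with a DTW template distance as a simpler stand-in. The feature values, threshold and function names are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def spot_keyword(stream, template, threshold, window=None):
    """Slide a window over the stream; report (start, score) for each hit.

    The score is the DTW distance divided by the window length, so the
    threshold is roughly "average mismatch per interval".
    """
    window = window or len(template)
    hits = []
    for start in range(len(stream) - window + 1):
        score = dtw_distance(stream[start:start + window], template) / window
        if score <= threshold:
            hits.append((start, score))
    return hits


# Hypothetical features: the keyword contour is embedded at position 3.
keyword = [1, 3, 5, 3, 1]
stream = [0, 0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 2, 2, 0]
hits = spot_keyword(stream, keyword, threshold=0.2)
print(hits)
```

Every window position costs a full model evaluation, which illustrates why the slide calls this approach resource intensive.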
  222. 222. 9. Conclusion
      “Is it possible, with today’s known technology, to automatically trigger a recording device with a random word in a sentence over a telephone line?”
      Answer: YES
  229. 229. 9. Conclusion
      - Keyword spotting algorithm based on a statistically based SRE
      - Appropriate acoustic model
      - ISIP Switchboard speech corpus: telephone compressed source
      - Grammar file? → Maybe but will be big
      - Normal speech corpus? → A lot of pre-filtering / might not be successful
      - LPC? → artifacts in output due to 2x LPC filtering
  230. 230. Q&A
