Research Developments and Directions in Speech Recognition and Understanding
2009/09/05 @ JCMG, NTU

Reference:
* J. M. Baker et al., “Research developments and directions in speech recognition and understanding, Part 1,” IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, 2009.
Off-Topic
- “Conditional Random Fields (CRF)” will be presented in my next turn
  - C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
  - C. Sutton and A. McCallum, “An introduction to conditional random fields for relational learning,” Introduction to Statistical Relational Learning, MIT Press, 2007.
  - Rahul Gupta, “Conditional random fields,” Dept. of Computer Science and Engg., IIT Bombay, India.
Outline
- Introduction
- Significant Developments in Speech Recognition and Understanding
- Grand Challenges: Major Potential Programs of Research
- Improving Infrastructure for Future ASR Research
Introduction (1/2)
- It is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past
- Purposes of this article:
  - To advance research
  - To elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding
Introduction (2/2)
- Part 1:
  - Focuses on historically significant developments in the ASR area, including several major research efforts that were guided by different funding agencies
  - Suggests general areas in which to focus research
- Part 2:
  - Explores in more detail several new avenues holding promise for substantial improvements in ASR performance
  - Entails cross-disciplinary research and work on realistic tasks
Outline
- Introduction
- Significant Developments in Speech Recognition and Understanding
- Grand Challenges: Major Potential Programs of Research
- Improving Infrastructure for Future ASR Research
Significant Developments
- ASR still remains far from being a solved problem, despite its many achievements since the mid-1970s
- We expect that further research and development will enable us to create increasingly powerful systems, deployable on a worldwide basis
- Five highlights of major developments in ASR:
  - Infrastructure
  - Knowledge representation
  - Models and algorithms
  - Search
  - Metadata
Infrastructure (1/3)
- Hardware
  - Moore’s Law: since 1975, ASR researchers have been able to run increasingly complex algorithms in sufficiently short time frames to make great progress
  - e.g., meaningful experiments that can be done in less than a day
Infrastructure (2/3)
- Corpora
  - The availability of common speech corpora has been critical, allowing the creation of complex systems of ever-increasing capabilities
  - Speech is a highly variable signal characterized by many parameters, so large corpora are critical to modeling it well enough for automated systems to achieve proficiency
  - Sources: the National Institute of Standards and Technology (NIST), the Linguistic Data Consortium (LDC)
Infrastructure (3/3)
- Benchmark evaluations and standards
  - Nurtured by NIST and others
- Research tools
  - Carnegie Mellon University language model (CMU LM) toolkit
  - Hidden Markov Model Toolkit (HTK)
  - Sphinx, and the SRI Language Modeling (SRILM) toolkit
- Extensive research support
  - The U.S. Defense Advanced Research Projects Agency (DARPA) and others
Knowledge Representation (1/3)
- Speech signal representations
  - Perceptually motivated mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) coefficients
  - Normalizations via cepstral mean subtraction (CMS), relative spectral (RASTA) filtering, and vocal tract length normalization (VTLN) (a minimal CMS sketch follows this slide)
- Graph representations
  - Allow multiple sources of knowledge to be incorporated into a common probabilistic framework
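Cepstral mean subtraction is the simplest of these normalizations: subtract the utterance-level mean cepstral vector from every frame to suppress stationary channel effects. A minimal NumPy sketch; the 300x13 random "MFCC" matrix is only a placeholder:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean from each cepstral dimension.

    features: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    Removing the mean along the time axis suppresses stationary
    convolutional (channel) effects.
    """
    return features - features.mean(axis=0, keepdims=True)

# Placeholder "MFCC" frames: 300 frames of 13 coefficients.
mfcc = np.random.randn(300, 13)
normalized = cepstral_mean_subtraction(mfcc)
print(normalized.mean(axis=0))  # approximately zero for every coefficient
```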
Knowledge Representation (2/3)
- Graph representations: noncompositional methods
  - Multiple speech streams
  - Multiple probability estimators
  - Multiple recognition systems combined at the hypothesis level, e.g., Recognition Output Voting Error Reduction (ROVER) (a simplified voting sketch follows this slide)
  - Multipass systems with increasing constraints (bigram versus four-gram, within-word dependencies versus cross-word, etc.)
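ROVER aligns the word sequences from several recognizers into a common word transition network and then votes slot by slot. The alignment is the substantial part of the method; the sketch below covers only the voting stage and assumes the hypotheses have already been aligned into equal-length slot sequences, with None marking a deletion:

```python
from collections import Counter

def vote_aligned_hypotheses(aligned):
    """Pick the most frequent word in each alignment slot (ROVER-style voting).

    aligned: list of hypotheses, each a list of words of equal length,
             with None standing in for a deletion in that system's output.
    Returns the voted word sequence, dropping slots where None wins.
    """
    voted = []
    for slot in zip(*aligned):
        winner, _ = Counter(slot).most_common(1)[0]
        if winner is not None:
            voted.append(winner)
    return voted

# Hypothetical aligned outputs from three recognizers:
hyps = [
    ["the", "cat", "sat", None, "mat"],
    ["the", "cat", "sad", "on", "mat"],
    ["a",   "cat", "sat", "on", "mat"],
]
print(vote_aligned_hypotheses(hyps))  # ['the', 'cat', 'sat', 'on', 'mat']
```

The full ROVER procedure can also weight the vote by word confidence scores; plain frequency voting is the simplest variant.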
Knowledge Representation (3/3)
- Multiple algorithms
  - Feature-based transformations: heteroscedastic linear discriminant analysis (HLDA), feature-space minimum phone error (fMPE) (a simplified LDA sketch follows this slide)
  - Neural net-based features (Tandem) *

* H. Hermansky et al., “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. IEEE ICASSP, Istanbul, Turkey, June 2000, vol. 3, pp. 1635-1638.
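HLDA and fMPE estimation are too involved for a few lines, but the core idea behind HLDA, a class-discriminative linear projection of (typically spliced) acoustic features, can be illustrated with ordinary LDA from scikit-learn. The 39-dimensional features, 10 phone classes, and random data below are purely hypothetical; HLDA differs mainly in dropping LDA's shared-covariance assumption:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: 1000 spliced 39-dimensional feature vectors,
# each labeled with one of 10 phone classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 39))
labels = rng.integers(0, 10, size=1000)

# LDA projects to at most n_classes - 1 discriminative dimensions.
lda = LinearDiscriminantAnalysis(n_components=9)
projected = lda.fit_transform(features, labels)
print(projected.shape)  # (1000, 9)
```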
Models and Algorithms (1/3)
- Statistical methods: stochastic processing with hidden Markov models (HMMs)
  - The most significant paradigm shift for speech-recognition progress
  - Introduced in the early 1970s *
  - The expectation-maximization (EM) algorithm and the forward-backward (Baum-Welch) algorithm have been the principal means by which HMMs are trained from data (a forward-backward sketch follows this slide)
- Presenter’s note: what is the difference between EM and Baum-Welch? Baum-Welch is the EM algorithm specialized to HMMs; the forward-backward procedure computes the E-step state-occupancy statistics that the M-step uses to re-estimate the parameters.

* H. Poor, An Introduction to Signal Detection and Estimation (Springer Texts in Electrical Engineering), J. Thomas, Ed. New York: Springer-Verlag, 1988.
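To make the EM/Baum-Welch connection concrete, here is a minimal forward-backward E-step for a discrete-observation HMM. The 2-state, 2-symbol model at the bottom is a toy example; a real implementation would also scale the recursions or work in log space to avoid underflow:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """E-step of Baum-Welch for a discrete HMM (no scaling, for clarity).

    A:   (S, S) state transition probabilities
    B:   (S, V) emission probabilities
    pi:  (S,)   initial state distribution
    obs: sequence of observation symbol indices
    Returns gamma, a (T, S) array of state posteriors P(state_t = s | obs).
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))

    # Forward pass: alpha[t, s] = P(obs[0..t], state_t = s)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, s] = P(obs[t+1..T-1] | state_t = s)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Tiny hypothetical 2-state, 2-symbol example:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward_backward(A, B, pi, [0, 1, 0]))
```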
Models and Algorithms (2/3)
- N-gram language models have proved remarkably powerful and resilient (a toy bigram example follows this slide)
- Decision trees have been widely used to categorize sets of features, such as pronunciations, from training data
- Statistical discriminative training techniques are typically based on maximum mutual information (MMI) and minimum-error criteria for the model parameters
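A toy bigram model with add-one smoothing, as a concrete illustration of the n-gram idea; the two-sentence corpus is invented for the example and has nothing to do with the toolkits named earlier:

```python
from collections import defaultdict

class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = defaultdict(int)   # counts of the conditioning word
        self.bigrams = defaultdict(int)    # counts of (prev, cur) pairs
        for sent in sentences:
            words = ["<s>"] + sent + ["</s>"]
            for prev, cur in zip(words, words[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1
        self.vocab_size = len({w for s in sentences for w in s} | {"</s>"})

    def prob(self, prev, cur):
        # Add-one smoothing: every bigram gets at least a small probability.
        return (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)

# Hypothetical toy corpus:
lm = BigramLM([["speech", "is", "variable"], ["speech", "is", "data"]])
print(lm.prob("speech", "is"))   # seen bigram, relatively high probability
print(lm.prob("is", "speech"))   # unseen bigram, smoothed low probability
```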
Models and Algorithms (3/3)
- Deterministic approaches include corrective training * and some neural network techniques
- Adaptation: to accommodate a wide range of variable conditions for the channel, environment, speaker, vocabulary, topic domain, etc.
  - Training can take place on the basis of small amounts of data from new tasks or domains that provide additional training material
  - Maximum a posteriori (MAP) estimation (a MAP mean-update sketch follows this slide)
  - Maximum likelihood linear regression (MLLR)
  - Eigenvoices

* L. R. Bahl et al., “Estimating hidden Markov model parameters so as to maximize speech recognition accuracy,” IEEE Trans. Speech Audio Processing, vol. 1, no. 1, pp. 77-83, 1993.
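The standard MAP update for a Gaussian mean interpolates the prior (e.g., speaker-independent) mean with the sample mean of the adaptation data, weighted by how much data is available. A minimal sketch; the relevance factor tau is an assumed tuning constant:

```python
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP adaptation of a single Gaussian mean.

    prior_mean:        speaker-independent mean vector
    adaptation_frames: (N, D) frames assigned to this Gaussian
    tau:               relevance factor; larger tau trusts the prior more
    Returns (tau * prior_mean + N * sample_mean) / (tau + N).
    """
    n = len(adaptation_frames)
    if n == 0:
        return prior_mean
    sample_mean = np.mean(adaptation_frames, axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)

# Hypothetical example: 5 adaptation frames pull the mean toward the new speaker.
prior = np.zeros(3)
frames = np.ones((5, 3))
print(map_adapt_mean(prior, frames, tau=10.0))  # [0.333 0.333 0.333]
```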
Search
- Key decoding or search strategies
  - Stack decoding (A* search)
    - Derived from communications and information theory
  - Viterbi or N-best search (a Viterbi sketch follows this slide)
    - Broadly applied to search alternative hypotheses
    - Derives from dynamic programming in the 1950s and was subsequently used in speech applications from the 1960s to the 1980s and beyond
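A minimal log-space Viterbi decoder for the same kind of discrete HMM as in the forward-backward sketch above; real ASR decoders add beam pruning, lexical trees, and language model scores on top of this dynamic program:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state sequence for a discrete HMM (log-space Viterbi).

    A: (S, S) transitions, B: (S, V) emissions, pi: (S,) initial probs,
    obs: sequence of observation symbol indices.
    """
    T, S = len(obs), len(pi)
    log_delta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = np.zeros((T, S), dtype=int)

    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A)   # scores[i, j]: best path into j via i
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]])

    # Trace back the best path from the best final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Reusing the tiny 2-state example:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 0, 0]))
```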
Metadata
- Metadata: data about data
- Automatic determination of sentence and speaker segmentation, as well as punctuation, has become a key feature in some processing systems
- Starting in the early 1990s, audio indexing and mining have enabled high-performance automatic topic detection and tracking, as well as applications for language and speaker identification
Outline
- Introduction
- Significant Developments in Speech Recognition and Understanding
- Grand Challenges: Major Potential Programs of Research
- Improving Infrastructure for Future ASR Research
Grand Challenges
- Definition: ambitious but achievable three- to five-year research program initiatives that will significantly advance the state of the art in speech recognition and understanding
- Six potential programs:
  - Everyday audio
  - Rapid portability to emerging languages
  - Self-adaptive language capabilities
  - Detection of rare, key events
  - Cognition-derived speech and language systems
  - Spoken-language comprehension
Everyday Audio (1/3)
- “Everyday audio” covers the wide range of speech, speaker, channel, and environmental conditions that people typically encounter and routinely adapt to when recognizing and responding to speech signals
- Difficulties
  - ASR systems deliver significantly degraded performance when they encounter audio signals that differ from the limited conditions under which they were originally developed and trained
Everyday Audio (2/3)
- Goals:
  - To create and develop systems that would be much more robust against variability and shifts in acoustic environments, reverberation, external noise sources, communication channels, speaker characteristics, and language characteristics
  - To deliver accurate and useful speech transcripts automatically under many more environments and diverse circumstances than is now possible, thereby enabling many more applications
Everyday Audio (3/3)
- Synergies
  - Natural-language processing
  - Information retrieval
  - Cognitive science
Rapid Portability to Emerging Languages (1/3)
- The language models in today’s state-of-the-art ASR systems are built from a large collection of domain-specific speech and text examples
- Difficulties:
  - For many languages, this set of language resources is often not readily available
- Goals:
  - To create spoken-language technologies that are rapidly portable
Rapid Portability to Emerging Languages (2/3)
- To prepare for rapid development of such spoken-language systems, a new paradigm is needed to study speech and acoustic units that are more language-universal than language-specific phones
  - Cross-language acoustic modeling of speech and acoustic units for a new target language
  - Cross-lingual lexical modeling of word pronunciations for a new language
  - Cross-lingual language modeling
Rapid Portability to Emerging Languages (3/3)
- Adaptation
  - Cross-language features, such as language clustering and universal acoustic modeling, could be utilized to facilitate rapid adaptation of acoustic and language models
  - Bootstrapping techniques are also key to building preliminary systems
  - What is the minimum amount of supervised label information required to create a reasonable system?
- Synergies
  - Machine translation
  - Natural-language processing
  - Information retrieval
Self-Adaptive Language Capabilities (1/4)
- State-of-the-art systems for speech transcription, speaker verification, and language identification are all based on statistical models estimated from labeled training data
- Difficulties:
  - Environments change
  - Contrast with humans
- Goals:
  - To create self-adaptive (or self-learning) speech technology
Self-Adaptive Language Capabilities (2/4)
- There is a need for learning at all levels of speech and language processing
  - To cope with changing environments, non-speech sounds, speakers, pronunciations, dialects, accents, words, meanings, and topics, to name but a few sources of variation over the lifetime of a deployed system
- The system would engage in automatic pattern discovery, active learning, and adaptation
Self-Adaptive Language Capabilities (3/4)
- Research in this area must address both the learning of new models and the integration of such models into preexisting knowledge sources
- An important aspect of learning is being able to discern when something has been learned and how to apply the result
- Learning from multiple concurrent modalities may also be necessary
- Exploitation of unlabeled or partially labeled data would be necessary for such learning
Self-Adaptive Language Capabilities (4/4)
- Synergies
  - Natural-language processing
  - Information retrieval
  - Cognitive science
Detection of Rare, Key Events (1/4)
- Difficulties:
  - Current ASR systems have difficulty handling unexpected (and thus often the most information-rich) lexical items
    - Interjections
    - Foreign words
    - Out-of-vocabulary words
  - A common outcome is that high-value terms are overconfidently misrecognized as some other common, similar-sounding word
- Goals:
  - To create systems that reliably detect when they do not know a valid word
Detection of Rare, Key Events (2/4)
- A clue to the occurrence of such error events is the mismatch between an analysis of the purely sensory signal, unencumbered by prior knowledge, and the prior knowledge often encoded in a language model
- A key component of this research would therefore be the development of novel confidence measures and accurate models of uncertainty based on the discrepancy between sensory evidence and a priori beliefs
Detection of Rare, Key Events (3/4)
- A natural sequel to detection of such events would be to transcribe them phonetically when the system is confident that its word hypothesis is unreliable, and to devise error-correction schemes
- One immediate application that such detection would enable is subword (e.g., phonetic) indexing and search of speech regions where the system suspects the presence of errors
Detection of Rare, Key Events (4/4)
- Synergies
  - Natural-language processing
  - Information retrieval
Cognition-Derived Speech and Language Systems (1/3)
- The focus of this project would be to understand and emulate relevant human capabilities and to incorporate these strategies into automatic speech systems
- Difficulties
  - Because it is not possible to predict and collect separate data for any and all types of speech, topic domains, etc., it is important to enable automatic systems to learn and generalize even from single instances (episodic learning) or limited samples of data, so that new or changed signals (e.g., accented speech, noise adaptation) can be correctly understood
Cognition-Derived Speech and Language Systems (2/3)
- An additional impetus for looking now at how the brain processes speech and language is provided by the dramatic improvements made over the last several years in brain and cognitive science, especially with regard to the cortical imaging of speech and language processing
- It is now possible to follow instantaneously the different paths and courses of cortical excitation as a function of differing speech and language stimuli
Cognition-Derived Speech and Language Systems (3/3)
- Goals:
  - To understand how significant cortical information-processing capabilities beyond signal processing are achieved
  - To leverage that knowledge in our automated speech and language systems
- The ramifications of such an understanding could be very far-reaching
- Synergies
  - Brain and cognitive science
  - Natural-language processing
  - Information retrieval
Spoken-Language Comprehension (1/3)
- To achieve a broad level of speech-understanding capabilities, it is essential that the speech research community explore building language comprehension systems that could be improved by the gradual accumulation of knowledge and language skills
  - e.g., comparing an ASR system against the listening-comprehension performance of children less than ten years of age
- Goals:
  - To help develop technologies that enable language comprehension
Spoken-Language Comprehension (2/3)
- It is clear such evaluations would emphasize the accurate detection of information-bearing elements in speech rather than basic word error rate
- Four key research topics need to be explored:
  - Partial understanding of spoken and written materials, with focused attention on information-bearing components
  - Sentence segmentation and named-entity extraction from given test passages
  - Information retrieval from the knowledge sources acquired in the learning phase
  - Representation and database organization of knowledge sources
Spoken-Language Comprehension (3/3)
- Collaboration between the speech and language processing communities is a key element to the potential success of such a program
- The outcomes of this research could provide a paradigm shift for building domain-specific language understanding systems and significantly affect the education and learning communities
Outline
- Introduction
- Significant Developments in Speech Recognition and Understanding
- Grand Challenges: Major Potential Programs of Research
- Improving Infrastructure for Future ASR Research
Improving Infrastructure for Future ASR Research
- Creation of high-quality annotated corpora
- Novel high-volume data sources
- Tools for collecting and processing large quantities of speech data
Creation of High-Quality Annotated Corpora (1/4)
- The single simplest, best way to improve a recognizer’s performance is to increase the amount of task-relevant training data from which its models are constructed
  - Needed to capture the tremendous variability inherent in speech
  - Important in increasing the facility of learning, understanding, and subsequently automatically recognizing a wide variety of languages
  - Critical for transcription within any given language, spoken-language machine translation, cross-language information retrieval, etc.
Creation of High-Quality Annotated Corpora (2/4)
- Well-labeled speech corpora have been the cornerstone on which today’s systems have been developed and evolved
  - The sine qua non for rigorous comparative system evaluations and competitive analyses
- About labeling:
  - Typically at the word level
  - Some annotation at a finer level (e.g., syllables, phones, features) is important
Creation of High-Quality Annotated Corpora (3/4)
- The single most popular speech database available from the Linguistic Data Consortium (LDC) is TIMIT
  - A very compact acoustic-phonetic database created by MIT and Texas Instruments
  - With subword (phonetic) transcriptions
- Other corpora: CallHome, Switchboard, Wall Street Journal, Buckeye
Creation of High-Quality Annotated Corpora (4/4)
- Above the word level
  - Databases need to be labeled to indicate aspects of emotion, dialog acts, and semantics
- For system training
  - We must design ASR systems that are tolerant of labeling errors
Novel High-Volume Data Sources
- Data sources:
  - Some material is of quite variable and often poor quality, such as user-posted material on YouTube
  - Better-quality audio materials are reflected in the diverse oral histories recorded by organizations such as StoryCorps
  - University course lectures, seminars, and similar material make up another rich source of less formal, more spontaneous, and natural speech
  - “Weak” transcripts (such as closed captioning and subtitles)
- Training on such diverse data should make systems increasingly capable and robust
Tools for Collecting and Processing Large Quantities of Speech Data
- New Web-based tools could be made available to collect, annotate, and then process substantial quantities of speech very cost-effectively in many languages
- New initiatives
  - Digital library technology aiming to scan huge amounts of text, e.g., the Million Book Project
  - The creation of large-scale speech corpora, e.g., the Million Hour Speech Corpus *
- These initiatives will also provide rich resources for strong research into the fundamental nature of speech and language itself

* J. K. Baker, “Spoken language digital libraries: The million hour speech project,” in Proc. Int. Conf. Universal Digital Libraries, Invited Paper, Alexandria, Egypt, 2006.
