Deep generative and discriminative models for speech recognition. The document outlines the history of speech recognition models, including early neural networks, hidden dynamic models, and deep belief networks. It describes how deep learning entered speech recognition around 2009 through a collaboration between Microsoft Research and academic researchers, which led to replacing generative models with discriminative deep neural networks and produced large reductions in error rates. The talk also outlines further innovations in deep learning for speech, including context-dependent models and better optimization techniques.
Zero shot learning through cross-modal transfer (Roelof Pieters)
A review of the paper "Zero-Shot Learning Through Cross-Modal Transfer" by Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, and Andrew Y. Ng, presented at KTH's Deep Learning reading group: www.csc.kth.se/cvap/cvg/rg/
Deep learning for natural language embeddings (Roelof Pieters)
This document discusses approaches to understanding natural language through deep learning techniques. It begins by outlining some of the challenges of language understanding, such as ambiguity and productivity. It then discusses using neural networks for natural language processing tasks like language modeling, sentiment analysis and machine translation. Recurrent and recursive neural networks are presented as approaches to model the compositionality of language. Different methods for obtaining word embeddings like Word2Vec, GloVe and earlier distributional semantic models are also summarized.
This document provides an overview of deep learning for information retrieval. It begins with background on the speaker and discusses how the data landscape is changing with increasing amounts of diverse data types. It then introduces neural networks and how deep learning can learn hierarchical representations from data. Key aspects of deep learning that help with natural language processing tasks like word embeddings and modeling compositionality are discussed. Several influential papers that advanced word embeddings and recursive neural networks are also summarized.
Visual-Semantic Embeddings: some thoughts on Language (Roelof Pieters)
Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks, has led to new approaches and new state-of-the-art results in many natural language processing tasks. One such exciting and recent trend can be seen in multimodal approaches that fuse techniques and models of natural language processing (NLP) with those of computer vision.
The talk aims to give an overview of the NLP part of this trend. It starts with a short overview of the challenges in creating deep networks for language, what makes for a "good" language model, and the specific requirements of semantic word spaces for multi-modal embeddings.
Deep Learning for NLP: An Introduction to Neural Word Embeddings (Roelof Pieters)
Deep learning uses neural networks with multiple layers to learn representations of data with multiple levels of abstraction. Word embeddings represent words as dense vectors in a vector space such that words with similar meanings have similar vectors. Recursive neural tensor networks learn compositional distributed representations of phrases and sentences by combining the vector representations of constituent words according to the parse-tree structure. This allows modeling the meaning of complex expressions based on the meanings of their parts and the rules for combining them.
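As a toy illustration of the "similar meanings, similar vectors" property, the sketch below compares made-up 3-dimensional vectors with cosine similarity; real embeddings from models such as word2vec or GloVe have hundreds of dimensions and are learned from large corpora.

```python
# Toy illustration of word-vector similarity (the vectors here are invented;
# real embeddings come from models such as word2vec or GloVe).
import numpy as np

embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```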
This document provides an outline for a tutorial on deep learning for natural language processing. It begins with an introduction to deep learning and its history, then discusses how neural methods have become prominent in natural language processing. The rest of the tutorial covers deep semantic models for text, recurrent neural networks for text generation, neural question answering models, and deep reinforcement learning for dialog systems.
This document provides an overview of deep learning techniques for natural language processing (NLP). It discusses some of the challenges in language understanding like ambiguity and productivity. It then covers traditional ML approaches to NLP problems and how deep learning improves on these approaches. Some key deep learning techniques discussed include word embeddings, recursive neural networks, and language models. Word embeddings allow words with similar meanings to have similar vector representations, improving tasks like sentiment analysis. Recursive neural networks can model hierarchical structures like sentences. Language models assign probabilities to word sequences.
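To make the last point concrete, here is a minimal, hedged sketch of a language model assigning probabilities to word sequences: a maximum-likelihood bigram model over a two-sentence toy corpus (no smoothing; real models use smoothing or neural networks).

```python
# Minimal bigram language model on a toy corpus (maximum-likelihood estimates,
# no smoothing), illustrating "assigning probabilities to word sequences".
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def sequence_probability(words):
    """Approximate P(w1..wn) as the product of bigram probabilities."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens[:-1], tokens[1:]):
        if bigrams[(prev, curr)] == 0:
            return 0.0  # unseen bigram; a real model would smooth instead
        prob *= bigrams[(prev, curr)] / unigrams[prev]
    return prob

print(sequence_probability("the cat sat on the mat".split()))   # seen sequence
print(sequence_probability("the mat sat on the cat".split()))   # unseen bigrams -> 0.0
```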
This document provides an overview of representation learning techniques for natural language processing (NLP). It begins with introductions to the speakers and the objectives of the workshop, which is to provide a deep dive into state-of-the-art text representation techniques. The workshop is divided into modules on word vectors, sentence/paragraph/document vectors, and character vectors. The document provides background on why text representation is important for NLP and discusses older techniques like one-hot encoding, bag-of-words, n-grams, and TF-IDF. It also introduces newer distributed representation techniques like word2vec's skip-gram and CBOW models, GloVe, and the use of neural networks for language modeling.
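As a small illustration of the older sparse representations mentioned above (bag-of-words counts and TF-IDF), here is a hedged scikit-learn sketch; the three documents are invented examples.

```python
# Sparse text representations: bag-of-words counts and TF-IDF weights.
# (scikit-learn 1.0+ assumed; the documents are made-up examples.)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "deep learning for speech recognition",
    "deep learning for natural language processing",
    "hidden markov models for speech",
]

bow = CountVectorizer()
counts = bow.fit_transform(docs)          # documents x vocabulary count matrix
print(bow.get_feature_names_out())
print(counts.toarray())

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)       # same shape, but TF-IDF weighted
print(weights.toarray().round(2))
```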
This document discusses deep learning applications for natural language processing (NLP). It begins by explaining what deep learning and deep neural networks are, and how they build upon older neural network models by adding multiple hidden layers. It then discusses why deep learning is now more viable due to factors like increased computational power from GPUs and improved training methods. The document outlines several NLP tasks that benefit from deep learning techniques, such as word embeddings, dependency parsing, and sentiment analysis. It also provides examples of tools used for deep learning NLP and discusses building a sentence classifier to identify funding sentences in news articles.
Sign Language Recognition based on Hand Symbols Classification (Triloki Gupta)
Communication has a great impact in every domain; it carries the meaning of thoughts and expressions, which motivates researchers to bridge this gap for every living being.
The objective of this project is to identify symbolic expressions from images so that the communication gap between a hearing person and a hearing-impaired person can be easily bridged.
GitHub link: https://github.com/TrilokiDA/Hand_Sign_Language
1. The document discusses various applications of deep learning algorithms for speaker identification and recognition, including convolutional deep belief networks (CDBN) and deep neural networks (DNN).
2. CDBN was shown to outperform traditional MFCC and raw features for audio classification tasks including speech and music recognition.
3. DNN approaches have demonstrated lower error rates than GMM-HMM models for speech recognition across multiple languages.
4. SIDEKIT is an open source Python toolkit that implements state-of-the-art methods for speaker identification, including GMM-HMM, and has the potential to incorporate DNN approaches; a minimal GMM-based illustration follows below.
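As a rough illustration of the GMM-based speaker modeling mentioned above (not using SIDEKIT), here is a minimal MFCC-plus-Gaussian-mixture speaker-identification sketch. It assumes librosa and scikit-learn are installed, one enrollment recording per speaker, and the file names are hypothetical.

```python
# Minimal GMM speaker-identification sketch: one GMM per speaker fit on MFCC
# frames; the test file is assigned to the speaker with the highest likelihood.
# This is a generic baseline, not the SIDEKIT toolkit; file names are invented.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    # 20 MFCCs per frame; transpose to (frames, coefficients)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T

def enroll(speaker_wavs):
    """Fit one diagonal-covariance GMM per speaker on that speaker's MFCC frames."""
    models = {}
    for name, path in speaker_wavs.items():
        gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200)
        gmm.fit(mfcc_features(path))
        models[name] = gmm
    return models

def identify(models, test_wav):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    feats = mfcc_features(test_wav)
    return max(models, key=lambda name: models[name].score(feats))

# Hypothetical file names, for illustration only:
# models = enroll({"alice": "alice_enroll.wav", "bob": "bob_enroll.wav"})
# print(identify(models, "unknown.wav"))
```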
ODSC East: Effective Transfer Learning for NLP (indico data)
Presented by indico co-founder Madison May at ODSC East.
Abstract: Transfer learning, the practice of applying knowledge gained on one machine learning task to aid the solution of a second task, has seen historic success in the field of computer vision. The output representations of generic image classification models trained on ImageNet have been leveraged to build models that detect the presence of custom objects in natural images. Image classification tasks that would typically require hundreds of thousands of images can be tackled with mere dozens of training examples per class thanks to the use of these pretrained representations. The field of natural language processing, however, has seen more limited gains from transfer learning, with most approaches limited to the use of pretrained word representations. In this talk, we explore parameter- and data-efficient mechanisms for transfer learning on text, and show practical improvements on real-world tasks. In addition, we demo the use of Enso, a newly open-sourced library designed to simplify benchmarking of transfer learning methods on a variety of target tasks. Enso provides tools for the fair comparison of varied feature representations and target task models as the amount of training data made available to the target model is incrementally increased.
The project was started with a single aim in mind: the design should be able to recognize a person's voice by analyzing the speech signal. The simulation is done in MATLAB. The design is based on using linear prediction filter coefficients (LPC) and principal component analysis (PCA, via princomp) for speech signal analysis. Sample collection is accomplished by using a microphone to record male/female speech. After the program is executed, the speech is analyzed by the analysis part of the MATLAB code, and the design should be able to identify and judge whether the recorded speech signal matches the desired output.
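The pipeline above is implemented in MATLAB (lpc and princomp). As a rough, hedged illustration of the same idea in Python, the sketch below frames a recording, extracts LPC coefficients per frame with librosa, and projects them with scikit-learn's PCA. The file name, frame sizes, LPC order, and component count are illustrative assumptions, not values from the original project.

```python
# Rough Python analogue of the LPC + PCA analysis described above
# (the original project uses MATLAB's lpc/princomp; the file name is hypothetical).
import numpy as np
import librosa
from sklearn.decomposition import PCA

y, sr = librosa.load("recorded_speech.wav", sr=8000)

# Frame the signal (30 ms frames, 10 ms hop) and compute LPC coefficients per frame.
frames = librosa.util.frame(y, frame_length=240, hop_length=80).T
lpc_order = 12
lpc_features = np.array([librosa.lpc(frame, order=lpc_order)[1:] for frame in frames])

# Project the LPC feature vectors onto their principal components, as princomp would.
pca = PCA(n_components=4)
reduced = pca.fit_transform(lpc_features)
print(reduced.shape, pca.explained_variance_ratio_)
```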
Deep Learning in practice: Speech recognition and beyond - Meetup (LINAGORA)
Here is the presentation from our Meetup of 27 September 2017, given by our colleague Abdelwahab HEBA: Deep Learning in practice: Speech recognition and beyond
Convolutional neural networks for sentiment classification (Yunchao He)
This document discusses various techniques for using convolutional neural networks for sentiment classification. It describes using word embeddings as network parameters that are learned during training or initialized from pre-trained models. It also discusses using sentence matrices and different types of convolutional and pooling layers. Specific CNN models discussed include using different channels, dynamic k-max pooling, semantic clustering, enriching word vectors, and multichannel variable-size convolution. References are provided for several papers on applying CNNs to sentiment classification.
This document provides a summary of Nicolae Duta's education, professional experience, research interests, publications, and skills. He has a Ph.D in Computer Science from Michigan State University and has worked as a researcher at Nuance Communications and BBN Technologies, focusing on natural language modeling and understanding. His research interests include pattern recognition techniques applied to vision, speech, language, biometrics, robotics and genetics. He has over 20 publications in journals and conferences and has developed several systems for tasks like object detection, face recognition, and medical image segmentation.
Convolutional neural networks (CNNs) have traditionally been used for computer vision tasks but recent work has applied them to language modeling as well. CNNs treat sequences of words as signals over time rather than independent units. They use convolution and pooling layers to identify important n-gram features. Results show CNNs can be effective for classification tasks like sentiment analysis but have had less success with sequence modeling tasks. Overall, CNNs provide an alternative to recurrent neural networks for certain natural language processing problems and help understand each model's strengths and weaknesses.
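As a concrete illustration of the convolution-and-pooling idea described above, here is a minimal, hedged PyTorch sketch: an embedding lookup, several 1-D convolutions acting as n-gram detectors, max-over-time pooling, and a linear classifier. The vocabulary size, dimensions, and random input batch are arbitrary placeholders.

```python
# Minimal CNN-for-text sketch: embeddings -> 1-D convolutions (n-gram detectors)
# -> max-over-time pooling -> linear classifier. All sizes are arbitrary.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # Each convolution slides over word positions; max pooling keeps the
        # strongest n-gram activation per filter ("max over time").
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

model = TextCNN()
logits = model(torch.randint(0, 10000, (8, 40)))  # batch of 8 sequences, length 40
print(logits.shape)  # torch.Size([8, 2])
```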
Recurrent networks and beyond by Tomas Mikolov (Bhaskar Mitra)
The document summarizes Tomas Mikolov's talk on recurrent neural networks and directions for future research. The key points are:
1) Recurrent networks have seen renewed success since 2010 due to simple tricks like gradient clipping that allow them to be trained more stably (a minimal clipping sketch follows this list). Structurally constrained recurrent networks (SCRNs) provide longer short-term memory than simple RNNs without complex architectures.
2) While RNNs have achieved strong performance on many tasks, they struggle with algorithmic patterns requiring memorization of sequences or counting. Stack augmented RNNs add structured memory to address such limitations.
3) To build truly intelligent machines, we need to focus on developing skills like communication and learning new tasks quickly from few examples.
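As promised after point 1, here is a minimal, hedged PyTorch sketch of the gradient-clipping trick: rescale the global gradient norm before each optimizer step. The toy RNN, random data, and clipping threshold are placeholders, not settings from the talk.

```python
# Gradient clipping during RNN training: rescale gradients whose global norm
# exceeds a threshold before the optimizer step (stabilizes training).
# The model, data, and max_norm value are placeholders for illustration.
import torch
import torch.nn as nn

model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(16, 20, 32)    # (batch, time, features)
targets = torch.randn(16, 20, 64)

for step in range(5):
    optimizer.zero_grad()
    outputs, _ = model(inputs)
    loss = nn.functional.mse_loss(outputs, targets)
    loss.backward()
    # The key trick: keep the gradient norm bounded to avoid exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    print(step, loss.item())
```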
The document discusses acoustic speech recognition techniques. It provides an introduction and historical survey, then describes the general model of ASR which involves feature extraction and hidden Markov models. It notes the development of early systems in the 1950s-1980s and more recent neural network approaches. The document advocates for expanding ASR systems to support local languages to improve human-machine interaction and information access for more people.
A critical insight into multi-languages speech emotion databases (journalBEEI)
With the increased interest in human-computer and human-human interaction, systems that deduce and identify emotional aspects of a speech signal have emerged as a hot research topic. Recent research is directed towards the development of automated and intelligent analysis of human utterances. Although numerous studies have addressed the design of systems, algorithms, and classifiers in this field, things are far from standardized yet. Considerable uncertainty remains with regard to aspects such as determining influential features, better-performing algorithms, the number of emotion classes, etc. Among the influencing factors, the uniqueness between speech databases, such as the data collection method, is accepted as significant among the research community. A speech emotion database is essentially a repository of varied human speech samples collected and sampled using a specified method. This paper reviews 34 speech emotion databases for their characteristics and specifications. Furthermore, critical insights into the imitational aspects of these databases are also highlighted.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Asian EFL Journal 2010 conference presentation (Takeshi Sato)
This document summarizes a study that compared the effectiveness of 2D and 3D images as visual glosses for learning the meanings of English spatial prepositions. 24 participants were randomly assigned to learn vocabulary using either a 2D or 3D dictionary. A pre-test and post-test assessed learning of physical and metaphorical meanings. Results showed no significant difference in learning between the 2D and 3D image conditions. The researchers suggest that 3D images may distract from transforming schematic images, and deliberate use of multimedia is needed in CALL.
This document discusses automatic speech recognition, including defining the task, the main challenges, and common approaches. The key difficulties identified are digitizing speech, separating speech from noise, dealing with variability between individuals, identifying phonemes, disambiguating homophones, handling continuous speech, and interpreting prosodic features. Common approaches are template matching, rule-based systems, and statistical/machine learning methods like hidden Markov models. Remaining challenges include robustness, adaptability, language modeling, and handling spontaneous speech.
This document summarizes research on speaker recognition in noisy environments. It begins with an introduction discussing the goals of speaker identification and verification and their applications. It then provides details on the basic components of a speaker recognition system, including feature extraction and classification. The document focuses on methods for modeling noise, including generating multiple noisy training conditions and focusing matching on unaffected features. Experimental results are shown through snapshots of a prototype system interface that allows adding and recognizing speakers based on voice samples. The system is able to identify speakers in the presence of noise by comparing features to stored codebooks generated during training.
The document describes a lecture on deep learning for information processing and artificial intelligence given by Li Deng at Tianjin University in China from July 2-5, 2013. The lecture covered the basics of deep learning, including restricted Boltzmann machines, deep belief networks, deep neural networks, and applications to speech recognition, language modeling, and other domains. It also provided references to related tutorials, books, and research groups working on deep learning techniques.
This document outlines Koichi Shinoda's background and research interests in multimedia information processing, including statistical speech recognition and video information retrieval. It provides an overview of his educational background, work experience developing speech recognition systems at NEC Corporation, and current research areas focusing on statistical pattern recognition applied to speech and video data. Key topics covered include hidden Markov models, speaker adaptation techniques, video scene extraction, and multimodal interfaces.
This document discusses the rise of artificial intelligence through deep learning. It begins with an introduction of the author, Rehan Guha, and his background in data science and machine learning. It then provides definitions of artificial intelligence, machine learning, and deep learning. Deep learning is a type of machine learning that uses neural networks inspired by the human brain. The document reviews the history and evolution of deep learning technologies. It also discusses recent advancements in deep learning applications such as image recognition, language processing, and adversarial examples.
Research Developments and Directions in Speech Recognition and ... (butest)
The document discusses research developments and directions in speech recognition and understanding. It outlines significant developments including improvements to infrastructure like hardware, corpora, and evaluation benchmarks. It also discusses advances in knowledge representation like acoustic features and graph representations, as well as models and algorithms like hidden Markov models and discriminative training. The document proposes six potential "grand challenges" for future research, including creating systems robust to everyday audio variability, rapidly developing speech technologies for emerging languages with limited resources, and enabling spoken language comprehension.
Neural based context representation learning for dialog act classification (JEE HYUN PARK)
The document presents a neural network model for dialog act classification that incorporates context representations. It uses a CNN to represent each utterance, applies an internal attention mechanism, and models context with RNNs. As baselines, it uses a single-utterance CNN and concatenation of utterances. Results show RNNs better learn context representations and attention mechanisms improve performance, though the optimal attention placement depends on the dataset. The best-performing models outperform the previous state-of-the-art on benchmark datasets.
Deep Learning: Changing the Playing Field of Artificial Intelligence - MaRS G... (MaRS Discovery District)
Deep learning is changing the field of artificial intelligence and revolutionizing our online experience, with applications including speech and image recognition. Information and communications technology giants such as Google, Facebook, IBM and Baidu, among others, are rapidly deploying deep learning into new products and services.
Behind all of the present-day excitement about deep learning are years of high risk and hard work by a small group of eminent computer scientists and theorists connected through the Canadian Institute for Advanced Research (CIFAR).
Towards Machine Comprehension of Spoken Content (NVIDIA Taiwan)
Hung-yi Lee gave a presentation on developing machine comprehension of spoken content. He discussed using deep learning for speech recognition, spoken content retrieval, key term extraction, summarization, question answering, and organizing information. The goal is for machines to understand spoken audio data by recognizing speech, extracting useful information, and interacting with users to provide relevant results. Several challenges were mentioned, such as the lack of annotated training data for many languages. Preliminary research on learning directly from audio without transcription was also presented.
“Automatically learning multiple levels of representations of the underlying distribution of the data to be modelled”
Deep learning algorithms have shown superior learning and classification performance in areas such as transfer learning, speech and handwritten character recognition, and face recognition, among others.
(I have referred to many articles and experimental results provided by Stanford University.)
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto... (Tadahiro Taniguchi)
This document summarizes a talk given by Tadahiro Taniguchi on symbol emergence in robotics through language acquisition via real-world sensorimotor information. The talk discusses several lexical acquisition tasks performed by robots, including direct phoneme and word discovery from speech signals, simultaneous acquisition of word units and multimodal categories, and online spatial concept acquisition. It outlines Taniguchi's research interests and background, and describes computational models and experiments for language acquisition through self-organizational learning based on real-world sensorimotor information.
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, GoogleISPMAIndia
We begin by presenting the recent advances in the area of artificial intelligence, and the high level ideas underlying the progressively narrower domains of machine learning, deep learning, and foundation models, which have emerged over time as dominant paradigms for artificial intelligence. We describe the tremendous progress of these models on problems ranging from understanding, prediction and creativity on one hand, and open technical challenges like safety, fairness and transparency on the other hand. These challenges are further amplified as we seek to advance Inclusive AI to tackle problems for over a billion human beings in the context of India and the Global South. We present our work on multilingual models to democratize information access in a diverse set of Indian languages, on healthcare in environments where we lack data in digital form to begin with, and on analysis of satellite imagery to help transform agriculture and improve the lives of farmers. Through these examples, we hope to convey the excitement of the potential of AI to make a difference to the world, and also a fascinating set of open problems to tackle.
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N..." (summersocialwebshop)
The document discusses extracting network data from text documents. It outlines extracting named entities like people, places, events as nodes and linking them based on proximity, syntax, and statistics to build networks. Examples are provided of networks constructed from news articles about Sudan showing prominent individuals and organizations and how their centrality changes over time. The goal is to understand socio-technical networks and their co-evolution with knowledge and structure from large-scale text data.
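As a hedged illustration of the node-and-link construction described above, the sketch below builds a co-occurrence network with networkx from pre-extracted entity lists (entity extraction itself is not shown, and the entity lists are invented) and reports degree centrality.

```python
# Building a simple co-occurrence network from pre-extracted entities.
# Entity extraction (NER) is not shown; the entity lists below are invented.
from itertools import combinations
import networkx as nx

entities_per_document = [
    ["Sudan", "United Nations", "Khartoum"],
    ["Sudan", "African Union", "Khartoum"],
    ["United Nations", "African Union"],
]

graph = nx.Graph()
for entities in entities_per_document:
    for a, b in combinations(set(entities), 2):
        # Link entities appearing in the same document; weight counts co-occurrences.
        weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
        graph.add_edge(a, b, weight=weight)

# Most central entities first (a rough proxy for prominence in the corpus).
print(sorted(nx.degree_centrality(graph).items(), key=lambda kv: -kv[1]))
```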
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
Deep Learning & NLP: Graphs to the Rescue! (Roelof Pieters)
This document provides an overview of deep learning and natural language processing techniques. It begins with a history of machine learning and how deep learning advanced beyond early neural networks using methods like backpropagation. Deep learning methods like convolutional neural networks and word embeddings are discussed in the context of natural language processing tasks. Finally, the document proposes some graph-based approaches to combining deep learning with NLP, such as encoding language structures in graphs or using finite state graphs trained with genetic algorithms.
This document summarizes the agenda and key topics for a CIS 890 project final presentation on topics modelling with LDA. The presentation will cover LDA modelling, HMMLDA modelling, LDA with collocations modelling, and experimental results on the NIPS collection. It will discuss topic modelling approaches like LDA, discriminative vs generative methods, and limitations of bag-of-words assumptions.
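As a small, hedged illustration of plain bag-of-words LDA as discussed above, the gensim sketch below fits two topics on a made-up four-document corpus; it shows only the mechanics, not a meaningful model.

```python
# Plain bag-of-words LDA with gensim. The toy corpus below is invented and far
# too small for meaningful topics; it only demonstrates the mechanics.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["speech", "recognition", "acoustic", "model"],
    ["hidden", "markov", "model", "speech"],
    ["topic", "model", "document", "word"],
    ["word", "document", "corpus", "topic"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)
for topic_id, topic in lda.print_topics(num_words=4):
    print(topic_id, topic)
```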
The document summarizes key topics from ICASSP 2022, including general trends in speech and audio processing, self-supervised and contrastive learning approaches, security applications, and topics related to tasks like multilingualism and keyword spotting. Some of the main models and techniques discussed are Wav2vec, HuBERT, contrastive learning using Conformers, intermediate layer supervision in self-supervised learning, and anonymization of speech data for privacy.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of website design and development with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out the checklist at Pixlogix and start your journey towards a captivating online presence today.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
LiDeng-BerlinOct2015-ASR-GenDisc-4by3.pptx
1. Deep Generative & Discriminative
Models for Speech Recognition
Li Deng
Deep Learning Technology Center
Microsoft Research, Redmond, USA
Machine Learning Conference, Berlin, Oct 5, 2015
Thanks go to many colleagues at DLTC/MSR, collaborating universities,
and at Microsoft’s engineering groups
2. Outline
2
• Introduction
• Part I: Some history
• How deep learning entered speech recognition (2009-2010)
• Generative & discriminative models prior to 2009
• Part II: State of the art
• Dominated by deep discriminative models
• Deep neural nets (DNN) & many of the variants
• Part III: Prospects
• Integrating deep generative/discriminative models
• Deep multimodal and semantic modeling
3. Recent Books on the Topic
3
L. Deng and D. Yu (2014). "Deep Learning: Methods and Applications," NOW Publishers.
http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf
D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep Learning Approach," Springer.
A. Graves (2012). "Supervised Sequence Labeling with Recurrent Neural Nets," Springer.
J. Li, L. Deng, R. Haeb-Umbach, Y. Gong (2015). "Robust Automatic Speech Recognition: A Bridge to Practical Applications," Academic Press.
5. Deep Discriminative NN vs. Deep Generative Models
• Structure: DNN: graphical, info flow bottom-up / Generative: graphical, info flow top-down
• Incorporating constraints & domain knowledge: DNN: harder, less fine-grained / Generative: easier, more fine-grained
• Semi/unsupervised learning: DNN: hard or impossible / Generative: easier, at least possible
• Interpretation: DNN: harder / Generative: easy (a generative "story" about data and hidden variables)
• Representation: DNN: distributed / Generative: mostly localist, but can be distributed
• Inference/decoding: DNN: easy / Generative: harder (but note recent progress)
• Scalability/compute: DNN: easier (regular compute/GPUs) / Generative: harder (but note recent progress)
• Incorporating uncertainty: DNN: hard / Generative: easy
• Empirical goal: DNN: classification, feature learning, … / Generative: classification (via Bayes rule), latent-variable inference, …
• Terminology: DNN: neurons, activation/gate functions, weights, … / Generative: random variables, stochastic "neurons", potential functions, parameters, …
• Learning algorithm: DNN: a single, unchallenged algorithm (BackProp) / Generative: a major focus of open research, many algorithms, and more to come
• Evaluation: DNN: on a black-box score (end performance) / Generative: on almost every intermediate quantity
• Implementation: DNN: hard (but increasingly easier) / Generative: standardized, but insights needed
• Experiments: DNN: massive, real data / Generative: modest, often simulated data
• Parameterization: DNN: dense matrices / Generative: sparse (often PDFs), but can be dense
(A toy code contrast between the two modeling styles follows below.)
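To make the "Empirical goal" row concrete, here is a small, hypothetical numpy sketch (mine, not from the talk) of the two styles on toy 2-D data: a generative classifier that fits a Gaussian per class and classifies via Bayes rule, next to a discriminative logistic-regression classifier trained by gradient descent, a one-layer stand-in for a BackProp-trained neural net.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two classes with different means (stand-in for two speech classes).
X0 = rng.normal(loc=[-1.0, 0.0], scale=0.7, size=(200, 2))
X1 = rng.normal(loc=[+1.0, 0.5], scale=0.7, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# --- Generative: fit p(x|y) as a Gaussian per class, classify via Bayes rule ---
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
covs = np.array([np.cov(X[y == c].T) for c in (0, 1)])
priors = np.array([(y == c).mean() for c in (0, 1)])

def log_gauss(x, mu, cov):
    # log density of a 2-D Gaussian, evaluated row-wise
    d = x - mu
    inv = np.linalg.inv(cov)
    return -0.5 * (d @ inv * d).sum(axis=1) - 0.5 * np.log(np.linalg.det(cov)) - np.log(2 * np.pi)

log_post = np.stack([log_gauss(X, means[c], covs[c]) + np.log(priors[c]) for c in (0, 1)], axis=1)
gen_pred = log_post.argmax(axis=1)

# --- Discriminative: logistic regression p(y|x), trained by gradient descent ---
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # predicted p(y=1|x)
    grad_w, grad_b = X.T @ (p - y) / len(y), (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b
disc_pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)

print("generative accuracy:", (gen_pred == y).mean())
print("discriminative accuracy:", (disc_pred == y).mean())
```

On such toy data the two reach similar accuracy; the table's point is that they differ in what else they can do (interpretation, uncertainty, semi-supervised learning, decoding cost).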
6. Outline
6
• Introduction
• Part I: Some history
• How deep learning entered speech recognition (2009-2010)
• Generative & discriminative models prior to 2009
• Part II: State of the art
• Dominated by deep discriminative models
• Deep neural nets (DNN) & many of the variants
• Part III: Prospects
• Integrating deep generative/discriminative models
• Deep multimodal and semantic modeling
7. (Shallow) Neural Networks for Speech Recognition
(prior to the rise of deep learning 2009-2010)
Temporal & Time-Delay (1-D Convolutional) Neural Nets
• Atlas, Homma, and Marks, “An Artificial Neural Network for Spatio-Temporal Bipolar Patterns,
Application to Phoneme Classification,” NIPS, 1988.
• Waibel, Hanazawa, Hinton, Shikano, Lang. “Phoneme recognition using time-delay neural
networks.” IEEE Transactions on Acoustics, Speech and Signal Processing, 1989.
Hybrid Neural Nets-HMM
• Morgan and Bourlard. “Continuous speech recognition using MLP with HMMs,” ICASSP, 1990.
Recurrent Neural Nets
• Bengio. “Artificial Neural Networks and their Application to Speech/Sequence Recognition”, Ph.D.
thesis, 1991.
• Robinson. “A real-time recurrent error propagation network word recognition system,” ICASSP
1992.
Neural-Net Nonlinear Prediction
• Deng, Hassanein, Elmasry. “Analysis of correlation structure for a neural predictive model with
applications to speech recognition,” Neural Networks, vol. 7, No. 2, 1994.
Bidirectional Recurrent Neural Nets
• Schuster, Paliwal. "Bidirectional recurrent neural networks," IEEE Trans. Signal Processing, 1997.
Neural-Net TANDEM
• Hermansky, Ellis, Sharma. "Tandem connectionist feature extraction for conventional HMM
systems." ICASSP 2000.
• Morgan, Zhu, Stolcke, Sonmez, Sivadas, Shinozaki, Ostendorf, Jain, Hermansky, Ellis, Doddington,
Chen, Cetin, Bourlard, Athineos, “Pushing the envelope - aside [speech recognition],” IEEE Signal
Processing Magazine, vol. 22, no. 5, 2005.
DARPA EARS Program 2001-2004: Novel Approach I (Novel Approach II: Deep Generative Model)
Bottle-neck Features Extracted from Neural-Nets
• Grezl, Karafiat, Kontar & Cernocky. “Probabilistic and bottle-neck features for LVCSR of meetings,”
ICASSP, 2007.
9. Deep Generative Models for Speech Recognition
(prior to the rise of deep learning)
Segment & Nonstationary-State Models
• Digalakis, Rohlicek, Ostendorf. “ML estimation of a stochastic linear system with the EM alg &
application to speech recognition,” IEEE T-SAP, 1993
• Deng, Aksmanovic, Sun, Wu, “Speech recognition using HMM with polynomial regression functions
as nonstationary states,” IEEE T-SAP, 1994.
Hidden Dynamic Models (HDM)
• Deng, Ramsay, Sun. “Production models as a structural basis for automatic speech recognition,”
Speech Communication, vol. 22, pp. 93–111, 1997.
• Bridle et al. “An investigation of segmental hidden dynamic models of speech coarticulation for
speech recognition,” Final Report Workshop on Language Engineering, Johns Hopkins U, 1998.
• Picone et al. “Initial evaluation of hidden dynamic models on conversational speech,” ICASSP, 1999.
• Deng and Ma. “Spontaneous speech recognition using a statistical co-articulatory model for the
vocal tract resonance dynamics,” JASA, 2000.
Structured Hidden Trajectory Models (HTM)
• Zhou, et al. “Coarticulation modeling by embedding a target-directed hidden trajectory model into
HMM,” ICASSP, 2003. DARPA EARS Program 2001-2004: Novel Approach II
• Deng, Yu, Acero. “Structured speech modeling,” IEEE Trans. on Audio, Speech and Language
Processing, vol. 14, no. 5, 2006.
Switching Nonlinear State-Space Models
• Deng. “Switching Dynamic System Models for Speech Articulation and Acoustics,” in Mathematical Foundations of
Speech and Language Processing, vol. 138, pp. 115 - 134, Springer, 2003.
• Lee et al. “A Multimodal Variational Approach to Learning and Inference in Switching State Space
Models,” ICASSP, 2004.
10. Bridle, Deng, Picone, Richards, Ma, Kamm, Schuster, Pike, Reagan. Final Report for Workshop on Language Engineering,
Johns Hopkins University, 1998.
(Figure: two variants of the hidden dynamic model, a deterministic HDM and a statistical HDM.)
11. HDM: A Deep Generative Model of Speech Production/Perception
--- Perception as “variational inference” (1997-2007)
(Figure: the SPEAKER's message drives articulation targets and articulation, which produce distortion-free acoustics and then distorted acoustics; distortion factors feed back to articulation, and the resulting speech acoustics are what the listener/recognizer observes. A toy simulation sketch follows below.)
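As a rough, illustrative sketch only (my own toy code, with made-up targets and constants, not the model of the cited papers), the core generative idea can be written as target-directed hidden dynamics per phone segment followed by a static nonlinear mapping to acoustics; recognition then amounts to inferring the hidden trajectory and its targets from the distorted observations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy hidden-dynamic generator: one hidden target per phone segment, first-order
# dynamics toward the target, then a nonlinear "articulation-to-acoustics" mapping
# plus noise. All names and values here are illustrative assumptions.
targets = [0.2, 1.0, -0.5]     # hidden target per segment
durations = [20, 30, 25]       # frames per segment
a = 0.9                        # smoothness of the hidden dynamics

def acoustic_map(z):
    # stand-in for the nonlinear articulation-to-acoustics mapping
    return np.tanh(2.0 * z)

z, hidden, acoustics = 0.0, [], []
for target, dur in zip(targets, durations):
    for _ in range(dur):
        z = a * z + (1.0 - a) * target                            # state moves toward target
        hidden.append(z)
        acoustics.append(acoustic_map(z) + 0.05 * rng.normal())   # distorted observation

print(len(acoustics), "frames generated;",
      "recognition = inferring the targets/segments from the noisy acoustics")
```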
13. Surprisingly Good Inference Results for
Continuous Hidden States
• By-product: accurate tracking of the dynamics of resonances (formants) in the vocal tract (TIMIT & SWBD).
• The best formant tracker in speech analysis at the time; used as the basis for a formant database serving as "ground truth".
• We thought we had solved the ASR problem, except that decoding was "intractable".
Deng & Huang, Challenges in Adopting Speech Recognition, Communications of the ACM, vol. 47, pp. 69-75, 2004.
Deng, Cui, Pruvenok, Huang, Momen, Chen, Alwan, A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing, ICASSP, 2006.
14. Another Deep Generative Model
(developed outside speech)
• Sigmoid belief nets & the wake-sleep algorithm (1992)
• Deep belief nets (DBN, 2006): the start of deep learning
• A totally non-obvious result: stacking many RBMs (undirected) gives not a Deep Boltzmann Machine (DBM, undirected) but a DBN (a directed, generative model)
• Excellent at generating images & at speech synthesis
• A similar type of deep generative model to the HDM, but simpler: no temporal dynamics, and with a very different parameterization
• Most intriguing property of the DBN: inference is easy (i.e., no need for approximate variational Bayes), thanks to the "restriction" of connections in the RBM
• Pros/cons analysis → Hinton coming to MSR, 2009 (a toy RBM/DBN sketch follows below)
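For readers who have not seen an RBM, the following is a minimal, illustrative numpy sketch (mine, not code from the talk) of contrastive-divergence (CD-1) training of one binary RBM and of greedily stacking two of them, which is the layer-wise construction that yields a DBN.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=20, lr=0.05):
    """Train one binary RBM with CD-1; returns (W, visible bias b, hidden bias c)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W + c)                     # P(h=1 | v0): a single matrix multiply
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(h0 @ W.T + b)                    # one reconstruction step
        ph1 = sigmoid(v1 @ W + c)
        # CD-1 gradient estimate: positive phase minus negative phase
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy binary "data" (binarized spectrogram patches would play this role in speech).
data = (rng.random((500, 30)) < 0.3).astype(float)

# Greedy layer-wise stacking: train RBM 1, push the data through it, train RBM 2.
W1, b1, c1 = train_rbm(data, n_hidden=20)
hidden1 = sigmoid(data @ W1 + c1)
W2, b2, c2 = train_rbm(hidden1, n_hidden=10)
print("DBN built from two stacked RBMs:", W1.shape, W2.shape)
```

The "restriction" mentioned above (no visible-visible or hidden-hidden connections) is what makes p(h|v) factorize, so the inference step inside the loop is just that single matrix multiply.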
15. This is a very different kind of deep generative model
(Deng et al., 2006; Deng & Yu, 2007)
(Mohamed, Dahl, Hinton, 2009, 2012)
(after adding Backprop to the generative DBN)
16.
• Elegant model formulation & knowledge
incorporation
• Strong empirical results: 96% TIMIT accuracy
with Nbest=1001; 75.2% lattice decoding w.
monophones; fast approx. training
• Still very expensive for decoding; could not
ship (very frustrating!)
Error Analysis
-- DBN/DNN made many new errors on short, undershot vowels
-- 11 input frames contain too much "noise"
17. Academic-Industrial Collaboration (2009,2010)
• I invited Geoff Hinton to work with me at MSR, Redmond
• Well-timed academic-industrial collaboration:
– Speech industry searching for new solutions while
“principled” deep generative approaches could not deliver
– Academia developed deep learning tools (e.g. DBN 2006)
looking for applications
– Adding Backprop to deep generative models (DBN) → DBN-DNN
(hybrid generative/discriminative)
– Advent of GPU computing (Nvidia CUDA library released
2007-2008)
– Big training data in speech recognition were already available
17
18. 18
Mohamed, Dahl, Hinton, Deep belief networks for phone recognition, NIPS 2009 Workshop on Deep Learning, 2009
Yu, Deng, Wang, Learning in the Deep-Structured Conditional Random Fields, NIPS 2009 Workshop on Deep Learning, 2009
…, …, …
Invitee 1: "Give me one week to decide …"
"Not worth my time to fly to Vancouver for this…"
19. Expanding DNN at Industry Scale
• Scale DNN’s success to large speech tasks (2010-2011)
– Grew output neurons from context-independent phone states (100-200) to context-dependent ones (1k-30k) → CD-DNN-HMM, first for Bing Voice Search and then for SWBD tasks (a toy model sketch follows after the references below)
– Motivated initially by saving huge MSFT investment in the speech decoder software
infrastructure
– CD-DNN-HMM also gave much higher accuracy than CI-DNN-HMM
– Discovered that, with large training data, Backprop works well without DBN pre-training, once we understood why gradients often vanish (patent filed for "discriminative pre-training", 2011)
• Engineering for building large speech systems:
– Combined expertise in DNN (esp. with GPU implementation) and speech recognition
– Collaborations among MSRR, MSRA, academic researchers
• Yu, Deng, Dahl, Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition, in NIPS Workshop on Deep Learning, 2010.
• Dahl, Yu, Deng, Acero, Large Vocabulary Continuous Speech Recognition With Context-Dependent DBN-HMMS, in Proc. ICASSP, 2011.
• Dahl, Yu, Deng, Acero, Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, in IEEE Transactions on Audio, Speech, and Language
Processing (2013 IEEE SPS Best Paper Award) , vol. 20, no. 1, pp. 30-42, January 2012.
• Seide, Li, Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks", Interspeech 2011, pp. 437-440.
• Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition, in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, November 2012.
• Sainath, T., Kingsbury, B., Ramabhadran, B., Novak, P., and Mohamed, A. “Making deep belief networks effective for large vocabulary continuous speech recognition,” Proc.
ASRU, 2011.
• Sainath, T., Kingsbury, B., Soltau, H., and Ramabhadran, B. “Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks,” IEEE
Transactions on Audio, Speech, and Language Processing, vol.21, no.11, pp.2267-2276, Nov. 2013.
• Jaitly, N., Nguyen, P., Senior, A., and Vanhoucke, V. “Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition,” Proc. Interspeech, 2012.
19
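The acoustic model at the heart of the CD-DNN-HMM is architecturally simple. Below is a hypothetical PyTorch sketch whose sizes are chosen to echo the "7 layers x 2048" and "1k-30k" context-dependent states quoted in this deck; the 9000-senone output and the 11-frame, 40-dimensional input window are my illustrative assumptions, not the exact production configuration.

```python
import torch
import torch.nn as nn

class CDDNN(nn.Module):
    """Feed-forward acoustic model: stacked frames -> senone (CD-state) posteriors."""
    def __init__(self, feat_dim=40, context=11, hidden=2048,
                 n_layers=7, n_senones=9000):
        super().__init__()
        layers, in_dim = [], feat_dim * context      # e.g. 11 frames of 40-d filterbanks
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        layers += [nn.Linear(in_dim, n_senones)]     # context-dependent output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):                            # x: (batch, feat_dim * context)
        return self.net(x)                           # logits over senones

model = CDDNN()
frames = torch.randn(8, 40 * 11)                     # a mini-batch of stacked frames
log_post = torch.log_softmax(model(frames), dim=-1)
# In a hybrid system these posteriors are divided by state priors to give scaled
# likelihoods that replace the GMM likelihoods inside the existing HMM decoder.
print(log_post.shape)                                # torch.Size([8, 9000])
```

The design point emphasized in the slide is that only the output layer changes, from 100-200 context-independent phone states to thousands of senones, so the existing decoder software and its engineering investment stay in place.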
21. DNN vs. Pre-DNN Prior Art
Table: TIMIT phone recognition (3 hours of training); ~10% relative improvement
• Pre-DNN: Deep Generative Model: 24.8%
• DNN (5 layers x 2048): 23.4%
Table: Voice Search SER (24-48 hours of training); ~20% relative improvement
• Pre-DNN: GMM-HMM with MPE: 36.2%
• DNN (5 layers x 2048): 30.1%
Table: SwitchBoard WER (309 hours of training); ~30% relative improvement
• Pre-DNN: GMM-HMM with BMMI: 23.6%
• DNN (7 layers x 2048): 15.8%
For DNN, the more data, the better!
22. Scientists See Promise in Deep-Learning Programs
John Markoff
November 23, 2012
Rick Rashid in Tianjin, China, October, 25, 2012
Deep learning
technology enabled
speech-to-speech
translation
A voice recognition program translated a speech given by
Richard F. Rashid, Microsoft’s top scientist, into Mandarin Chinese.
23. CD-DNN-HMM
Dahl, Yu, Deng, and Acero, “Context-Dependent Pre-
trained Deep Neural Networks for Large Vocabulary
Speech Recognition,” IEEE Trans. ASLP, Jan. 2012 (also
ICASSP 2011)
Seide et al, Interspeech, 2011.
After no improvement for 10+ years
by the research community…
…MSR reduced error from ~23% to
<13% (and under 7% for Rick
Rashid’s S2S demo in 2012)!
27. Outline
27
• Introduction
• Part I: Some history
• How deep learning entered speech recognition (2009-2010)
• Generative & discriminative models prior to 2009
• Part II: State of the art w. further innovations
• Dominated by deep discriminative models
• Deep neural nets (DNN) & many of the variants
• Part III: Prospects
• Integrating deep generative/discriminative models
• Deep multimodal and semantic modeling
28. Innovation: Better Optimization
28
• Sequence discriminative training for DNN:
- Mohamed, Yu, Deng: “Investigation of full-sequence training of deep
belief networks for speech recognition,” Interspeech, 2010.
- Kingsbury, Sainath, Soltau. “Scalable minimum Bayes risk training of
DNN acoustic models using distributed hessian-free optimization,”
Interspeech, 2012.
- Su, Li, Yu, Seide. “Error back propagation for sequence training of CD
deep networks for conversational speech transcription,” ICASSP, 2013.
- Vesely, Ghoshal, Burget, Povey. “Sequence-discriminative training of deep neural networks,” Interspeech, 2013.
• Distributed asynchronous SGD (a toy async-SGD sketch follows this slide's reference list)
- Dean, Corrado,…Senior, Ng. “Large Scale Distributed Deep Networks,”
NIPS, 2012.
- Sak, Vinyals, Heigold, Senior, McDermott, Monga, Mao. “Sequence
Discriminative Distributed Training of Long Short-Term Memory
Recurrent Neural Networks,” Interspeech,2014.
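As a toy illustration of the second bullet (my own sketch, not DistBelief or any MSR system), here is a minimal asynchronous-SGD setup in which several Python threads update one shared parameter vector on their own data shards without synchronizing on every step.

```python
import threading
import numpy as np

# Toy asynchronous SGD in the spirit of parameter-server / Hogwild-style training:
# workers compute gradients on their own shards and update one shared parameter
# vector without waiting for each other. Purely illustrative.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(4000, 2))
y = X @ true_w + 0.01 * rng.normal(size=4000)

w = np.zeros(2)                  # shared parameters (the "parameter server")
lock = threading.Lock()
lr = 0.01

def worker(shard_X, shard_y, seed, steps=200, batch=32):
    global w
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = local_rng.integers(0, len(shard_y), size=batch)
        xb, yb = shard_X[idx], shard_y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / batch     # gradient of mean squared error
        with lock:                                  # drop the lock for Hogwild-style updates
            w -= lr * grad
        # no barrier between workers: updates are asynchronous and may use stale w

shards = np.array_split(np.arange(len(y)), 4)
threads = [threading.Thread(target=worker, args=(X[s], y[s], k)) for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned w:", w, "true w:", true_w)
```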
29. Innovation: Towards Raw Inputs
29
• Bye-Bye MFCCs (no more cosine transform, Mel-scaling?)
- Deng, Seltzer, Yu, Acero, Mohamed, Hinton. “Binary coding of speech spectrograms using a deep
auto-encoder,” Interspeech, 2010.
- Mohamed, Hinton, Penn. “Understanding how deep belief networks perform acoustic modeling,”
ICASSP, 2012.
- Li, Yu, Huang, Gong, “Improving wideband speech recognition using mixed-bandwidth training data
in CD-DNN-HMM” SLT, 2012
- Deng, J. Li, Huang, Yao, Yu, Seide, Seltzer, Zweig, He, Williams, Gong, Acero. “Recent advances in
deep learning for speech research at Microsoft,” ICASSP, 2013.
- Sainath, Kingsbury, Mohamed, Ramabhadran. “Learning filter banks within a deep neural network
framework,” ASRU, 2013.
• Bye-Bye Fourier transforms? (a toy learned raw-waveform front end is sketched after this list)
- Jaitly and Hinton. “Learning a better representation of speech sound waves using RBMs,” ICASSP,
2011.
- Tuske, Golik, Schluter, Ney. “Acoustic modeling with deep neural networks using raw time signal for
LVCSR,” Interspeech, 2014.
- Golik et al, “Convolutional NNs for acoustic modeling of raw time signals in LVCSR,” Interspeech,
2015.
- Sainath et al. “Learning the Speech Front-End with Raw Waveform CLDNNs,” Interspeech, 2015
• DNN as hierarchical nonlinear feature extractors:
- Seide, Li, Chen, Yu. “Feature engineering in context-dependent deep neural networks for
conversational speech transcription, ASRU, 2011.
- Yu, Seltzer, Li, Huang, Seide. “Feature learning in deep neural networks - Studies on speech
recognition tasks,” ICLR, 2013.
- Yan, Huo, Xu. “A scalable approach to using DNN-derived in GMM-HMM based acoustic modeling in
LVCSR,” Interspeech, 2013.
- Deng, Chen. “Sequence classification using high-level features extracted from deep neural
networks,” ICASSP, 2014.
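To illustrate what "learning the front end" means in code, here is a hypothetical PyTorch sketch (not any of the cited systems) of a small 1-D convolutional front end that consumes raw waveform samples directly, in place of hand-designed MFCC or filterbank features; the window and hop sizes are my assumptions chosen to mimic a conventional 25 ms / 10 ms analysis at 16 kHz.

```python
import torch
import torch.nn as nn

class RawWaveformFrontEnd(nn.Module):
    """Learned front end: raw samples -> frame-level feature vectors."""
    def __init__(self, n_filters=40, win=400, hop=160):
        super().__init__()
        # win=400, hop=160 samples: roughly 25 ms windows with a 10 ms shift at 16 kHz,
        # mirroring a conventional filterbank analysis but with learned filters.
        self.conv = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop)

    def forward(self, wave):                   # wave: (batch, samples)
        x = self.conv(wave.unsqueeze(1))       # (batch, n_filters, frames)
        x = torch.log1p(torch.abs(x))          # crude compression, like log-energy
        return x.transpose(1, 2)               # (batch, frames, n_filters)

frontend = RawWaveformFrontEnd()
wave = torch.randn(4, 16000)                   # four 1-second utterances at 16 kHz
feats = frontend(wave)
print(feats.shape)                             # torch.Size([4, 98, 40])
```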
34. Innovation: Ensemble Deep Learning
34
• Ensembles of RNN/LSTM, DNN, & Conv Nets (CNN) give
huge gains (state of the art):
• T. Sainath, O. Vinyals, A. Senior, H. Sak. “Convolutional, Long Short-Term Memory, Fully
Connected Deep Neural Networks,” ICASSP 2015.
• L. Deng and John Platt, Ensemble Deep Learning for Speech Recognition, Interspeech,
2014.
• G. Saon, H. Kuo, S. Rennie, M. Picheny. “The IBM 2015 English conversational telephone
speech recognition system,” arXiv, May 2015. (8% WER on SWB-309h)
35. Innovation: Better learning objectives/methods
35
• Use of CTC as a new objective in RNN/LSTM with end-to-end learning drastically simplifies ASR systems (a minimal usage sketch follows at the end of this slide)
• Predict graphemes or words directly; no pronunciation dictionaries; no CD states; no decision trees
• Use of "blank" symbols may be equivalent to a special HMM state-tying scheme
CTC/RNN has NOT replaced HMM (left-to-right)
• Relative 8% gain by CTC has been shown by a very
limited number of labs
• A. Graves and N. Jaitly. “Towards End-to-End Speech Recognition
with Recurrent Neural Networks,” ICML, 2014.
• A. Hannun, A. Ng et al. “DeepSpeech: Scaling up End-to-End
Speech Recognition,” arXiv Nov. 2014.
• A. Maas et al. “Lexicon-Free Conversational ASR with NN,” NAACL,
2015
• H. Sak et al. “Learning Acoustic Frame Labeling for ASR with RNN,”
ICASSP, 2015
• H. Sak, A. Senior, K. Rao, F. Beaufays. “Fast and Accurate Recurrent
Neural Network Acoustic Models for Speech Recognition,”
Interspeech, 2015
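For concreteness, here is a minimal, hypothetical sketch of the CTC objective as exposed in PyTorch (torch.nn.CTCLoss): an RNN/LSTM would produce per-frame log-probabilities over graphemes plus the special blank symbol, and CTC sums over all alignments of the (shorter) target grapheme sequence. The tiny vocabulary and random network outputs below are illustrative only.

```python
import torch
import torch.nn as nn

vocab = ["<blank>", "a", "b", "c", "h", "t"]    # blank symbol is index 0 here
T, batch, n_classes = 50, 2, len(vocab)

# Per-frame log-probabilities, standing in for a bidirectional LSTM + softmax output.
log_probs = torch.randn(T, batch, n_classes, requires_grad=True).log_softmax(dim=-1)

# Target grapheme sequences ("cat", "bah"), much shorter than the input length T.
targets = torch.tensor([3, 1, 5, 2, 1, 4])      # concatenated label indices
target_lengths = torch.tensor([3, 3])
input_lengths = torch.full((batch,), T, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                  # gradients flow back into the network
print(float(loss))
```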
36. Innovation: A new paradigm for speech recognition
36
• Seq2seq learning with an attention mechanism (borrowed from NLP machine translation); a minimal attention sketch follows below
• W. Chan, N. Jaitly, Q. Le, O. Vinyals. “Listen, attend, and spell,”
arXiv, 2015.
• J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio.
“Attention-Based Models for Speech Recognition,” arXiv, 2015
(accepted to NIPS).
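The attention mechanism these papers borrow boils down to a softmax over scores between the current decoder state and every encoder frame; the following minimal dot-product version is my own sketch, not the exact formulation of either cited paper.

```python
import torch

torch.manual_seed(0)
T, d = 120, 256                        # encoder frames and hidden size
encoder_states = torch.randn(T, d)     # "listener" outputs, one vector per frame
decoder_state = torch.randn(d)         # current "speller" hidden state

# Dot-product attention: score every frame, normalize, take a weighted sum.
scores = encoder_states @ decoder_state              # (T,)
weights = torch.softmax(scores / d ** 0.5, dim=0)    # attention distribution over frames
context = weights @ encoder_states                   # (d,) context vector fed to the decoder

print(weights.sum().item(), context.shape)           # 1.0, torch.Size([256])
```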
37. A Perspective on Recent Innovations
• All above deep learning innovations are based on
supervised, discriminative learning of DNN and recurrent
variants
• Capitalizing on big, labeled data
• Incorporating monotonic-sequential structure of speech
(non-monotonic for language, later)
• Hard to incorporate many other aspects of speech knowledge (e.g., a speech distortion model)
• Hard to do semi- and unsupervised learning
Deep generative modeling may overcome such difficulties
37
Li Deng and Roberto Togneri, Chapter 6: Deep Dynamic Models for Learning Hidden Representations of Speech Features, pp. 153-196,
Springer, December 2014.
Li Deng and Navdeep Jaitly, Chapter 2: Deep discriminative and generative models for pattern recognition, ~30 pages, in Handbook of
Pattern Recognition and Computer Vision: 5th Edition, World Scientific Publishing, Jan 2016.
38. Outline
38
• Introduction
• Part I: Some history
• How deep learning entered speech recognition (2009-2010)
• Generative & discriminative models prior to 2009
• Part II: State of the art
• Dominated by deep discriminative models
• Deep neural nets (DNN) & many of the variants
• Part III: Prospects
• Integrating deep generative/discriminative models
• Deep multimodal and semantic modeling
39. Deep Neural Nets vs. Deep Generative Models
• Structure: DNN: graphical, info flow bottom-up / Generative: graphical, info flow top-down
• Incorporating constraints & domain knowledge: DNN: hard / Generative: easy
• Unsupervised learning: DNN: hard or impossible / Generative: not easy, but at least possible
• Interpretation: DNN: harder / Generative: easier (a generative "story" about data and hidden variables)
Bring back deep generative models
• New advances in variational inference/learning
inspired by DNN (since 2014)
• Strengths of generative models missing in DNNs
Domain knowledge and dependency modeling (2015 book): e.g.
- how noisy speech observations are composed of clean speech & distortions
- vs. brute-force simulating noisy speech data
Unsupervised learning:
- millions of hrs of speech data available without labels
- vs. thousands of hrs of data with labels to train SOTA speech systems today
40. On unsupervised speech recognition using joint
deep generative/discriminative models
40
• Deep generative models not successful in 90’s & early 2000’s
• Computers were too slow
• Models going from text to speech waves were too simple
• Still true: speech scientists need to work harder with technologists
• And when generative models are not good enough, discriminative models
and learning (e.g., RNN) can help a lot
• Further, can iterate between the two, like wake-sleep (algorithm)
• Inference/learning methods for deep generative models not mature
at that time
• Only partially true today
• due to recent big advances in machine learning
• Based on new ways of thinking about generative graphical modeling
motivated by the availability of deep learning tools (e.g. DNN)
41. Advances in Inference Algms for Deep Generative Models
41
-Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
-Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013].
-Black Box Variational Inference. [R. Ranganath, S. Gerrish and D.M. Blei. 2013]
-Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
-Estimating or propagating gradients through stochastic neurons. [Y. Bengio, 2013].
-Neural Variational Inference and Learning in Belief Networks. [A. Mnih and K. Gregor, 2014, ICML]
-Stochastic backprop & approximation inference in deep generative models [D. Rezende, S. Mohamed, D. Wierstra, 2014]
-Semi-supervised learning with deep generative models [K. Kingma, D. Rezende, S. Mohamed, M. Welling, 2014, NIPS]
-auto-encoding variational Bayes [K. Kingma, M. Welling, 2014, ICML]
-Learning stochastic recurrent networks [Bayer and Osendorfer, 2015 ICLR]
-DRAW: A recurrent neural network for image generation. [K. Gregor, Danihelka, Rezende, Wierstra, 2015]
- Plus a number of NIPS-2015 papers, to appear.
Other solutions to the "large variance problem" in variational inference: Kingma & Welling 2014 (ICLR/ICML/NIPS); Rezende et al. 2014 (ICML); Ba et al. 2015 (NIPS).
Slide provided by Max Welling (ICML-2014 tutorial), with my updates for references from late 2014 and 2015. (A minimal reparameterization-trick sketch follows at the end of this slide.)
ICML-2014 Talk Monday June 23, 15:20
In Track F (Deep Learning II)
“Efficient Gradient Based Inference
through Transformations between
Bayes Nets and Neural Nets”
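Several of the listed papers (e.g. auto-encoding variational Bayes) rest on the reparameterization trick, which lets the variational lower bound be optimized by ordinary backprop; the toy PyTorch sketch below shows that single idea on random stand-in data and is not an implementation of any specific paper.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=20, z_dim=4, h=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.Tanh(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.Tanh(), nn.Linear(h, x_dim))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)              # q(z|x) = N(mu, diag(var))
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
        recon = self.dec(z)
        # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
        rec = ((recon - x) ** 2).sum(dim=-1)
        kl = 0.5 * (log_var.exp() + mu ** 2 - 1 - log_var).sum(dim=-1)
        return (rec + kl).mean()

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.randn(128, 20)                 # stand-in data (e.g. a speech feature frame)
for _ in range(100):
    opt.zero_grad()
    loss = vae(x)
    loss.backward()
    opt.step()
print("final negative ELBO:", float(loss))
```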
42. Another direction: Multi-Modal Learning (text, image, speech)
--- a “human-like” speech acquisition & image understanding model
• Distant supervised learning
• Project speech, image, and text to the
same “semantic” space
• Learn associations between different modalities
Refs: Huang et al. “Learning deep structured semantic models (DSSM)
for web search using clickthrough data,” CIKM, 2013; etc.
(Figure: three DNN towers into a shared semantic space. An image tower maps raw image pixels through stacked convolution/pooling layers, fully connected layers, and a softmax layer to image features i; a text tower maps an input text string t, and a speech tower maps speech features s, each through hidden layers H1-H3 with weights W1-W4; Distance(i,t) and Distance(s,t) are computed between the resulting top-layer representations. A toy two-tower sketch follows below.)
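A toy version of the architecture in this slide (my sketch; the real DSSM uses letter-trigram text inputs, larger layers, and click-through training) maps each modality through its own tower into one semantic space and scores pairs by cosine similarity, which plays the role of Distance(i,t) and Distance(s,t).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tower(in_dim, emb_dim=128):
    """One modality-specific DNN tower projecting into the shared semantic space."""
    return nn.Sequential(nn.Linear(in_dim, 300), nn.Tanh(),
                         nn.Linear(300, 300), nn.Tanh(),
                         nn.Linear(300, emb_dim))

text_tower = tower(in_dim=500)     # e.g. bag-of-letter-trigram text features (assumed size)
image_tower = tower(in_dim=2048)   # e.g. CNN image features (assumed size)
speech_tower = tower(in_dim=440)   # e.g. stacked acoustic frames (assumed size)

text = torch.randn(16, 500)
image = torch.randn(16, 2048)
speech = torch.randn(16, 440)

t = F.normalize(text_tower(text), dim=-1)
i = F.normalize(image_tower(image), dim=-1)
s = F.normalize(speech_tower(speech), dim=-1)

# Distance(i, t) and Distance(s, t) as cosine similarities; training would push
# matching pairs together and mismatched pairs apart.
sim_it = (i * t).sum(dim=-1)
sim_st = (s * t).sum(dim=-1)
print(sim_it.shape, sim_st.shape)   # torch.Size([16]) each
```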
43. Summary
• Speech recognition is the first success case of deep learning at industry scale
• Early NN & deep/dynamic generative models did not succeed, but
• They provided seeding work for inroads of DNNs to speech recognition (2009)
• Academic/industrial collaborations; Careful comparisons of two types of models
• Use of (big) context-dependent HMM states as DNN output layers led to huge error reduction while keeping decoding efficiency similar to the old GMM-HMMs.
• With large labeled training data, no need for generative pre-training of
DNNs/RNNs (2010 at MSR)
• Current SOTA is based on LSTM-RNN, CNN, DNN with ensemble learning, no
generative components
• Many weaknesses of this deep discriminative architecture have been identified & analyzed → the need to bring back deep generative models
43
44. Summary: Future Directions
• With supervised data, what will be the limit for growing accuracy wrt
increasing amounts of labeled speech data?
• Beyond this limit or when labeled data are exhausted or non-economical
to collect, how will unsupervised deep learning emerge?
• Three axes for future advances:
- Further push to the limit of supervised deep learning
- Semi/unsupervised deep learning: integrated generative/discriminative models; the generative part embeds knowledge/constraints, exploits powerful priors, and helps define high-capacity NN architectures for the discriminative nets
- Multimodal learning (distant supervised learning)
46.
Microsoft Research
• Auli, M., Galley, M., Quirk, C. and Zweig, G., 2013. Joint language and translation modeling with recurrent neural networks. In EMNLP.
• Auli, M., and Gao, J., 2014. Decoder integration and expected bleu training for recurrent neural network language models. In ACL.
• Baker, J., Li Deng, Jim Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O'Shaughnessy, Research Developments and Directions in Speech Recognition and
Understanding, Part 1, in IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, 2009.
Updated MINDS Report on Speech Recognition and Understanding
• Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, vol. 2.
• Bengio, Y., Courville, A., and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Trans. PAMI, vol. 35, pp. 1798-1828.
• Bengio, Y., Ducharme, R., and Vincent, P., 2000. A Neural Probabilistic Language Model, in NIPS.
• Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P., 2011. Natural language processing (almost) from scratch. in JMLR, vol. 12.
• Dahl, G., Yu, D., Deng, L., and Acero, A., 2012. Context-dependent, pre-trained deep neural networks for large vocabulary speech recognition, IEEE Trans. Audio, Speech, & Language Proc., Vol. 20 (1), pp. 30-42.
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T., and Harshman, R. 1990. Indexing by latent semantic analysis. J. American Society for Information Science,
41(6): 391-407
• Deng, L. A dynamic, feature-based approach to the interface between phonology & phonetics for speech modeling and recognition, Speech Communication, vol. 24, no.
4, pp. 299-323, 1998.
Computational Models for Speech Production
Switching Dynamic System Models for Speech Articulation and Acoustics
• Deng L., G. Ramsay, and D. Sun, Production models as a structural basis for automatic speech recognition," Speech Communication (special issue on speech production
modeling), in Speech Communication, vol. 22, no. 2, pp. 93-112, August 1997.
• Deng L. and J. Ma, Spontaneous Speech Recognition Using a Statistical Coarticulatory Model for the Vocal Tract Resonance Dynamics, Journal of the Acoustical Society
of America, 2000.
DEEP LEARNING: Methods and Applications
Machine Learning Paradigms for Speech Recognition: An Overview
• Deng L. and Xuedong Huang, Challenges in Adopting Speech Recognition, in Communications of the ACM, vol. 47, no. 1, pp. 11-13, January 2004.
• Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G., Binary Coding of Speech Spectrograms Using a Deep Auto-encoder, Interspeech, 2010.
• Deng, L., Tur, G, He, X, and Hakkani-Tur, D. 2012. Use of kernel deep convex networks and end-to-end learning for spoken language understanding, Proc. IEEE
Workshop on Spoken Language Technologies.
• Deng, L., Yu, D. and Acero, A. 2006. Structured speech modeling, IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1492-1504.
New types of deep neural network learning for speech recognition and related applications: An overview
Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition
Deep Stacking Networks for Information Retrieval
Deep Convex Network: A Scalable Architecture for Speech Pattern Classification
49. Unsupervised learning using deep
generative model (ACL, 2013)
• Distorted character-string images → text
• Easier than unsupervised speech → text
• 47% error reduction over Google’s open-source OCR system
49
Motivated me to
think about
unsupervised ASR
and NLP
50. Power: Character-level LM & generative modeling for
unsupervised learning
50
• “Image” data are naturally “generated” by the model quite accurately (like “computer graphics”)
• I had the same idea for unsupervised generative Speech-to-Text in 90’s
• Not successful because 1) Deep generative models were too simple for generating speech waves
2) Inference/learning methods for deep generative models not mature then
3) Computers were too slow
51.
Deep Generative Model for Image-Text Deep Generative Model for Speech-Text
L.Deng, A dynamic, feature-based approach to the interface between phonology
& phonetics for speech modeling and recognition, Speech Communication,
vol. 24, no. 4, pp. 299-323, 1998.
(Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 1997, 2000, 2003, 2006)
52.
Deep Generative Model for Image-Text Deep Generative Model for Speech-Text
(Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 2000, 2003, 2006)
Word-level language model plus feature-level pronunciation model
53.
Deep Generative Model for Image-Text Deep Generative Model for Speech-Text
Easy: likely no “explaining away” problem in
inference and learning
Hard: pervasive “explaining away” problem
due to speech dynamics
(Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 2000, 2003, 2006)
Articulatory dynamics
54.
Deep Generative Model for Image-Text Deep Generative Model for Speech-Text
(Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 2000, 2003, 2006)
Articulatory-to-acoustics mapping
55.
• The Image-Text generative model is very simple and easy to model accurately
• In contrast, the articulatory-to-acoustics mapping in ASR is much more complex
• During 1997-2000, shallow NNs were used for this mapping as a "universal approximator"; not successful
• Now we have better DNN tools, and even RNN/LSTM tools for dynamic modeling
57. Further thoughts on unsupervised ASR
57
• Deep generative modeling experiments not successful in 90’s
• Computers were too slow
• Models going from text to speech waves were too simple
• Still true: speech scientists need to work harder with technologists
• And when generative models are not good enough, discriminative models and learning (e.g., RNN) can help a lot; but how to do it? A hint comes next
• Further, can iterate between the two, like wake-sleep (algorithm)
• Inference/learning methods for deep generative models not mature
at that time
• Only partially true today
• due to recent big advances in machine learning
• Based on new ways of thinking about generative graphical modeling
motivated by the availability of deep learning tools (e.g. DNN)
• A brief review next
59. RNN vs. Generative HDM
e.g. Generative pre-training
(analogous to generative DBN pretraining for DNN)
NIPS-2015 paper to appear on simpler dynamic models for a non-ASR application
(Figure annotations: ~DBN, ~DNN)
Better ways of integrating deep generative/discriminative models are possible
- Hint: Example 1 where generative models are used to define the DNN architecture
60. Interpretable deep learning: Deep generative model
• Constructing interpretable DNNs based on generative topic models
• End-to-end learning by mirror-descent back-propagation to maximize the posterior probability p(y|x)
• y: output (win/loss), x: input feature vector
(A one-step mirror-descent sketch follows below.)
Mirror Descent
Algorithm (MDA)
J. Chen, J. He, Y. Shen, L. Xiao, X. He, J. Gao, X. Song, and L. Deng, “End-to-end Learning of Latent Dirichlet Allocation by Mirror-Descent Back
Propagation”, accepted NIPS2015.
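When the constraint set is the probability simplex, as for LDA topic proportions, the mirror-descent update reduces to an exponentiated-gradient step; the following is a minimal, generic sketch of that one update on a toy objective, not the cited paper's full MDA-backprop network.

```python
import numpy as np

def mirror_descent_simplex(grad_f, theta0, lr=0.5, steps=100):
    """Minimize f over the probability simplex via exponentiated-gradient updates."""
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_f(theta)
        theta = theta * np.exp(-lr * g)   # multiplicative (mirror) step
        theta /= theta.sum()              # renormalize onto the simplex
    return theta

# Toy objective: squared distance to a target distribution.
target = np.array([0.7, 0.2, 0.1])
grad_f = lambda th: 2.0 * (th - target)

theta = mirror_descent_simplex(grad_f, theta0=np.ones(3) / 3)
print(theta)   # approaches [0.7, 0.2, 0.1]
```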
61. Recall: (Shallow) Generative Model
(Figure: a shallow generative topic model; a hidden "TOPICS" layer, with topics such as war, animals, and computers, generates observed words such as "Iraqi", "the", and "Matlab". Slide revised from Max Welling.)
Editor's Notes
Expanding from Eric Xing’s list
Pros and cons of both machineries; Let’s take the best of both worlds, as the main theme of this talk.
For “learning algorithms”: A major focus of open research, many algorithms, & more to come this includes back-prop (hot in graphical models now).
Arden House, NY, 1988; TDNN beating HMM; second year; reverse by BBN
Many very influential NN researchers left NN to pursue graphical modeling (M. Jordan, e.g.; also Hinton on generative models); an amusing piece of history.
1994: dropped out of NN to pursue deep generative DBN while others pursued NN; 2009: returning to NN...
With Hinton; on RBM/DBN/DNN
With Dong, remove DBN pre-training with larger data; patent filed.
To show the unpopularity of NN, I show this historical report, sponsored by the predecessor of IARPA.
After 4 days of closed-door meetings, this 20+ page report was written; in its ML section, neural nets were mentioned nowhere. Rather, graphical models and CRFs were given higher priority.
Two versions of the HDM, one deterministic (backprop for learning), the other statistical (EM and EKF for learning). Both making lots of “reasonable” assumptions.
Nonlinear mappings here are done by a shallow NN in both versions, believing they are universal approximators.
Evaluated on the Switchboard task - very minor improvements, to our disappointment. We now understand why, and you will too in Part II when discussing RNN vs. HDM; nevertheless, this was a stepping stone to the DNN.
Note in both Bridle/Deng models: local, not distributed, representations of linguistic outputs are used, much like GMM-HMM. (Hinton in 2009 asked me to show a state-of-the-art GMM-HMM system, MLF files: "Crazy idea to use so many GMMs for each label.") Not the case in DNN: weights are inherently shared across all classes to represent them jointly; this is called distributed representations. Now back to deep generative models: despite the use of NN and back-prop, and despite the ability to incorporate problem constraints, the representations are not distributed.
Crazy idea of variational Bayes for learning such intractable deep generative model:
During E step, assume factorization of posterior probabilities --- destroying essential dynamics and hoping M step can recover the errors.
So only need simplification of this deep generative model to solve “remaining” engineering problem
Not as good for inference of top-layer, symbolic sequences (i.e. ASR)
Jelinek gave good advice on why and how to simplify.
In the mean time, there was parallel development of deep generative models in ML/NIPS community. NN is very unpopular; papers automatically rejected.
APSIPA tutorial on DBN for image generation; HM picked it up quickly and helped create one of the first deep-learning-based speech synthesizers.
This is a simplified model, hoping to speed up decoding
Fred Jelinek idea at JHU (on Nbest 1001)
Alternative Kluge 2 for inference problem
If no Kluge 1 but kluge 2 RNN
Due to frustrations of both industry and academia.
Overcoming limitations of DBN/DNN:
RNN: Reminiscent of RNN with thin frames vs. DNN with thick frames (see Google’s paper on LSTM-RNN using single frames)
Mixed generative/discriminative
No need for DBN pre-training when large training data are available
Several in the audience attended and gave invited talks at workshop (“deep speech recognition” is born; inroad of Deep learning into ASR)
In planning this workshop, GH and I were looking at who to invite.
Remember a very well-known person (nameless now) in all NN, deep generative model, and ASR, declined (something like “give me one week to think whether it is worth my time to travel to Whistler”); one week later, answer is “no”. Other invitees who declined --- “crazy idea of using waveform for ASR, like using photons for image recognition” (now we have a good paper at Interspeech-2014, Hermann Ney’s group, on this “crazy idea”)
Let’s first review the record-breaking performance of the DNN technology
TIMIT--- earlier experiment 2009-2010; 7700 test tokens; remember most of them
10%
20%
30%
With 3 orders of magnitude increase of the training size.
Opposite to other technologies; smaller tasks harder rather than easier (need to regularize more)
This set of results, coming from MSR, really excited the entire speech recog research community and industry.
RNN-LSTM: harder for GPU; glad to see CPU-cluster alternative (requiring new alg.)
Many subtleties; boxes in image need to align with phrases in the sentence description of the corresponding caption
Parameterizations: HDM --- a sparse matrix designed to produce critically damped dynamic behavior with a parsimonious set of HYPER-parameters, not learned. We now know how to learn them via BackProp discriminatively.
Mark Liberman emailed:
“… wondered whether there are any real-world problems for which the recurrent neural nets (which can be seen as non-linear autoregressive models) do a better job than (piecewise) linear AR models do. That last question is what led us to your 2007 paper -- it seems clear that the same sort of approach could be used in pitch tracking, which seems like a simpler sort of problem”
----------
From: Mark Liberman [mailto:markyliberman@gmail.com] Sent: Tuesday, January 29, 2013 4:44 PMTo: Li DengCc: Mark Hasegawa-Johnson (jhasegaw@gmail.com); Douglas O'Shaughnessy; Alex Acero
----------
GH: Inference is difficult in directed models of time series if they are non-linear and they use distributed representations.
So people tend to avoid distributed representations and use exponentially weaker methods (HMM’s) that are based on the idea that each visible frame of data has a single hidden cause. During generation from an HMM, each frame comes from one hidden state of the HMM