The document, a thesis-defense presentation by Tuan Dinh, describes improving speech intelligibility through spectral style conversion. It discusses using machine learning algorithms to automatically convert a speaker's habitual speech into a clearer speaking style to increase intelligibility in noise. Specifically, it aims to 1) determine effective spectral features for conversion, 2) develop methods for converting typical and dysarthric speech into clearer styles, and 3) develop methods for converting alaryngeal speech into intelligible speech. It also evaluates two new sets of spectral features - probabilistic peak tracking features and manifold features - for speech reconstruction and style conversion tasks.
Comparative study of Text-to-Speech Synthesis for Indian Languages by using S... — ravi sharma
This paper explores a syllable-based approach to building language-independent text-to-speech systems for Indian languages. Using a common phone set, a common question set, and borrowed context-independent monophone models along with the syllable approach across languages makes the procedure easier and less time-consuming without compromising synthesized speech quality. Systems can be built without even knowing the language, which is especially beneficial in the Indian scenario.
An Introduction to Various Features of Speech Signal — Sivaranjan Goswami
An overview of various temporal, spectral, and cepstral features of a speech signal used in digital speech processing.
For more tutorials visit:
https://sites.google.com/site/enggprojectece
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... — kevig
Neural machine translation is a new approach to machine translation that has shown effective results
for high-resource languages. Recently, attention-based neural machine translation with a large-scale
parallel corpus has played an important role in achieving high translation performance. In this
research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based
neural machine translation models are introduced at the word-to-word, character-to-word, and
syllable-to-word levels. We conduct experiments with the proposed models on translating long sentences
and on addressing morphological problems. To mitigate the low-resource problem, source-side monolingual
data are also used. This work thus investigates improving the Myanmar-to-English neural machine
translation system. The experimental results show that the syllable-to-word level neural machine
translation model obtains an improvement over the baseline systems.
In present-day communications, speech signals are contaminated by
various kinds of noise that degrade speech quality and adversely impact
speech recognition performance. To overcome these issues, a novel approach
to speech enhancement using Modified Wiener filtering is developed, and
power spectrum computation is applied to the degraded signal to obtain the
noise characteristics from the noisy spectrum. In the next phase, an MMSE
technique is applied in which the Gaussian distribution of each signal, i.e.,
the original and noisy signals, is analyzed. The Gaussian distribution provides
spectrum estimates and spectral coefficient parameters that can be used for
probabilistic model formulation. Moreover, a-priori-SNR computation is also
incorporated for coefficient updating and noise presence estimation, which
operates similarly to conventional VAD. However, the conventional VAD scheme
is based on a hard threshold, which cannot deliver satisfactory performance,
so a soft-decision threshold is developed to improve the performance of
speech enhancement. An extensive simulation study is carried out in MATLAB
on the NOIZEUS speech database, and a comparative study is presented in which
the proposed approach proves better than the existing technique.
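As a concrete illustration of the a-priori-SNR update mentioned above, here is a minimal sketch of the widely used decision-directed recipe feeding a Wiener gain. It is a generic illustration rather than this paper's exact algorithm; the smoothing factor `alpha`, the gain floor `g_min`, and the precomputed noise PSD are assumptions.

```python
import numpy as np

def wiener_enhance(noisy_mag, noise_psd, alpha=0.98, g_min=0.1):
    """Per-frame Wiener gain driven by a decision-directed a-priori SNR.
    noisy_mag: (n_frames, n_bins) STFT magnitudes of the noisy signal.
    noise_psd: (n_bins,) noise power estimate (e.g., from non-speech frames)."""
    clean_psd_prev = noisy_mag[0] ** 2                 # crude initialization
    enhanced = np.empty_like(noisy_mag)
    for t, mag in enumerate(noisy_mag):
        snr_post = (mag ** 2) / noise_psd              # a-posteriori SNR
        snr_prio = (alpha * clean_psd_prev / noise_psd # decision-directed a-priori SNR
                    + (1 - alpha) * np.maximum(snr_post - 1.0, 0.0))
        gain = np.maximum(snr_prio / (1.0 + snr_prio), g_min)  # Wiener gain + floor
        enhanced[t] = gain * mag
        clean_psd_prev = enhanced[t] ** 2              # feed back the clean estimate
    return enhanced
```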
A high-quality end-to-end speech translation model relies on large-scale speech-to-text training data,
which is usually scarce or even unavailable for some low-resource language pairs. To overcome this,
we propose a target-side data augmentation method for low-resource speech translation. In particular,
we first generate large-scale target-side paraphrases based on a paraphrase generation model that
incorporates several statistical machine translation (SMT) features and the commonly used recurrent
neural network (RNN) feature. Then, a filtering model consisting of semantic similarity and word-speech
pair co-occurrence is proposed to select the highest-scoring source-paraphrase pairs from the candidates.
Experimental results on English, Arabic, German, Latvian, Estonian, Slovenian, and Swedish paraphrase
generation show that the proposed method achieves significant and consistent improvements over several
strong baseline models on the PPDB datasets (http://paraphrase.org/). To introduce the paraphrase
generation results into low-resource speech translation, we propose two strategies: audio-text pair
recombination and multi-reference training. Experimental results show that speech translation models
trained on new audio-text datasets that incorporate the paraphrase generation results achieve
substantial improvements over the baselines, especially for low-resource languages.
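To make the filtering step concrete, the following is a hypothetical scorer that ranks candidate paraphrases by a weighted combination of semantic similarity (cosine between embedding vectors) and a precomputed co-occurrence score. All names and the weighting scheme are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def select_paraphrases(cand_vecs, ref_vec, cooc_scores, k=5, w=0.5):
    """Rank candidates by w * cosine(candidate, reference) + (1 - w) * co-occurrence.
    cand_vecs: (n_cands, dim) candidate embeddings; ref_vec: (dim,) reference."""
    sims = cand_vecs @ ref_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(ref_vec))
    score = w * sims + (1 - w) * np.asarray(cooc_scores)
    return np.argsort(score)[::-1][:k]        # indices of the top-k candidates
```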
LPC Models and Different Speech Enhancement Techniques - A Review — ijiert bestjournal
The author has already published one review paper on enhancing the quality of a speech signal by minimizing noise; this is the second paper of the same series. Over the last two decades researchers have made continuous efforts to remove noise from speech signals. This paper comments on the various studies and analysis proposals of researchers for enhancement of speech signal quality. Various models, coding, speech quality improvement methods, speaker-dependent codebooks, autocorrelation subtraction, speech restoration, producing speech at low bit rates, compression, and enhancement are the various aspects of speech enhancement. We present a review of all the above-mentioned technologies in this paper and intend to examine a few of the techniques in order to analyze the factors affecting them in an upcoming paper of the series.
Improvement in Quality of Speech associated with Braille codes - A Review — inscit2006
J. Anurag, P. Nupur and Agrawal, S.S.
School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Centre for Development of Advanced Computing, Noida, India
Direct Punjabi to English Speech Translation using Discrete Units — IJCI JOURNAL
Speech-to-speech translation is yet to reach the same level of coverage as text-to-text translation systems. Current speech technology is highly limited in its coverage of the over 7000 languages spoken worldwide, leaving more than half of the population deprived of such technology and shared experiences. With voice-assisted technology (such as social robots and speech-to-text apps) and auditory content (such as podcasts and lectures) on the rise, ensuring that the technology is available for all is more important than ever. Speech translation can play a vital role in mitigating technological disparity and creating a more inclusive society. Motivated to contribute towards speech translation research for low-resource languages, our work presents a direct speech-to-speech translation model from Punjabi, an Indic language, to English. Additionally, we explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model. The model, abbreviated as Unit-to-Unit Translation (U2UT), takes a sequence of discrete units of the source language (the language being translated from) and outputs a sequence of discrete units of the target language (the language being translated to). Our results show that the U2UT model performs better than the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU points.
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo... — Kotaro Hara
Our talk at CHI 2015 in Seoul, South Korea. Find more information at www.kotarohara.com.
YouTube: https://youtu.be/isqsYLkX9gA
Makeability Lab: http://www.cs.umd.edu/~jonf/
Microsoft Research: http://research.microsoft.com/
ABSTRACT
Language barrier is the primary challenge for effective cross-lingual conversations. Spoken language translation (SLT) is perceived as a cost-effective alternative to less affordable human interpreters, but little research has been done on how people interact with such technology. Using a prototype translator application, we performed a formative evaluation to elicit how people interact with the technology and adapt their conversation style. We conducted two sets of studies with a total of 23 pairs (46 participants). Participants worked on storytelling tasks to simulate natural conversations with 3 different interface settings. Our findings show that collocutors naturally adapt their style of speech production and comprehension to compensate for inadequacies in SLT. We conclude the paper with the design guidelines that emerged from the analysis.
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE... — sipij
The transcription accuracy of an automatic speech recognition (ASR) system may suffer when recognizing
accented speech. The resulting bias in an ASR system towards a specific accent is due to under-representation
of that accent in the training dataset. Accent recognition of existing speech samples can help with the
preparation of training datasets, which is an important step toward closing the accent gap and
eliminating biases in ASR systems. To that end, we built a system to recognize accents from spoken speech data.
In this study, we have explored prosodic and vocal speech features as well as speaker embeddings for
accent recognition on our custom English speech data, which covers speakers from around the world with
varying accents. We demonstrate that our selected speech features are more effective in recognizing
non-native accents. Additionally, we experimented with a hierarchical classification model for multi-level
accent classification. To establish an accent hierarchy, we employed a bottom-up approach, combining
regional accents and categorizing them as either native or non-native at the top level. Furthermore, we
conducted a comparative study between flat classification and hierarchical classification using the accent
hierarchy structure.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because the source posterior probabilities are directly used for predicting target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between the source and target posterior probabilities. The conversion models perform non-linear and variable-length transformation from the source probability sequence to the target one. Further, we propose a joint training algorithm for the modules. In contrast to conventional VC, which separately trains the speech recognition that estimates posterior probabilities and the speech synthesis that predicts target speech parameters, our proposed method jointly trains these modules along with the proposed probability conversion modules. Experimental results demonstrate that our approach outperforms the conventional VC.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include Formant Synthesis and Articulatory Synthesis. Formant Synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory Synthesis produces speech by directly modeling human articulator behavior. Second-generation techniques comprise Concatenative Synthesis and Sinusoidal Synthesis. Concatenative Synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal Synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency. The model parameters are the amplitudes and periods of the harmonics. With these, the value of the fundamental can be changed while keeping the same basic spectral envelope. In addition, the third generation includes Hidden Markov Model (HMM) synthesis and Unit Selection Synthesis. HMM synthesis trains a parametric model and produces high-quality speech. Finally, Unit Selection operates by selecting the best sequence of units from a large speech database that matches the specification.
This paper proposes a voice morphing system for people who have undergone laryngectomy, the surgical removal of all or part of the larynx (voice box), particularly performed in cases of laryngeal cancer. A primitive method of achieving voice morphing is extracting the source speaker's vocal coefficients and then converting them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMMs) to map the coefficients from source to destination. However, the traditional GMM-based mapping approach results in over-smoothing of the converted voice. We therefore propose a method for efficient GMM-based voice morphing and conversion that overcomes the over-smoothing effects of the traditional method. It uses glottal waveform separation and prediction of excitations, and the results show that not only is over-smoothing eliminated but the transformed vocal tract parameters also match the target. Moreover, the synthesized speech thus obtained is found to be of sufficiently high quality. Voice morphing based on this GMM approach is proposed and critically evaluated using various subjective and objective evaluation measures, and an application of the approach for laryngectomees is recommended.
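For context, the conventional GMM mapping that this paper improves upon is usually a joint-density model with conditional-mean conversion. A minimal sketch of that baseline follows, assuming time-aligned source and target feature frames; it is not the paper's proposed glottal-waveform method.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(src, tgt, k=8):
    """Fit a GMM on joint [source; target] frames. src, tgt: (n_frames, d)."""
    return GaussianMixture(n_components=k, covariance_type="full").fit(
        np.hstack([src, tgt]))

def convert(gmm, x, d):
    """Conditional-mean conversion E[y | x] under the joint GMM; x: (d,)."""
    mu, cov, w = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.array([w[k] * multivariate_normal.pdf(x, mu[k, :d], cov[k, :d, :d])
                     for k in range(len(w))])
    resp /= resp.sum()                       # component responsibilities given x
    y = np.zeros(mu.shape[1] - d)
    for k in range(len(w)):
        reg = cov[k, d:, :d] @ np.linalg.inv(cov[k, :d, :d])
        y += resp[k] * (mu[k, d:] + reg @ (x - mu[k, :d]))
    return y                                 # averaging is what over-smooths
```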
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... — Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides the means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
2. Table of Contents
1 Introduction
Motivation
Approach
Thesis Problem and Statement
Specific Aims
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
3. Unintelligible Speech
Speech is important for human communication
The typical way of speaking is referred to as habitual speech
Habitual speech becomes less intelligible in noise
Habitual speech is also hard to understand for people with
hearing impairments and non-native speakers
4. Unintelligible Speech
Figure: Synthetic speech of speaking devices is degraded by noise
Figure: Atypical speech is hard to understand, especially in noise
5. Listener Side Solution
Use noise suppression and cancellation methods
Require noise-cancellation devices, which take as input a noisy
speech signal and output an enhanced signal with higher
intelligibility and quality
There are many cases where listeners don't have
noise-cancellation devices, e.g., transit announcements
6. Lessons from Real Speakers: Habitual vs Clear
Speakers adjust their voice to make it more intelligible
Adopt special clear speaking style to make habitual speech
more resilient to noisy environments and listener deficits
Researchers showed that:
Clear speech features extended phoneme duration, longer and
more frequent pauses [Picheny86, Bradlow03, Krause04]
Clear speech is more intelligible than habitual speech [Picheny85,
Krause02]
Spectral and duration factors probably contribute significantly to
the improved intelligibility of clear speech [Kain08, Tjaden14]
7. Speaker Side Solution
Convert habitual speech directly from speakers into clear
speech prior to its distortion due to background noise
Figure: Make habitual speech (generated by speech synthesizer) more resilient to noise
Figure: Make atypical speech (spoken by people with dysarthria) more resilient to noise
8. Previous Work on Speaker Side Solution
Applied filters to habitual speech to create spectral
characteristics of clear speech [Koutsogannaki14]
improved intelligibility for typical speakers
had a trade-off between intelligibility and naturalness
did not model the conversion from habitual to clear speech
Utilized HAB-to-CLR spectral style conversion on vowels using
a Gaussian Mixture Model [Mohammadi12]
Converted dysarthric speech into typical speech using a
Gaussian Mixture Model [Kain07]
Converted alaryngeal speech into typical speech using deep
neural networks [Kazuhiro18, Othmane19]
These machine learning-based methods (e.g., deep neural
networks) showed the most promising results, but there is still
room for improvement
9. Thesis Problem and Statement
Problem
Modifying the habitual speech of typical and atypical speakers on
the speaker side to increase intelligibility in noise is a challenging
problem
Statement
Speech intelligibility of typical and atypical speakers can be
improved automatically by learning how speakers modify their
voice to make it more intelligible
10. Specific Aims
1 Determine effective spectral features for spectral voice and
style conversion for typical and dysarthric speakers
2 Develop effective HAB-to-CLR spectral mappings using
machine learning algorithms for typical and dysarthric speakers
3 Develop effective methods for converting alaryngeal speech
into intelligible speech, using machine learning algorithms
4 Investigate the performance of duration style conversion on
speech intelligibility (Only in dissertation)
11. Table of Contents
1 Introduction
2 Background
Acoustic Features and Speech Intelligibility: Hybridization
Voice and Style Conversion
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
12. Acoustic Features and Speech Intelligibility: Hybridization
Determine the acoustic causes of improved intelligibility in
clear speech
1 Insert clear components (e.g., clear spectrum) into habitual
speech to create hybrid speech
2 Find acoustic components that make hybrid speech more
intelligible than habitual speech
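The hybridization recipe above can be sketched in a few lines: time-align the clear and habitual envelope sequences, then keep habitual timing while substituting the clear spectrum. The envelope features and DTW alignment below are assumptions standing in for the actual procedures of [Kain08] and [Tjaden14].

```python
import numpy as np
import librosa

def hybridize(hab_env, clr_env):
    """Insert clear spectral envelopes into habitual speech frames.
    hab_env, clr_env: (n_bins, n_frames) vocoder spectral-envelope matrices."""
    # DTW on log envelopes gives a frame-level habitual <-> clear alignment
    _, wp = librosa.sequence.dtw(X=np.log(hab_env + 1e-8),
                                 Y=np.log(clr_env + 1e-8))
    hybrid = hab_env.copy()
    for i_hab, i_clr in wp[::-1]:            # path is returned end-to-start
        hybrid[:, i_hab] = clr_env[:, i_clr]  # keep HAB timing, CLR spectrum
    return hybrid                             # resynthesize with the vocoder
```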
13. Hybridization Findings
For typical speakers, inserting clear spectrum and duration
obtained a 24% improvement in sentence transcription accuracy
[Kain08]
For dysarthric speakers, Tjaden found that:
Inserting clear energy obtained an 8.7% improvement
Inserting clear spectrum obtained an 18% improvement
Inserting clear spectrum and duration obtained a 13.4%
improvement in a scaled intelligibility test [Tjaden14]
14. Voice Conversion
Voice Conversion (VC) is a process of transforming a source
speaker’s speech so it sounds like a target speaker’s speech
Figure: Voice Conversion framework
During the training phase,
prepare parallel utterances,
which contain pairs of
utterances from the source
and target speakers with the
same words
15. Voice Conversion: Training Phase
Figure: Voice Conversion framework
1 Speech Analysis:
1 extract speech features
using Vocoder
2 analyze speech features
into mapping features
(Aim 1)
2 Time Alignment: align
mapping features between
source and target speakers
3 Train mapping function:
produces a mapping
function from aligned
mapping features (Aim 2)
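Steps 2 and 3 of the training phase can be illustrated with off-the-shelf tools. In the sketch below, an MLP regressor stands in for the thesis's mapping functions, and the mapping features are assumed to be extracted and DTW-aligned already.

```python
from sklearn.neural_network import MLPRegressor

def train_mapping(src_feats, tgt_feats):
    """Learn a frame-wise source -> target mapping.
    src_feats, tgt_feats: (n_frames, dim) time-aligned mapping features."""
    mapper = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    mapper.fit(src_feats, tgt_feats)
    return mapper

# Conversion phase: converted = mapper.predict(new_src_feats),
# then synthesize the speech signal from the converted features with a vocoder.
```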
16. Voice Conversion: Conversion Phase
Figure: Voice Conversion framework
1 Speech Analysis: analyze
mapping features of input
utterance from source
speaker
2 Map the features: apply
mapping function
3 Speech Synthesis: synthesize
speech signal using Vocoder
17. Style Conversion
Learn how to map one speaking style to another, such as
habitual to clear, of the same speaker
Use VC mapping techniques in this task
Gaussian mixture models were used to map habitual to clear
vowels, with modest results [Mohammadi12]
These mappings are probably limited by:
inappropriate mapping features (Aim 1)
over-smoothing problem of the mapping techniques (Aim 2)
[Toda05]
18. Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
Probabilistic Peak Tracking Features
Manifold Features
Experiment: Reconstruction Quality
Experiment: Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
19. Spectral Features for Style Conversion
Determine effective spectral representations for spectral style
conversion
Contrast two new sets of features:
1 Probabilistic peak tracking (PPT) features
2 Manifold features
Evaluate the two sets in
speech reconstruction
style conversion
The dissertation also includes a voice conversion evaluation
20. Probabilistic Peak Tracking Features
Represent the spectrum by a set of frequencies of nine peaks in a
magnitude (energy) spectrum and their corresponding peak
bandwidths
Similar spectra have similar peak frequencies
Assume that peak frequencies change slowly and continuously
over time
This sometimes causes the peak frequency contours not to pass
through spectral peaks
Peak bandwidths are used to represent the presence or
absence of magnitude peaks:
a wide bandwidth represents the absence of a peak
a narrower bandwidth represents the presence of one
21. Probabilistic Peak Tracking
Constrain 4 peak frequencies to be the first 4 formant
frequencies (F1–F4), which are important for speech intelligibility
Track 4 peak frequencies in the high-frequency region,
initialized at 5000, 6000, 7000, and 8000 Hz
Also calculate the glottal formant frequency, which is
correlated with F0
Finally, calculate the corresponding peak bandwidths in an
iterative process to best reconstruct the original spectrum
from the computed peak frequencies and peak bandwidths
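To make the peak-plus-bandwidth idea concrete, here is an illustrative reconstruction of a magnitude envelope from nine peak frequencies and bandwidths, where a wide bandwidth flattens a peak away. This is a toy stand-in, not the actual PPT model; all numeric values are invented.

```python
import numpy as np

def envelope_from_peaks(peak_freqs, peak_bws, fs=16000, n_bins=513):
    """Illustrative-only: each peak contributes a Gaussian bump whose width
    is its bandwidth, so a very wide bandwidth amounts to an absent peak."""
    f = np.linspace(0.0, fs / 2.0, n_bins)
    env = np.zeros(n_bins)
    for fc, bw in zip(peak_freqs, peak_bws):
        env += np.exp(-0.5 * ((f - fc) / bw) ** 2)
    return 20.0 * np.log10(env + 1e-6)        # dB magnitude envelope

# Nine peaks: a glottal peak near F0, F1-F4, and four high-frequency tracks
# initialized at 5, 6, 7, and 8 kHz (values here are made up for illustration).
env_db = envelope_from_peaks(
    peak_freqs=[120, 500, 1500, 2500, 3500, 5000, 6000, 7000, 8000],
    peak_bws=[100, 80, 120, 150, 200, 400, 800, 1500, 2000])
```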
22. Manifold Features
The features are purely machine-learned
The representation is realized through projection of
high-dimensional acoustic features onto a lower-dimensional
manifold
Learn the manifold from a large multi-speaker database of
speech data using a Variational Autoencoder (VAE)
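A minimal sketch of such a manifold learner follows: a vanilla VAE whose 12-dimensional latent matches the VAE-12 features used later in the talk. The 513-bin spectral input and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SpectralVAE(nn.Module):
    """Project high-dimensional spectra onto a 12-dimensional manifold."""
    def __init__(self, n_in=513, n_latent=12, n_hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    rec = ((x - x_hat) ** 2).sum(dim=1).mean()                   # reconstruction
    kld = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
    return rec + kld
```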
24. Using PPT and manifold features for reconstruction
Figure: Speech reconstruction with PPT features
Figure: Speech reconstruction with manifold features
25. Experiment: Reconstruction Quality
Evaluate the speech reconstruction quality of PPT-20 and
manifold features (VAE-12) in comparison to 3 baselines:
20th-order Line Spectral Frequencies (LSF-20)
12th-order Mel-cepstral coefficients (MCEP-12)
Natural speech
Select data from 4 random speakers (2 male, 2 female) in the
Voice Conversion Challenge (VCC) dataset
Conduct a comparative mean opinion score (CMOS) test
Participants listen to sentences A and B, and specify whether
A is more natural than B
Answers on a 5-point scale: "definitely better" (+2), "better"
(+1), "same" (0), "worse" (−1), and "definitely worse" (−2)
26. CMOS Results
A \ B      LSF-20   MCEP-12   VAE-12   PPT-20
NAT        +0.77*   +1.34*    +1.02*   +1.28*
LSF-20              +1.08*    -0.04    +0.26*
MCEP-12                       -0.44*   -0.31*
VAE-12                                 +0.45*
Table: Relative quality between original and vocoded stimuli. Positive values show A is
better than B. Results marked with an asterisk are significantly different.
27. CMOS Results
Show the ordering of the systems by projecting the above table
onto a single dimension using Multidimensional Scaling (MDS)
All pairwise comparisons are used to compute this ordering
Natural speech (NAT) is better than all synthetic systems
There is still a lot of room for improving synthetic speech
VAE-12 is significantly better than MCEP-12
VAE-12 is significantly better than PPT-20 and more compact
Although LSF-20 is better than VAE-12 here, VAE-12 is
better for voice conversion (in the dissertation)
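The slides project the pairwise CMOS table onto one dimension with MDS. A simple alternative that yields a similar ordering is a least-squares fit that treats each CMOS value as an estimate of the quality difference q_A − q_B; the sketch below uses the table values above but is not the evaluation actually performed.

```python
import numpy as np

# Pairwise CMOS scores from the table: positive means A was judged better than B.
pairs = {("NAT", "LSF-20"): 0.77, ("NAT", "MCEP-12"): 1.34, ("NAT", "VAE-12"): 1.02,
         ("NAT", "PPT-20"): 1.28, ("LSF-20", "MCEP-12"): 1.08,
         ("LSF-20", "VAE-12"): -0.04, ("LSF-20", "PPT-20"): 0.26,
         ("MCEP-12", "VAE-12"): -0.44, ("MCEP-12", "PPT-20"): -0.31,
         ("VAE-12", "PPT-20"): 0.45}
systems = ["NAT", "LSF-20", "MCEP-12", "VAE-12", "PPT-20"]

A = np.zeros((len(pairs) + 1, len(systems)))
b = np.zeros(len(pairs) + 1)
for r, ((s1, s2), score) in enumerate(pairs.items()):
    A[r, systems.index(s1)], A[r, systems.index(s2)], b[r] = 1, -1, score
A[-1, :] = 1                                   # anchor: scores sum to zero
q, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(systems, q.round(2))))          # one-dimensional quality ordering
```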
28. Experiment: Style Conversion
Evaluate the efficacy of manifold features for mapping habitual
style to clear style to improve intelligibility
We only look at manifold features here
Database of 78 speakers: 32 typical speakers (CS), 30 with
multiple sclerosis (MS), and 16 with Parkinson's disease (PD)
Each read 25 Harvard sentences in habitual and clear styles
Establish which speakers benefit from inserting clear spectrum
into habitual speech via hybridization
Evaluate the intelligibility of hybrid speech (habitual speech
plus clear spectrum) using a keyword recall test
66 participants listened to and typed 25 Harvard sentences
Hybrid speech improved the intelligibility of habitual speech
for 3 speakers: PDF7, PDM6, and CSM7
30. VAE with Style conversion mapping
Examine two different DNN architectures
1 Feedforward network (called DNN-mapping VAE)
2 Feedforward network with skip connections (called
skip-mapping VAE)
Output is habitual speech plus modified spectrum
31. Feedforward network with skip connection
Figure: Skip-mapping network. The current HAB VAE-12 frame and its left and right context vectors (60 dimensions each) are concatenated, passed through two Dense-512 layers, concatenated again with the input (skip connection), passed through two more Dense-512 layers and a Linear-12 layer, and finally added to the current HAB frame to yield the current CLR VAE-12 frame.
The use of skip-connections is motivated by the fact that the
spectral difference in style conversion can be small
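One plausible PyTorch rendering of the diagram above, with layer sizes taken from the slide; exactly which tensors feed the second concatenation is my reading of the figure.

```python
import torch
import torch.nn as nn

class SkipMappingVAE(nn.Module):
    """HAB -> CLR mapping over VAE-12 features with a residual output."""
    def __init__(self):
        super().__init__()
        # input: current HAB frame (12) + left context (60) + right context (60)
        self.block1 = nn.Sequential(nn.Linear(132, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512 + 132, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.out = nn.Linear(512, 12)

    def forward(self, cur, left, right):
        x = torch.cat([cur, left, right], dim=-1)    # first Concat
        h = self.block1(x)
        h = self.block2(torch.cat([h, x], dim=-1))   # skip-connection Concat
        return cur + self.out(h)                     # Add: predict CLR - HAB
```

The residual add matches the motivation on the slide: when the spectral difference between styles is small, the network only has to learn a small correction.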
32. Speech Intelligibility Evaluation
System                CSM7   PDF7   PDM6
Reconstructed HAB      38     13     24
DNN-mapping VAE        32     13     35
Skip-mapping VAE       38     11     46*
CLR spectrum-hybrid    56*    27*    50*
Reconstructed CLR      69*    23*    41*
Table: Average keyword accuracy (%). Results marked with an asterisk are significantly
different
CLR spectrum-hybrid is HAB speech plus the CLR spectrum
It is the gold standard for spectrum mapping
Conduct a keyword recall test with 30 participants
Skip-mapping VAE increased the intelligibility of HAB speech
from 24% to 46% for PDM6 (a male with Parkinson's disease)
This shows the potential of manifold features, but the DNN
mapping might be too simplistic
33. Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
Conditional Generative Adversarial Nets: Background
One-to-One Mapping
Many-to-One Mappings
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
34. Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
Improve HAB-to-CLR spectral mapping for style conversion
Utilize conditional Generative Adversarial Nets (cGANs) to
map the spectral features of habitual speech to those of clear
speech
Investigate cGANs in three spectral style conversion
mappings:
1 one-to-one mappings
2 many-to-one mappings
3 many-to-many mappings (only in dissertation)
Tuan Dinh Improving Speech Intelligibility
35. Generative Adversarial Nets
A GAN has a Generator (G) and a Discriminator (D) [Goodfellow14]
G generates images and D decides whether they are generated or real
As one gets better, so does the other
D is only used during training
Applications: data augmentation, face aging, super-resolution
36. cGANs for Style Conversion
Figure: cGAN for style conversion. G takes the HAB VAE-12 features (with left and right context) and generates CLR VAE-12 features; D receives either real or generated CLR VAE-12 features, conditioned on the HAB input, and decides: real or generated?
cGAN is a GAN conditioned on auxiliary data
G takes the HAB spectrum as input and generates a CLR spectrum
D discriminates between generated and real CLR spectra
The real CLR and HAB spectra come from the same sentence and speaker
The real CLR spectrum is time-warped to align with the HAB spectrum
D is conditioned on the HAB spectrum, so it learns whether a generated
CLR spectrum is a good transformation of the HAB spectrum
By including D, we effectively learn a better loss function for G
(one training step is sketched below)
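One cGAN training step might look like the following sketch; `G` and `D` are assumed to be PyTorch modules where `D(hab, clr)` internally concatenates its conditioning input with the (real or generated) CLR features, and the added L1 term is a common stabilizer (as in pix2pix) rather than a detail confirmed by the slides.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, hab, clr_real, l1_weight=10.0):
    # --- Discriminator: real pairs vs. generated pairs ---
    clr_fake = G(hab).detach()
    d_real = D(hab, clr_real)
    d_fake = D(hab, clr_fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator: fool D, stay close to the time-warped real CLR target ---
    clr_fake = G(hab)
    d_fake = D(hab, clr_fake)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + l1_weight * F.l1_loss(clr_fake, clr_real))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```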
38. One-to-One Mapping
The goal is to improve the style-conversion performance from the
previous section
Train a cGAN for each speaker to map the HAB spectrum to the CLR
spectrum
At conversion time, apply the speaker-specific mapping to the same speaker
The output is habitual speech plus the modified spectrum
39. Objective Evaluation: Log Spectral Distortion (dB)
Log spectral distortion is the root-mean-square difference
between the converted log spectrum and the target CLR log spectrum

Mapping                  PDF7    PDM6    CSM7
DNN (previous section)   16.80   16.67   16.44
GAN                      12.85   12.58   12.67

The GAN achieves lower log spectral distortion than the DNN
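One common way to compute this metric, assuming `converted` and `target` are aligned magnitude spectrogram matrices (frames x bins):

```python
import numpy as np

def log_spectral_distortion(converted, target, eps=1e-10):
    # Difference of log-magnitude spectra in dB
    diff = 20 * np.log10(converted + eps) - 20 * np.log10(target + eps)
    # RMS across frequency bins per frame, then average over frames
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))
```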
41. Subjective Evaluation
Log spectral distortion is only a rough predictor of human perception
Conduct a keyword recall test with 60 participants, listening to and
typing 25 Harvard sentences (same as previous experiments)
Figure: Average keyword accuracy (%) for CSM7, PDF7, and PDM6 under five conditions: vocoded HAB, DNN, GAN, hybrid, and vocoded CLR.
cGAN outperforms DNN
cGAN significantly increases intelligibility for two speakers
(one typical and one with Parkinson's disease)
42. Many-to-One Mappings
The disadvantage of one-to-one mappings is that they require
speaker-specific training data
This makes them difficult to apply to new speakers in real-life applications
43. Method
Pick the two target speakers with the best sentence-level intelligibility
one male and one female
both happen to be typical speakers
Map the habitual speech of multiple speakers to the targets
Train on all speakers except the two targets and the three speakers
held out for testing (PDM6, PDF7, and CSM7)
i.e., 29 typical speakers, 30 with MS, and 14 with Parkinson's disease
At conversion time, apply the mapping to unseen speakers
44. Subjective Evaluation
Conduct a keyword recall test with 44 participants
Figure: Keyword recall accuracy of the three speakers (CSM7, PDF7, PDM6) under four conditions: vocoded HAB, GAN, hybrid, and vocoded CLR. The dashed lines show statistically significant differences.
Many-to-one mapping increases intelligibility for one speaker
(a person with Parkinson's disease)
Promising, but not as good as one-to-one mapping
45. Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
Data
Predicting Voicing or Degree of Voicing
Predicting Spectrum
Synthesizing Pitch
Subjective Evaluation
6 Conclusion
46. Alaryngeal Speech
People who undergo total laryngectomy lose the ability to
produce speech sounds normally
Their speech options (esophageal speech, tracheo-esophageal
puncture (TEP), and electrolarynx (ELX)) are difficult to
understand due to:
poor voice quality
no voiced/unvoiced differentiation
lack of articulatory precision
no F0
Alaryngeal speech is more distorted than speech with mild Parkinson's disease
There is no clear-speech counterpart for LAR speakers
47. Flowchart of proposed method
Figure: Flowchart of the proposed method. LAR speech is analyzed by the WORLD vocoder into LAR spectra (converted to LAR MCEP) and LAR energy. The MCEP model predicts INT MCEP (converted back to INT spectra), the AP model predicts INT aperiodicity, and the VUV model predicts INT voicing. Pitch-accent-curve synthesis driven by LAR energy produces the INT F0. The WORLD vocoder then synthesizes INT speech from these components.
Propose an approach for transforming alaryngeal speech
(LAR) to intelligible speech (INT):
1 Predict INT binary voicing/unvoicing and degree of voicing
(aperiodicity) from LAR spectrum using DNNs (VUV model
and AP model)
2 Predict INT spectrum from LAR spectrum using cGANs
(MCEP model)
3 Create synthetic F0 from a simple intonation model (Pitch
accent curve synthesis)
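The final resynthesis step could look like this sketch, assuming the predicted 2-band aperiodicity has already been expanded back to WORLD's full-band representation and all streams are frame-aligned; the function and variable names are illustrative.

```python
import numpy as np
import pyworld as pw

def synthesize_int(int_sp, int_ap, vuv, f0_syn, fs):
    # Gate the synthetic F0 with the predicted voicing decisions;
    # WORLD treats F0 == 0 as an unvoiced frame
    f0 = np.where(vuv > 0.5, f0_syn, 0.0)
    return pw.synthesize(f0.astype(np.float64),
                         int_sp.astype(np.float64),
                         int_ap.astype(np.float64), fs)
```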
48. Data
For source LAR speech, a database of 4 male speakers: 3
LAR-TEP speakers (L001, L002, L006) and 1 LAR-ELX
speaker (L004)
For target INT speech, the ideal option is a natural voice, such as
habitual or clear speech. I use a synthetic male voice instead due to:
expediency
the capability of creating a lot of data and arbitrary voices
Each speaker (LAR and INT) has 132 sentences
Use a random split of 100/16/16 sentences for training,
validation, and testing
49. Pre-training Data
Due to the limited amount of LAR training data, we use
pre-training to leverage general knowledge of speech
Use the multi-speaker TIMIT database for pre-training
Can we make a pre-training set that better matches LAR
speech?
Simulate LAR-TEP speech by creating a fully unvoiced version
of TIMIT (FU-TIMIT)
Simulate LAR-ELX speech by creating a fully voiced version of
TIMIT (FV-TIMIT)
Use the standard TIMIT split of 462/144/24 speakers for training,
validation, and testing
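A hedged sketch of how such simulations can be produced with pyworld: analyze each TIMIT utterance, then zero the F0 track (fully unvoiced) or force a constant voiced F0 (fully voiced) before resynthesis. The constant 100 Hz and the zeroed aperiodicity in the voiced case are illustrative choices, not values from the dissertation.

```python
import numpy as np
import pyworld as pw

def simulate(wav, fs, mode):
    # WORLD analysis of the original utterance
    f0, t = pw.harvest(wav, fs)
    sp = pw.cheaptrick(wav, f0, t, fs)
    ap = pw.d4c(wav, f0, t, fs)
    if mode == "FU":                    # fully unvoiced: no F0 anywhere
        f0 = np.zeros_like(f0)
    elif mode == "FV":                  # fully voiced: constant F0 everywhere
        f0 = np.full_like(f0, 100.0)    # illustrative constant pitch
        ap = np.zeros_like(ap)          # minimal aperiodicity (assumption)
    return pw.synthesize(f0, sp, ap, fs)
```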
50. Predicting Voicing and Degree of Voicing
Propose a method for predicting when speech should be
voiced, and the degree of voicing, from a spectrogram
Predict a binary voicing value (VUV) and continuous 2-band
aperiodicity (AP) values from mel-cepstral coefficients
(MCEP) using deep neural networks (DNNs)
Pre-train three kinds of speaker-independent DNNs using
either TIMIT, FU-TIMIT, or FV-TIMIT as training data
For each utterance in the training data, use the VUV and AP from
the corresponding TIMIT utterance as the target
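As a sketch, the predictors could be implemented as feedforward networks over MCEP input; the slides describe separate VUV and AP models, but for brevity this sketch shares a trunk with two heads, and all layer sizes are illustrative.

```python
import torch.nn as nn

class VuvApNet(nn.Module):
    def __init__(self, mcep_dim=25, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(mcep_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.vuv_head = nn.Linear(hidden, 1)   # binary voicing (logit)
        self.ap_head = nn.Linear(hidden, 2)    # 2-band aperiodicity values

    def forward(self, mcep):
        h = self.trunk(mcep)
        return self.vuv_head(h), self.ap_head(h)
```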
51. Evaluating Pre-trained models on their Test Data
For testing, apply the three pre-trained models (TIMIT,
FU-TIMIT, and FV-TIMIT) to their corresponding test data
Use balanced accuracy (BAC, defined as average recall) for
VUV classification (since the classes are imbalanced), and r2
for AP regression
Mapping            Pre-training set   BAC (r2)
TIMIT → TIMIT      TIMIT              0.99 (0.87)
FU-TIMIT → TIMIT   FU-TIMIT           0.89 (0.72)
FV-TIMIT → TIMIT   FV-TIMIT           0.93 (0.84)
Table: BAC with r2 in brackets; higher (closer to 1) is better
As expected, the TIMIT model works best because its training data
contains the voicing that we want to predict
FU-TIMIT and FV-TIMIT also work well
It is possible to predict voicing from spectral shape alone
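Both reported metrics are available in scikit-learn; a toy example of the calls (the labels and predictions below are made up, just to show usage):

```python
from sklearn.metrics import balanced_accuracy_score, r2_score

# Toy frame-level VUV labels/predictions and 2-band AP values/predictions
vuv_true, vuv_pred = [1, 1, 0, 0, 1], [1, 0, 0, 0, 1]
ap_true = [[0.1, 0.8], [0.2, 0.7]]
ap_pred = [[0.15, 0.75], [0.25, 0.65]]

print(balanced_accuracy_score(vuv_true, vuv_pred))  # average of per-class recall
print(r2_score(ap_true, ap_pred))                   # closer to 1 is better
```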
52. Evaluating Pre-trained models on LAR data
Test the pre-trained models, without adaptation, to predict target
INT VUV or AP from LAR-TEP and LAR-ELX speech
Mapping            TIMIT          FU-TIMIT       FV-TIMIT
L001 (TEP) → INT   0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT   0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT   0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT   0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
Table: BAC with r2 in brackets, by pre-training set
Our expectation was that matching the pre-training set to the
source speaker (FU-TIMIT for TEP, FV-TIMIT for ELX) would work best
Although the results do not entirely match this expectation,
we still need to adapt the models with LAR speech
53. Adapting Pre-trained models on LAR data
Adapt the pre-trained models with LAR-TEP and LAR-ELX
speech
Use speaker-specific adaptation due to the limited number of
speakers (similar to one-to-one mapping)
Adapt all weights in the DNN models (sketched below)
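Adaptation here amounts to continuing training on one LAR speaker's data; a generic sketch, assuming a model with a single output, with an illustrative learning rate and epoch count:

```python
import torch

def adapt(model, loader, loss_fn, lr=1e-4, epochs=20):
    # All parameters are passed to the optimizer, so all weights are adapted
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:           # one LAR speaker's (input, target) pairs
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```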
54. Evaluating Adapted models
Mapping            TIMIT          FU-TIMIT       FV-TIMIT
Before adaptation
L001 (TEP) → INT   0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT   0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT   0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT   0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
After adaptation
L001 (TEP) → INT   0.70 (0.22)    0.67 (0.21)    0.72 (0.23)
L002 (TEP) → INT   0.73 (0.43)    0.75 (0.43)    0.73 (0.43)
L004 (ELX) → INT   0.72 (0.29)    0.71 (0.27)    0.70 (0.29)
L006 (TEP) → INT   0.65 (0.04)    0.67 (0.05)    0.64 (0.05)
Table: BAC with r2 in brackets; higher is better
Adaptation always increases performance
Pre-training with FU-TIMIT or FV-TIMIT, as opposed to TIMIT,
did not help as expected
55. cGANs for Predicting Spectrum
Figure: cGAN for spectrum prediction. G takes the LAR MCEP (with left and right context) and generates INT MCEP; D receives either real or generated INT MCEP, conditioned on the LAR input, and decides: real or generated?
We use the same cGAN structure as before to generate the INT
spectrum from the LAR spectrum
57. Evaluating Pre-trained models
Pre-train the models due to the limited amount of LAR data

Mapping            Before   FU-TIMIT   FV-TIMIT
FU-TIMIT → TIMIT   11.3     7.64
FV-TIMIT → TIMIT   11.0                6.46
L001 (TEP) → INT   60.6     60.0       61.9
L002 (TEP) → INT   46.0     45.0       46.5
L004 (ELX) → INT   51.5     51.1       52.8
L006 (TEP) → INT   61.2     61.6       63.0
Table: Log spectral distortion (dB) before conversion and after applying models pre-trained on FU-TIMIT or FV-TIMIT
Predicting the TIMIT spectrum from FU- and FV-TIMIT spectra
gives 7.64 dB for FU-TIMIT and 6.46 dB for FV-TIMIT
This reduces log spectral distortion from 11.3 and 11.0 dB, respectively
Applying the pre-trained models to predict the INT spectrum from
the LAR spectrum gives no noticeable reduction in distortion
This lack of improvement is disappointing but not unexpected, as
FU-TIMIT and FV-TIMIT know nothing about LAR speech
58. Adapting Pre-trained models on LAR speech
Adapt pre-trained models on LAR speech
Mapping            FU-TIMIT      FV-TIMIT
L001 (TEP) → INT   32 (60.0)     32 (61.9)
L002 (TEP) → INT   33 (45.0)     33 (46.5)
L004 (ELX) → INT   31.5 (51.1)   32 (52.8)
L006 (TEP) → INT   37.8 (61.6)   37 (63.0)
Table: Log spectral distortion (dB) after adaptation, with the value before adaptation in brackets
As expected, adaptation always improved performance
Pre-training with FU-TIMIT versus FV-TIMIT has no noticeable
effect on adaptation
59. Synthesizing Pitch
F0 is not present in LAR speech
Use a phrase curve and a single accent curve to model the
intonation of each utterance
The phrase curve is a logarithmically falling curve from 140 to 60 Hz
The accent curve is linearly proportional to LAR energy
(sketched after the figure)
Figure: Example synthesized F0 contour (Hz versus frames).
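A minimal sketch of this intonation model, assuming `energy` is a per-frame LAR energy vector; the accent gain is an illustrative constant:

```python
import numpy as np

def synthesize_f0(energy, f0_start=140.0, f0_end=60.0, accent_gain=20.0):
    n = len(energy)
    # Phrase curve: logarithmic fall from 140 to 60 Hz
    # (linear interpolation in the log-frequency domain)
    phrase = np.exp(np.linspace(np.log(f0_start), np.log(f0_end), n))
    # Accent curve: linearly proportional to (normalized) LAR energy
    accent = accent_gain * energy / (np.max(energy) + 1e-10)
    return phrase + accent
```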
60. Overall Results
Conduct perceptual naturalness and intelligibility CMOS tests
Each participant listened to a pair of sentences A and B, consisting of
modified speech against the LAR speech
and answered "Is A more natural/intelligible than B?" on a 5-point
scale: "definitely worse" (−2), "worse" (−1), "same" (0),
"better" (+1), "definitely better" (+2)
There were 48 participants in each CMOS test
The LAR speech was analyzed and re-synthesized (using WORLD) to make the comparison fair
61. Intelligibility
INT-spectrum: LAR speech plus predicted spectrum
INT-intonation: LAR speech plus predicted voicing and F0
INT-all: LAR speech plus predicted spectrum, voicing, and F0
Speaker          INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.1           −0.1             0.1
L002 (TEP)       0.1            0.2              −0.3*
L004 (ELX)       −0.34*         0.34*            −0.2
L006 (TEP)       0.2            −0.1             −0.0
Table: Intelligibility CMOS scores; asterisks mark statistically significant differences
INT-intonation significantly increased intelligibility for L004
INT-all did not increase intelligibility
Overall, we did not observe an increase in intelligibility
62. Naturalness
Speaker          INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.0           −0.3*            0.4*
L002 (TEP)       −0.1           −0.0             0.1
L004 (ELX)       −0.56*         −0.25            0.22
L006 (TEP)       −0.3*          −0.2*            0.7*
Table: Naturalness CMOS scores; asterisks mark statistically significant differences
INT-all increased naturalness for all 4 speakers,
though only significantly for L001 and L006
However, the individual components alone (e.g., spectrum only)
show no improvement
63. Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
64. Conclusion
Aim 1: Determine effective spectral features for style
conversion
Proposed two sets of features: PPT and manifold features
(VAE-12)
VAE-12 is better than MCEP-12 and PPT for speech
reconstruction
VAE-12 combined with DNNs significantly increases
intelligibility for one speaker with Parkinson's disease, from 24% to 46%
65. Conclusion
Aim 2: Develop effective HAB-to-CLR style mappings
Proposed a spectral style mapping using cGANs for improving
speech intelligibility
For one-to-one mapping, cGANs outperform the DNN and
significantly increase intelligibility for 2 speakers (a typical
speaker and one with Parkinson's disease)
For many-to-one mapping, cGANs significantly increase
intelligibility for a speaker with Parkinson's disease
66. Conclusion
Aim 3: Develop effective methods for LAR-to-INT conversion
Proposed a method to predict binary voicing/unvoicing and the
degree of voicing (aperiodicity) from LAR MCEP using DNNs
Proposed a method to predict the INT spectrum from the LAR
spectrum using cGANs
Proposed a method to create a synthetic fundamental-frequency
trajectory from a simple intonation model
INT-intonation significantly increases intelligibility for 1
speaker
INT-all significantly increases naturalness for 2 speakers