SlideShare a Scribd company logo
1 of 4
Download to read offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 880
Rendering Of Voice By Using Convolutional Neural Network And With
The Help Of Text-To-Speech Module
Dhiren Patel, Aashay Ingle, Saurabh Joshi, Mandar Hegiste
1.B.sc Computer Science Graduate, Mumbai University, Mithibai College, Maharashtra - 400056
2.Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305
3.Software Engineer Graduate, Mumbai University, Xavier Institute of Technology, Maharashtra
4.Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - This paper describes a novel text-to-speech
(TTS) technique based on deep convolutional neural networks
(CNN), without any recurrent units. Recurrent neural network
(RNN) has been a standard technique to modelsequentialdata
recently, and this technique hasbeenusedinsomecutting-edge
neural TTS techniques. However, training RNN component
often requires a very powerful computer, or very long time
typically several days or weeks. Recent other studies, on the
otherhand,haveshownthatCNN-basedsequencesynthesiscan
be much faster than RNN-based techniques, because of high
parallelizability. The objective of this paper is to show an
alternative neural TTS system, based only on CNN, that can
alleviate these economic costs of training. In our experiment,
the proposed Deep Convolutional TTS can be sufficiently
trained only in a night (15 hours),usinganordinarygamingPC
equipped with two GPUs, while the quality of the synthesized
speech was almostacceptable.
Key Words: Text-to-speech, deep learning,
convolutional neural network, attention, sequence-to-
sequence learning.
1. INTRODUCTION
Text-to-speech (TTS) is getting more and more common
recently, and is getting to be a basic user interface for many
systems. Toencourage further use of TTS in various systems,
it is significant to develop a handy, maintainable, extensible
TTS component that is accessible to speech non-specialists,
enterprising individuals and small teams who do not have
massive computers.
Traditional TTS systems, however, are not necessarily
friendly forthem, as these systemsare typically composedof
many domain- specific modules. For example, a typical
parametric TTS system is an elaborate integration of many
modules e.g. a text analyzer
F0 generator, a spectrum generator, a pause estimator,
and a vocoder that synthesize a waveform from these data,
etc.
Deep learning sometimes can unite these internal build-
ing blocks into a single model, and directlyconnectstheinput
and the output; this type of technique is sometimes called
‘end-to-end’ learning. Although such a technique is
sometimes criticized as ‘a black box,’nevertheless,anend-to-
end TTS system named Tacotron, which directly estimates a
spectrogram from an in- put text, has achieved promising
performance recently, without intensively-engineered
parametric models based on domain-specific knowledge.
Tacotron, however, has a drawback that it exploits many
recurrent units, which are quite costly to train, making it
almost infeasible for ordinary labs without luxurious
machines to studyand extend it further. Indeed, somepeople
tried to implement open clones of Tacotron but they are
struggling to reproduce the speech of satisfactory quality as
clear as the original work.
The purpose of this paper is to show Deep Convolutional
TTS (DCTTS), a novel, handy neural TTS, which is fully
convolutional. The architecture is largely similar to Tacotron
but is based on a fully convolutional sequence-to-sequence
learning model similar to the literature We show this handy
TTSactuallyworksina reasonablesetting.Thecontributionof
this article is twofold: (1) Propose a fully CNN-based TTS
system which can be trained much faster than an RNN-based
state-of-the-art neural TTS system, while the sound quality is
stillacceptable.(2)Anideatorapidlytraintheattention,which
we call ‘guided attention,’ is also shown.
1.1 Related Work
Neural speech synthesis: Recently, there is a surge of
interest in speech synthesis with neural networks,including
Deep Voice 1 [Arik et al., 2017a], Deep Voice 2 [Arik et al.,
2017b], Deep Voice 3 [Ping et al., 2018], WaveNet [Oord et
al., 2016a], SampleRNN [Mehri et al., 2016], Char2Wav
[Sotelo et al., 2017], Tacotron [Wang et al., 2017] and
VoiceLoop [Taigman et al., 2018]. Among these methods,
sequence-to-sequence models [Ping et al., 2018, Wang et al.,
2017, Sotelo et al., 2017] with attention mechanism have
much simpler pipeline and can producemorenatural speech
[e.g., Shen et al., 2017]. In this work, we use Deep Voice 3 as
the baseline multi-speaker model, because of its simple
convolutional architecture and high efficiency for training
and fast model adaptation. It should be noted that our
techniques can be seamlessly applied to other neural speech
synthesis models.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 881
Few-shot generative modeling: Humans can learn new
generative tasks from only a few examples, which motivates
research on few-shot generative models. Early studies
mostly focus on Bayesianmethods.Forexample,hierarchical
Bayesian models are used to exploit compositionality and
causality for few-shot generation of characters [Lake et al.,
2013, 2015] and words in speech [Lake et al., 2014].
Recently, deep neural networks achieve great successes in
few-shot density estimation and conditional image
generation [e.g., Rezende et al., 2016, Reed et al.,2017,Azadi
et al., 2017], because of the great potential for composition
in their learned representation. In this work, we investigate
few-shot generative modeling of speech conditioned on a
particular speaker. We train a separate speaker encoding
network to directly predict the parameters of multi-speaker
generative model byonlytakingunsubscribedaudiosamples
as inputs.
Speaker-dependent speech processing: Speaker-
dependent modeling has been widely studied for automatic
speech recognition (ASR), with the goal of improving the
performance by exploiting speaker characteristics. In
particular, there are two groups of methodsinneural ASR,in
alignment with our two voice cloning approaches. The first
group is speaker adaptation for the whole-model [Yu et al.,
2013], a portion of the model [Miao and Metze, 2015, Cui et
al., 2017], or merely to a speaker embedding [Abdel-Hamid
and Jiang, 2013, Xue et al., 2014]. Speaker adaptation for
voice cloning is in the same vein as these approaches, but
differences arise when text-to-speech vs. speech-to-text are
considered [Yamagishi et al., 2009]. The second group is
based on training ASR models jointly with embeddings.
Extraction of the embeddings can be based on i-vectors
[Miao et al., 2015], or bottleneck layers of neural networks
trained with a classification loss [LiandWu,2015].Although
the general idea of speaker encoding is also based on
extracting the embeddings directly, as a major distinction,
our speaker encoder models are trained with an objective
function that is directly related to speech synthesis. Lastly,
speaker-dependent modeling is essential for multi-speaker
speech synthesis. Using i-vectors to represent speaker-
dependent characteristics is one approach [Wu et al., 2015],
however, they have the limitation of being separately
trained, with an objective that is not directly related to
speech synthesis. Also they may not be accurately extracted
with small amount of audio [Miao et al., 2015]. Another
approach for multi-speaker speech synthesis is using
trainable speaker embeddings [Arik et al., 2017b], whichare
randomly initialized and jointly optimizedfroma generative
loss function.
Voice conversion: A closely related task of voice cloningis
voice conversion. The goal of voice conversion is to modify
an utterance from source speaker to make it sound like the
target speaker, while keeping the linguistic contents
unchanged. Unlike voice cloning, voice conversion systems
do not need to generalize to unseen texts. One common
approach is dynamic frequency warping, to align spectra of
different speakers. Agiomyrgiannakis and Roupakia [2016]
proposes a dynamic programming algorithm that
simultaneously estimates the optimal frequency warping
and weighting transform while matching source and target
speakers using a matching-minimizationalgorithm.Wu et al.
[2016] uses a spectral conversion approach integrated with
the locally linear embeddings for manifold learning. There
are also approaches to model spectral conversion using
neural networks [Desai et al., 2010, Chen et al., 2014,Hwang
et al., 2015]. Those models are typically trained with a large
amount of audio pairs of target and source speakers.
1.2 Methodology
Our DCTTS model consists of two networks: (1)
Text2Mel, which synthesizea mel spectrogramfromaninput
text, and (2) SpectrogramSuper-resolutionNetwork (SSRN),
which convert a coarse mel spectrogram to the full STFT
spectrogram. Fig. 1 shows the overall architecture of the
proposed method. 3.1. Text2Mel: Text to Mel Spectrogram
Network We first consider to synthesize a coarse mel
spectrogram from a text. This is the main part of the
proposed method. This module consists offoursubmodules:
Text Encoder, Audio Encoder, Attention, andAudioDecoder.
The network TextEnc first encodes the input sentence L =
[l1, . . . , lN ] ∈ CharN consisting of N characters, into the two
matrices K, V ∈ R d×N . On the other hand, the network
AudioEnc encodes the coarse mel spectrogramS(=S1:F,1:T)
∈ R F ×T , of previously spoken speech, whose length is T,
into a matrix Q ∈ R d×T . (K, V ) = TextEnc(L). (1) Q =
AudioEnc(S1:F,1:T ). (2) An attention matrix A ∈ R N×T ,
defined as follows, evaluates howstronglythen-thcharacter
ln and t-th time frame S1:F,t are related, A =softmaxn-axis(K
TQ/√ d). (3) Ant ∼ 1 implies that the module is looking at n-
th character ln at the time frame t, and it will look at ln or
ln+1 or characters around them, at the subsequent time
frame t + 1. Whatever, let us expect those are encoded in the
n-th column of V . Thus a seed R ∈ R d×T , decoded to the
subsequent frames S1:F,2:T +1, is obtained as R = Att(Q, K, V
) := V A. (Note: matrix product.) (4) The resultant R is
concatenated with the encoded audio Q, as R 0 = [R, Q],
because we found it beneficial in our pilot study. Then, the
concatenated matrix R 0 ∈ R 2d×T is decoded by the Audio
Decoder module to synthesize a coarse mel spectrogram,
Y1:F,2:T +1 = AudioDec(R 0 ). (5) The result Y1:F,2:T +1 is
compared with the temporally-shifted groundtruthS1:F,2:T
+1, by a loss function Lspec(Y1:F,2:T+1|S1:F,2:T+1),and the
error is back-propagated to the network parameters. The
loss function was the sum of L1 loss and the binary
divergence Dbin, Dbin(Y |S) := Eft[−Sft log Yft − (1 − Sft)
log(1 − Yft)] = Eft[−SftYˆft + log(1 + exp Yˆft)], (6) where Yˆft
= logit(Yft). Since thebinary divergencegivesa nonvanishing
gradient to the network, ∂Dbin(Y |S)/∂Yˆft ∝ Yft − Sft, it is
advantageous in gradient-based training. It is easily verified
that the spectrogram error is non-negative, Lspec(Y |S) =
Dbin(Y |S) + E[|Yft − Sft|] ≥ 0, and the equality holds iff Y = S.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 882
3.1.1. Details of TextEnc, AudioEnc, and AudioDec Our
networks are fully convolutional, and are not dependent on
any recurrent units. Instead of RNN, we sometimes take
advantages of dilated convolution [32, 13, 24] to take long
contextual information into account. ThetopequationofFig.
2 is the content of TextEnc. It consists of the character
embedding and the stacked 1D non-causal convolution. A
previous literature[2]useda heavierRNN-basedcomponent
named ‘CBHG,’ but we found this simpler network also
works well. AudioEnc and AudioDec, shown in Fig. 2, are
composed of 1D causal convolution layers with Highway
activation. These convolution should be causal because the
output of AudioDec is feedbacked to theinputofAudioEncin
the synthesis stage. 3.2. Spectrogram Super-resolution
Network (SSRN) We finally synthesize a full spectrogram |Z|
∈ R F 0×4T , from the obtained coarse mel spectrogramY ∈R
F ×T , by a spectrogram super-resolution network (SSRN).
Upsampling frequency from F to F 0 is rather
straightforward. We can achieve that by increasing the
convolution channels of 1D convolutional network.
Upsampling in temporal direction is not similarly done, but
by twice applying deconvolution layers of stride size 2, we
can quadruple the length of sequence from T to 4T = T 0 .
The bottom equation of Fig. 2 shows SSRN. In this paper, as
we do not consider onlineprocessing,all convolutionscanbe
non-causal. The loss function wasthesameasText2Mel:sum
of binary divergence and L1 distance between the
synthesized spectrogram SSRN(S) and the ground truth |Z|.
Figure -1: Guided Attention
Guided Attention Loss: Motivation, Method and Effects In
general, an attention module is quite costly to train.
Therefore, if there is some prior knowledge, it may be a help
incorporating them into the model to alleviate the heavy
training. We show that the simple measure below is helpful
to train the attention module. In TTS, the possible attention
matrix A lies in the very small subspace of R N×T . This is
because of the rough correspondence of the order of the
characters and the audio segments. That is, if one reads a
text, it is natural to assume that the text position n
progresses nearly linearly to the time t, i.e., n ∼ at, wherea ∼
N/T. This is Fig. 3. Comparison of the attention matrix A,
trained with and without the guided attention loss Latt(A).
(Left) Without, and (Right) with the guided attention. The
test text is “icassp stands for the international conference on
acoustics, speech, and signal processing.” We did not usethe
heuristics described in section 4.2. the prominent difference
of TTS from other seq2seq learning techniques such as
machine translation, in which an attention module should
resolve the word alignment between two languages that
have very different syntax, e.g. English and Japanese. Based
on this idea, we introduce another constraint on the
attention matrix A to prompt it to be ‘nearly diagonal,’
Latt(A) = Ent[AntWnt], where Wnt = 1 − exp{−(n/N − t/T) 2
/2g 2 }. In this paper, we set g = 0.2. If A is far from diagonal
(e.g., reading the characters in the random order), it is
strongly penalized by the loss function. This subsidiary loss
function is simultaneously optimized with the main loss
Lspec with equal weight. Although this measure is based on
quite a rough assumption,itimprovedthetrainingefficiency.
In our experiment, if we added the guided attention loss to
the objective, the term began decreasing only after ∼100
iterations. After ∼5K iterations, the attention became
roughly correct, not only for training data, but also for new
input texts. On the other hand, without the guided attention
loss, it required much more iterations.Itbeganlearningafter
∼10K iterations, and it required ∼50K iterations to look at
roughly correct positions, but the attention matrix was still
vague. Fig. 3 compares the attention matrix,trainedwith and
without guided attention loss. 4.2. Forcibly Incremental
Attention in Synthesis Stage In the synthesis stage, the
attention matrix A sometimes fails to look at the correct
characters. Typical errors we observed were (1) it
occasionally skipped several characters, and (2) it
repeatedly read a same word twice or more. In order to
make the system more robust, we heuristically modify the
matrix A to be ‘nearly diagonal,’ by a simple rule as follows.
We observed this device sometimes alleviated such
misattentions. Let nt be the position of the character to be
read at tth time frame; nt = argmaxnAn,t. Comparing the
current position nt and the previous position nt−1, unless
the difference nt − nt−1 is within the range −1 ≤ nt − nt−1 ≤
3, the current attention position forcibly set to An,t =
δn,nt−1+1 (Kronecker’sdelta),toforciblymaketheattention
target incremental, i.e., nt − nt−1 = 1
Figure 2 : Forcibly Incremental Attention
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 883
Forcibly Incremental Attention in Synthesis Stage In the
synthesis stage, the attention matrix A sometimes fails to
look at the correct characters. Typical errors we observed
were (1) it occasionally skippedseveral characters,and(2)it
repeatedly read a same word twice or more. In order to
make the system more robust, we heuristically modify the
matrix A to be ‘nearly diagonal,’ by a simple rule as follows.
We observed this device sometimes alleviated such
misattentions. Let nt be the position of the character to be
read at tth time frame; nt = argmaxnAn,t. Comparing the
current position nt and the previous position nt−1, unless
the difference nt − nt−1 is within the range −1 ≤ nt − nt−1 ≤
3, the current attention position forcibly set to An,t =
δn,nt−1+1 (Kronecker’sdelta),toforciblymaketheattention
target incremental, i.e., nt − nt−1 = 1.
2. SUMMARY AND FUTURE WORK
This paper described a novel text-to-speech (TTS)
technique based on deep convolutional neural networks
(CNN), as well as a technique to train the attention module
rapidly. In our experiment,theproposedDeepConvolutional
TTS can be sufficiently trained only in a night (∼15 hours),
using an ordinary gaming PC equipped withtwoGPUs, while
the quality of the synthesized speech was almostacceptable.
Although the audio quality is far from perfect yet, it may be
improved by tuningsomehyper-parametersthoroughly,and
by applying some techniques developed in deep learning
community. We believe this handy method encourages
further development of the applications based on speech
synthesis. We can expect this simple neural TTS may be
extended to other versatile purposes, such as
emotional/non-linguistic/personalized speech synthesis,
singing voice synthesis, music synthesis, etc., by further
studies. In addition, since a neural TTS has become this
lightweight, the studies on more integrated speech systems
e.g. some multimodal systems, simultaneous training of
TTS+ASR, and speech translation, etc., may have become
more feasible. These issues should be worked out in the
future.
3. CONCLUSIONS
This paper implements a method to clone and
render voice usingConvolutional Neural Networksinsteadof
widely-used traditional complete Neural Networks which
helps to reduce training time significantly. We also
implement an additional system of 'attention' which helps
increase precision in case of long sentences. Although, fine
parametric tuning can be performed to further increase the
output quality and accuracy, there is surely a comparable
scope with existing systems in the near future..
We are aiming to also classify the type of
plasmodium parasite for which we would use machine
learning technologies.
REFERENCES
[1] I. Goodfellow et al., Deep Learning, MIT Press,2016,http:
//www.deeplearningbook.org.
[2] Y. Wang et al., “Tacotron: Towards end-to-end speech
synthesis,” in Proc. Interspeech,2017,arXiv:1703.10135. [3]
A. Barron, “Implementation of Google’s Tacotron in
TensorFlow,” 2017, Available at GitHub, https://github.
com/barronalex/Tacotron (visited Oct. 2017).
[4] K. Park, “A TensorFlow implementation of Tacotron: A
fully end-to-end text-to-speech synthesis model,” 2017,
AvailableatGitHub,https://github.com/Kyubyong/tacotron
(visited Oct. 2017).
[5] K. Ito, “Tacotron speech synthesis implemented in
TensorFlow, with samples and a pre-trained model,” 2017,
Available at GitHub, https://github.com/keithito/ tacotron
(visited Oct. 2017).
[6] R. Yamamoto, “PyTorch implementation of Tacotron
speech synthesis model,” 2017, Available at GitHub,https://
github.com/r9y9/tacotron_pytorch (visited Oct. 2017).
[7] J. Gehring et al., “Convolutional sequence to sequence
learning,” in Proc. ICML, 2017, pp. 1243–1252,
arXiv:1705.03122.
[8] H. Zen et al., “Statistical parametric speech synthesis
using deep neural networks,” in Proc. ICASSP, 2013, pp.
7962–7966.
[9] Y. Fan et al., “TTS synthesis with bidirectional LSTM
based recurrent neural networks,” in Proc. Interspeech,
2014, pp. 1964–1968.
[10] H. Zen and H. Sak, “Unidirectional long short-term
memory recurrent neural network with recurrent output
layer for lowlatency speech synthesis,” in Proc. ICASSP,
2015, pp. 4470– 4474.
[11] S. Achanta et al., “An investigation of recurrent neural
network architectures for statistical parametric speech
synthesis.,” in Proc. Interspeech, 2015, pp. 859–863.
[12] Z. Wu and S. King, “Investigating gated recurrent
networks for speech synthesis,” in Proc. ICASSP, 2016, pp.
5140–5144.
[13] A. van den Oord et al., “WaveNet:Agenerativemodel for
raw audio,” arXiv:1609.03499, 2016.

More Related Content

Similar to Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module

Autotuned voice cloning enabling multilingualism
Autotuned voice cloning enabling multilingualismAutotuned voice cloning enabling multilingualism
Autotuned voice cloning enabling multilingualismIRJET Journal
 
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIRJET Journal
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...IAESIJAI
 
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATIONSPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATIONIRJET Journal
 
A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...IJECEIAES
 
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert SystemModeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert Systemcsandit
 
IRJET- A Review on Audible Sound Analysis based on State Clustering throu...
IRJET-  	  A Review on Audible Sound Analysis based on State Clustering throu...IRJET-  	  A Review on Audible Sound Analysis based on State Clustering throu...
IRJET- A Review on Audible Sound Analysis based on State Clustering throu...IRJET Journal
 
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...IJCI JOURNAL
 
A Review On Chatbot Design And Implementation Techniques
A Review On Chatbot Design And Implementation TechniquesA Review On Chatbot Design And Implementation Techniques
A Review On Chatbot Design And Implementation TechniquesJoe Osborn
 
IRJET - A Review on Chatbot Design and Implementation Techniques
IRJET -  	  A Review on Chatbot Design and Implementation TechniquesIRJET -  	  A Review on Chatbot Design and Implementation Techniques
IRJET - A Review on Chatbot Design and Implementation TechniquesIRJET Journal
 
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)IRJET Journal
 
IRJET - Mobile Chatbot for Information Search
 IRJET - Mobile Chatbot for Information Search IRJET - Mobile Chatbot for Information Search
IRJET - Mobile Chatbot for Information SearchIRJET Journal
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingEfficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingIOSR Journals
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingScott Faria
 
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGS
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGSAI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGS
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGSIRJET Journal
 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationIRJET Journal
 
Automatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersAutomatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersIRJET Journal
 
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKSENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKijnlc
 

Similar to Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module (20)

Autotuned voice cloning enabling multilingualism
Autotuned voice cloning enabling multilingualismAutotuned voice cloning enabling multilingualism
Autotuned voice cloning enabling multilingualism
 
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...
 
IRJET- Vocal Code
IRJET- Vocal CodeIRJET- Vocal Code
IRJET- Vocal Code
 
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATIONSPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
SPEECH RECOGNITION WITH LANGUAGE SPECIFICATION
 
A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...
 
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert SystemModeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
 
IRJET- A Review on Audible Sound Analysis based on State Clustering throu...
IRJET-  	  A Review on Audible Sound Analysis based on State Clustering throu...IRJET-  	  A Review on Audible Sound Analysis based on State Clustering throu...
IRJET- A Review on Audible Sound Analysis based on State Clustering throu...
 
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...
The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition...
 
A Review On Chatbot Design And Implementation Techniques
A Review On Chatbot Design And Implementation TechniquesA Review On Chatbot Design And Implementation Techniques
A Review On Chatbot Design And Implementation Techniques
 
IRJET - A Review on Chatbot Design and Implementation Techniques
IRJET -  	  A Review on Chatbot Design and Implementation TechniquesIRJET -  	  A Review on Chatbot Design and Implementation Techniques
IRJET - A Review on Chatbot Design and Implementation Techniques
 
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)
IDE Code Compiler for the physically challenged (Deaf, Blind & Mute)
 
Speech driven gesture generation with Autoencoders - Project
Speech driven gesture generation with Autoencoders - ProjectSpeech driven gesture generation with Autoencoders - Project
Speech driven gesture generation with Autoencoders - Project
 
IRJET - Mobile Chatbot for Information Search
 IRJET - Mobile Chatbot for Information Search IRJET - Mobile Chatbot for Information Search
IRJET - Mobile Chatbot for Information Search
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingEfficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And Recording
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language Processing
 
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGS
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGSAI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGS
AI POWERED TRANSCRIBER TO RECORD THE CONFERENCE PROCEEDINGS
 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translation
 
Automatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersAutomatic Grading of Handwritten Answers
Automatic Grading of Handwritten Answers
 
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKSENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTUREIRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsIRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 

Recently uploaded (20)

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 880 Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module Dhiren Patel, Aashay Ingle, Saurabh Joshi, Mandar Hegiste 1.B.sc Computer Science Graduate, Mumbai University, Mithibai College, Maharashtra - 400056 2.Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305 3.Software Engineer Graduate, Mumbai University, Xavier Institute of Technology, Maharashtra 4.Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305 ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units. Recurrent neural network (RNN) has been a standard technique to modelsequentialdata recently, and this technique hasbeenusedinsomecutting-edge neural TTS techniques. However, training RNN component often requires a very powerful computer, or very long time typically several days or weeks. Recent other studies, on the otherhand,haveshownthatCNN-basedsequencesynthesiscan be much faster than RNN-based techniques, because of high parallelizability. The objective of this paper is to show an alternative neural TTS system, based only on CNN, that can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS can be sufficiently trained only in a night (15 hours),usinganordinarygamingPC equipped with two GPUs, while the quality of the synthesized speech was almostacceptable. Key Words: Text-to-speech, deep learning, convolutional neural network, attention, sequence-to- sequence learning. 1. INTRODUCTION Text-to-speech (TTS) is getting more and more common recently, and is getting to be a basic user interface for many systems. Toencourage further use of TTS in various systems, it is significant to develop a handy, maintainable, extensible TTS component that is accessible to speech non-specialists, enterprising individuals and small teams who do not have massive computers. Traditional TTS systems, however, are not necessarily friendly forthem, as these systemsare typically composedof many domain- specific modules. For example, a typical parametric TTS system is an elaborate integration of many modules e.g. a text analyzer F0 generator, a spectrum generator, a pause estimator, and a vocoder that synthesize a waveform from these data, etc. Deep learning sometimes can unite these internal build- ing blocks into a single model, and directlyconnectstheinput and the output; this type of technique is sometimes called ‘end-to-end’ learning. Although such a technique is sometimes criticized as ‘a black box,’nevertheless,anend-to- end TTS system named Tacotron, which directly estimates a spectrogram from an in- put text, has achieved promising performance recently, without intensively-engineered parametric models based on domain-specific knowledge. Tacotron, however, has a drawback that it exploits many recurrent units, which are quite costly to train, making it almost infeasible for ordinary labs without luxurious machines to studyand extend it further. Indeed, somepeople tried to implement open clones of Tacotron but they are struggling to reproduce the speech of satisfactory quality as clear as the original work. The purpose of this paper is to show Deep Convolutional TTS (DCTTS), a novel, handy neural TTS, which is fully convolutional. The architecture is largely similar to Tacotron but is based on a fully convolutional sequence-to-sequence learning model similar to the literature We show this handy TTSactuallyworksina reasonablesetting.Thecontributionof this article is twofold: (1) Propose a fully CNN-based TTS system which can be trained much faster than an RNN-based state-of-the-art neural TTS system, while the sound quality is stillacceptable.(2)Anideatorapidlytraintheattention,which we call ‘guided attention,’ is also shown. 1.1 Related Work Neural speech synthesis: Recently, there is a surge of interest in speech synthesis with neural networks,including Deep Voice 1 [Arik et al., 2017a], Deep Voice 2 [Arik et al., 2017b], Deep Voice 3 [Ping et al., 2018], WaveNet [Oord et al., 2016a], SampleRNN [Mehri et al., 2016], Char2Wav [Sotelo et al., 2017], Tacotron [Wang et al., 2017] and VoiceLoop [Taigman et al., 2018]. Among these methods, sequence-to-sequence models [Ping et al., 2018, Wang et al., 2017, Sotelo et al., 2017] with attention mechanism have much simpler pipeline and can producemorenatural speech [e.g., Shen et al., 2017]. In this work, we use Deep Voice 3 as the baseline multi-speaker model, because of its simple convolutional architecture and high efficiency for training and fast model adaptation. It should be noted that our techniques can be seamlessly applied to other neural speech synthesis models.
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 881 Few-shot generative modeling: Humans can learn new generative tasks from only a few examples, which motivates research on few-shot generative models. Early studies mostly focus on Bayesianmethods.Forexample,hierarchical Bayesian models are used to exploit compositionality and causality for few-shot generation of characters [Lake et al., 2013, 2015] and words in speech [Lake et al., 2014]. Recently, deep neural networks achieve great successes in few-shot density estimation and conditional image generation [e.g., Rezende et al., 2016, Reed et al.,2017,Azadi et al., 2017], because of the great potential for composition in their learned representation. In this work, we investigate few-shot generative modeling of speech conditioned on a particular speaker. We train a separate speaker encoding network to directly predict the parameters of multi-speaker generative model byonlytakingunsubscribedaudiosamples as inputs. Speaker-dependent speech processing: Speaker- dependent modeling has been widely studied for automatic speech recognition (ASR), with the goal of improving the performance by exploiting speaker characteristics. In particular, there are two groups of methodsinneural ASR,in alignment with our two voice cloning approaches. The first group is speaker adaptation for the whole-model [Yu et al., 2013], a portion of the model [Miao and Metze, 2015, Cui et al., 2017], or merely to a speaker embedding [Abdel-Hamid and Jiang, 2013, Xue et al., 2014]. Speaker adaptation for voice cloning is in the same vein as these approaches, but differences arise when text-to-speech vs. speech-to-text are considered [Yamagishi et al., 2009]. The second group is based on training ASR models jointly with embeddings. Extraction of the embeddings can be based on i-vectors [Miao et al., 2015], or bottleneck layers of neural networks trained with a classification loss [LiandWu,2015].Although the general idea of speaker encoding is also based on extracting the embeddings directly, as a major distinction, our speaker encoder models are trained with an objective function that is directly related to speech synthesis. Lastly, speaker-dependent modeling is essential for multi-speaker speech synthesis. Using i-vectors to represent speaker- dependent characteristics is one approach [Wu et al., 2015], however, they have the limitation of being separately trained, with an objective that is not directly related to speech synthesis. Also they may not be accurately extracted with small amount of audio [Miao et al., 2015]. Another approach for multi-speaker speech synthesis is using trainable speaker embeddings [Arik et al., 2017b], whichare randomly initialized and jointly optimizedfroma generative loss function. Voice conversion: A closely related task of voice cloningis voice conversion. The goal of voice conversion is to modify an utterance from source speaker to make it sound like the target speaker, while keeping the linguistic contents unchanged. Unlike voice cloning, voice conversion systems do not need to generalize to unseen texts. One common approach is dynamic frequency warping, to align spectra of different speakers. Agiomyrgiannakis and Roupakia [2016] proposes a dynamic programming algorithm that simultaneously estimates the optimal frequency warping and weighting transform while matching source and target speakers using a matching-minimizationalgorithm.Wu et al. [2016] uses a spectral conversion approach integrated with the locally linear embeddings for manifold learning. There are also approaches to model spectral conversion using neural networks [Desai et al., 2010, Chen et al., 2014,Hwang et al., 2015]. Those models are typically trained with a large amount of audio pairs of target and source speakers. 1.2 Methodology Our DCTTS model consists of two networks: (1) Text2Mel, which synthesizea mel spectrogramfromaninput text, and (2) SpectrogramSuper-resolutionNetwork (SSRN), which convert a coarse mel spectrogram to the full STFT spectrogram. Fig. 1 shows the overall architecture of the proposed method. 3.1. Text2Mel: Text to Mel Spectrogram Network We first consider to synthesize a coarse mel spectrogram from a text. This is the main part of the proposed method. This module consists offoursubmodules: Text Encoder, Audio Encoder, Attention, andAudioDecoder. The network TextEnc first encodes the input sentence L = [l1, . . . , lN ] ∈ CharN consisting of N characters, into the two matrices K, V ∈ R d×N . On the other hand, the network AudioEnc encodes the coarse mel spectrogramS(=S1:F,1:T) ∈ R F ×T , of previously spoken speech, whose length is T, into a matrix Q ∈ R d×T . (K, V ) = TextEnc(L). (1) Q = AudioEnc(S1:F,1:T ). (2) An attention matrix A ∈ R N×T , defined as follows, evaluates howstronglythen-thcharacter ln and t-th time frame S1:F,t are related, A =softmaxn-axis(K TQ/√ d). (3) Ant ∼ 1 implies that the module is looking at n- th character ln at the time frame t, and it will look at ln or ln+1 or characters around them, at the subsequent time frame t + 1. Whatever, let us expect those are encoded in the n-th column of V . Thus a seed R ∈ R d×T , decoded to the subsequent frames S1:F,2:T +1, is obtained as R = Att(Q, K, V ) := V A. (Note: matrix product.) (4) The resultant R is concatenated with the encoded audio Q, as R 0 = [R, Q], because we found it beneficial in our pilot study. Then, the concatenated matrix R 0 ∈ R 2d×T is decoded by the Audio Decoder module to synthesize a coarse mel spectrogram, Y1:F,2:T +1 = AudioDec(R 0 ). (5) The result Y1:F,2:T +1 is compared with the temporally-shifted groundtruthS1:F,2:T +1, by a loss function Lspec(Y1:F,2:T+1|S1:F,2:T+1),and the error is back-propagated to the network parameters. The loss function was the sum of L1 loss and the binary divergence Dbin, Dbin(Y |S) := Eft[−Sft log Yft − (1 − Sft) log(1 − Yft)] = Eft[−SftYˆft + log(1 + exp Yˆft)], (6) where Yˆft = logit(Yft). Since thebinary divergencegivesa nonvanishing gradient to the network, ∂Dbin(Y |S)/∂Yˆft ∝ Yft − Sft, it is advantageous in gradient-based training. It is easily verified that the spectrogram error is non-negative, Lspec(Y |S) = Dbin(Y |S) + E[|Yft − Sft|] ≥ 0, and the equality holds iff Y = S.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 882 3.1.1. Details of TextEnc, AudioEnc, and AudioDec Our networks are fully convolutional, and are not dependent on any recurrent units. Instead of RNN, we sometimes take advantages of dilated convolution [32, 13, 24] to take long contextual information into account. ThetopequationofFig. 2 is the content of TextEnc. It consists of the character embedding and the stacked 1D non-causal convolution. A previous literature[2]useda heavierRNN-basedcomponent named ‘CBHG,’ but we found this simpler network also works well. AudioEnc and AudioDec, shown in Fig. 2, are composed of 1D causal convolution layers with Highway activation. These convolution should be causal because the output of AudioDec is feedbacked to theinputofAudioEncin the synthesis stage. 3.2. Spectrogram Super-resolution Network (SSRN) We finally synthesize a full spectrogram |Z| ∈ R F 0×4T , from the obtained coarse mel spectrogramY ∈R F ×T , by a spectrogram super-resolution network (SSRN). Upsampling frequency from F to F 0 is rather straightforward. We can achieve that by increasing the convolution channels of 1D convolutional network. Upsampling in temporal direction is not similarly done, but by twice applying deconvolution layers of stride size 2, we can quadruple the length of sequence from T to 4T = T 0 . The bottom equation of Fig. 2 shows SSRN. In this paper, as we do not consider onlineprocessing,all convolutionscanbe non-causal. The loss function wasthesameasText2Mel:sum of binary divergence and L1 distance between the synthesized spectrogram SSRN(S) and the ground truth |Z|. Figure -1: Guided Attention Guided Attention Loss: Motivation, Method and Effects In general, an attention module is quite costly to train. Therefore, if there is some prior knowledge, it may be a help incorporating them into the model to alleviate the heavy training. We show that the simple measure below is helpful to train the attention module. In TTS, the possible attention matrix A lies in the very small subspace of R N×T . This is because of the rough correspondence of the order of the characters and the audio segments. That is, if one reads a text, it is natural to assume that the text position n progresses nearly linearly to the time t, i.e., n ∼ at, wherea ∼ N/T. This is Fig. 3. Comparison of the attention matrix A, trained with and without the guided attention loss Latt(A). (Left) Without, and (Right) with the guided attention. The test text is “icassp stands for the international conference on acoustics, speech, and signal processing.” We did not usethe heuristics described in section 4.2. the prominent difference of TTS from other seq2seq learning techniques such as machine translation, in which an attention module should resolve the word alignment between two languages that have very different syntax, e.g. English and Japanese. Based on this idea, we introduce another constraint on the attention matrix A to prompt it to be ‘nearly diagonal,’ Latt(A) = Ent[AntWnt], where Wnt = 1 − exp{−(n/N − t/T) 2 /2g 2 }. In this paper, we set g = 0.2. If A is far from diagonal (e.g., reading the characters in the random order), it is strongly penalized by the loss function. This subsidiary loss function is simultaneously optimized with the main loss Lspec with equal weight. Although this measure is based on quite a rough assumption,itimprovedthetrainingefficiency. In our experiment, if we added the guided attention loss to the objective, the term began decreasing only after ∼100 iterations. After ∼5K iterations, the attention became roughly correct, not only for training data, but also for new input texts. On the other hand, without the guided attention loss, it required much more iterations.Itbeganlearningafter ∼10K iterations, and it required ∼50K iterations to look at roughly correct positions, but the attention matrix was still vague. Fig. 3 compares the attention matrix,trainedwith and without guided attention loss. 4.2. Forcibly Incremental Attention in Synthesis Stage In the synthesis stage, the attention matrix A sometimes fails to look at the correct characters. Typical errors we observed were (1) it occasionally skipped several characters, and (2) it repeatedly read a same word twice or more. In order to make the system more robust, we heuristically modify the matrix A to be ‘nearly diagonal,’ by a simple rule as follows. We observed this device sometimes alleviated such misattentions. Let nt be the position of the character to be read at tth time frame; nt = argmaxnAn,t. Comparing the current position nt and the previous position nt−1, unless the difference nt − nt−1 is within the range −1 ≤ nt − nt−1 ≤ 3, the current attention position forcibly set to An,t = δn,nt−1+1 (Kronecker’sdelta),toforciblymaketheattention target incremental, i.e., nt − nt−1 = 1 Figure 2 : Forcibly Incremental Attention
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 883 Forcibly Incremental Attention in Synthesis Stage In the synthesis stage, the attention matrix A sometimes fails to look at the correct characters. Typical errors we observed were (1) it occasionally skippedseveral characters,and(2)it repeatedly read a same word twice or more. In order to make the system more robust, we heuristically modify the matrix A to be ‘nearly diagonal,’ by a simple rule as follows. We observed this device sometimes alleviated such misattentions. Let nt be the position of the character to be read at tth time frame; nt = argmaxnAn,t. Comparing the current position nt and the previous position nt−1, unless the difference nt − nt−1 is within the range −1 ≤ nt − nt−1 ≤ 3, the current attention position forcibly set to An,t = δn,nt−1+1 (Kronecker’sdelta),toforciblymaketheattention target incremental, i.e., nt − nt−1 = 1. 2. SUMMARY AND FUTURE WORK This paper described a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), as well as a technique to train the attention module rapidly. In our experiment,theproposedDeepConvolutional TTS can be sufficiently trained only in a night (∼15 hours), using an ordinary gaming PC equipped withtwoGPUs, while the quality of the synthesized speech was almostacceptable. Although the audio quality is far from perfect yet, it may be improved by tuningsomehyper-parametersthoroughly,and by applying some techniques developed in deep learning community. We believe this handy method encourages further development of the applications based on speech synthesis. We can expect this simple neural TTS may be extended to other versatile purposes, such as emotional/non-linguistic/personalized speech synthesis, singing voice synthesis, music synthesis, etc., by further studies. In addition, since a neural TTS has become this lightweight, the studies on more integrated speech systems e.g. some multimodal systems, simultaneous training of TTS+ASR, and speech translation, etc., may have become more feasible. These issues should be worked out in the future. 3. CONCLUSIONS This paper implements a method to clone and render voice usingConvolutional Neural Networksinsteadof widely-used traditional complete Neural Networks which helps to reduce training time significantly. We also implement an additional system of 'attention' which helps increase precision in case of long sentences. Although, fine parametric tuning can be performed to further increase the output quality and accuracy, there is surely a comparable scope with existing systems in the near future.. We are aiming to also classify the type of plasmodium parasite for which we would use machine learning technologies. REFERENCES [1] I. Goodfellow et al., Deep Learning, MIT Press,2016,http: //www.deeplearningbook.org. [2] Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech,2017,arXiv:1703.10135. [3] A. Barron, “Implementation of Google’s Tacotron in TensorFlow,” 2017, Available at GitHub, https://github. com/barronalex/Tacotron (visited Oct. 2017). [4] K. Park, “A TensorFlow implementation of Tacotron: A fully end-to-end text-to-speech synthesis model,” 2017, AvailableatGitHub,https://github.com/Kyubyong/tacotron (visited Oct. 2017). [5] K. Ito, “Tacotron speech synthesis implemented in TensorFlow, with samples and a pre-trained model,” 2017, Available at GitHub, https://github.com/keithito/ tacotron (visited Oct. 2017). [6] R. Yamamoto, “PyTorch implementation of Tacotron speech synthesis model,” 2017, Available at GitHub,https:// github.com/r9y9/tacotron_pytorch (visited Oct. 2017). [7] J. Gehring et al., “Convolutional sequence to sequence learning,” in Proc. ICML, 2017, pp. 1243–1252, arXiv:1705.03122. [8] H. Zen et al., “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962–7966. [9] Y. Fan et al., “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proc. Interspeech, 2014, pp. 1964–1968. [10] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for lowlatency speech synthesis,” in Proc. ICASSP, 2015, pp. 4470– 4474. [11] S. Achanta et al., “An investigation of recurrent neural network architectures for statistical parametric speech synthesis.,” in Proc. Interspeech, 2015, pp. 859–863. [12] Z. Wu and S. King, “Investigating gated recurrent networks for speech synthesis,” in Proc. ICASSP, 2016, pp. 5140–5144. [13] A. van den Oord et al., “WaveNet:Agenerativemodel for raw audio,” arXiv:1609.03499, 2016.