HIGH QUALITY SPEECH CODING USING SAMPLERNN
Cong Zhou
Dolby Laboratories
[Figure: generic speech codec. Encoder: Speech → Vocoder Analysis → Entropy Coding → Bitstream. Decoder: Bitstream → Entropy Decoding → Vocoder Synthesis → Speech.]
SPARK JOY
Inspired by the Video Coding presentation by Anne Aaron, Director of Video Algorithms at Netflix
© 2019 DOLBY LABORATORIES, INC.
Raw Audio Generative Models
• Sequential generative models
o Directly estimate waveform distributions
$p(X) = \prod_{i=0}^{T-1} p(x_{i+1} \mid x_1, \ldots, x_i)$
o Breakthrough success in generating realistic speech
o WaveNet [1], SampleRNN [2], WaveRNN [3]
[1] Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
[2] Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).
[3] Kalchbrenner, Nal, et al. "Efficient neural audio synthesis." arXiv preprint arXiv:1802.08435 (2018).
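The chain-rule factorization of p(X) can be made concrete with a toy model. The Python sketch below is illustrative only: `toy_cond_prob` is a hypothetical first-order conditional over 4 quantization levels, standing in for the deep network that WaveNet, SampleRNN or WaveRNN would use, and all names are assumptions. `log_likelihood` sums the log conditionals term by term, and `sample` generates one sample at a time, feeding each draw back into the history.

```python
import math
import random
from typing import Callable, List, Sequence

# A conditional model returns p(value | history); here it is a hypothetical toy.
CondProb = Callable[[Sequence[int], int], float]

LEVELS = 4  # hypothetical number of quantization levels


def toy_cond_prob(history: Sequence[int], value: int, stay: float = 0.7) -> float:
    """Hypothetical first-order model: repeat the previous level with probability `stay`."""
    if not history:
        return 1.0 / LEVELS  # uniform distribution for the first sample
    if value == history[-1]:
        return stay
    return (1.0 - stay) / (LEVELS - 1)


def log_likelihood(x: Sequence[int], cond_prob: CondProb) -> float:
    """log p(X) = sum_i log p(x_{i+1} | x_1, ..., x_i)."""
    return sum(math.log(cond_prob(x[:i], x[i])) for i in range(len(x)))


def sample(n: int, cond_prob: CondProb, rng: random.Random) -> List[int]:
    """Ancestral sampling: draw each sample given everything generated so far."""
    x: List[int] = []
    for _ in range(n):
        weights = [cond_prob(x, v) for v in range(LEVELS)]
        x.append(rng.choices(range(LEVELS), weights=weights)[0])
    return x


if __name__ == "__main__":
    # Equals log(0.25) + log(0.7) + log(0.1) for this toy model
    print(log_likelihood([0, 0, 1], toy_cond_prob))
    print(sample(10, toy_cond_prob, random.Random(0)))
```

Generation cost grows linearly with signal length, because every output sample needs its own model evaluation (16,000 evaluations per second of 16 kHz audio).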
SampleRNN
SampleRNN: multi-rate RNN-based generative model (MILA)
SampleRNN with conditioning
[Figure: SampleRNN with four tiers. Tiers 4, 3 and 2 each contain a GRU, learned upsampling, a summation (+) and 1×1 convolutions; Tier 1 is a 1×1 convolution followed by an MLP that outputs p(x_i | x_{<i}, h_t). Tier k takes the previous samples x_{i-FS(k)}, …, x_{i-1}, and the conditioning vector h_t is fed into the network.]
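The multi-rate structure in the figure can be sketched as a schedule of which tiers run for each output sample. This is a minimal sketch under stated assumptions: the frame sizes in `FRAME_SIZES` are hypothetical (the talk does not give them), and the GRUs, 1×1 convolutions, MLP and conditioning h_t are not implemented. Only the indexing of the tier inputs x_{i-FS(k)}, …, x_{i-1}, the update rates and the upsampling ratios are shown.

```python
from typing import Dict, List, Sequence

# Hypothetical frame sizes FS(k) for tiers 1..4; the talk does not state the values.
FRAME_SIZES: Dict[int, int] = {1: 4, 2: 4, 3: 16, 4: 64}


def tier_inputs(x: Sequence[float], i: int, k: int,
                frame_sizes: Dict[int, int] = FRAME_SIZES) -> List[float]:
    """Return x_{i-FS(k)}, ..., x_{i-1}, zero-padded before the start of the signal."""
    fs = frame_sizes[k]
    return [x[j] if j >= 0 else 0.0 for j in range(i - fs, i)]


def update_period(k: int, frame_sizes: Dict[int, int] = FRAME_SIZES) -> int:
    """Tier 1 (MLP) runs once per sample; tier k >= 2 runs its GRU once per frame of FS(k) samples."""
    return 1 if k == 1 else frame_sizes[k]


def tiers_updating(i: int, frame_sizes: Dict[int, int] = FRAME_SIZES) -> List[int]:
    """Tiers whose network is evaluated when predicting sample i."""
    return [k for k in sorted(frame_sizes) if i % update_period(k, frame_sizes) == 0]


def upsample_ratio(k: int, frame_sizes: Dict[int, int] = FRAME_SIZES) -> int:
    """Number of tier k-1 time steps covered by one tier k output (learned upsampling factor)."""
    return update_period(k, frame_sizes) // update_period(k - 1, frame_sizes)


if __name__ == "__main__":
    for i in (0, 4, 5, 16, 64):
        print(i, tiers_updating(i))
```

With these assumed sizes, tier 4 runs once every 64 samples (250 times per second at 16 kHz), while only the Tier 1 MLP runs at the full sample rate.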
Training conditional SampleRNN
[Figure: 16 kHz speech → Vocoder Analysis [1] → conditioning h_t → SampleRNN. SampleRNN is trained on the same 16 kHz speech by minimizing E{-log p(x)}.]
Conditioning info: LPC filter, RMS level of LPC residual, pitch, voicing level
[1] Per Hedelin, “A sinusoidal LPC vocoder,” in 2000 IEEE Workshop on Speech Coding. Proceedings. Meeting the Challenges of the New Millennium (Cat. No.00EX421), Sept 2000, pp. 2–4
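The training objective E{-log p(x)} is, in practice, estimated as the average negative log-probability the model assigns to each true sample. The sketch below assumes the waveform is mu-law companded to 256 levels (a choice WaveNet makes; the talk does not say how this system quantizes samples) and that the network has already produced one categorical distribution per sample, conditioned on past samples and h_t. The function names are hypothetical.

```python
import math
from typing import Sequence


def mu_law_encode(x: float, mu: int = 255) -> int:
    """Map a sample in [-1, 1] to one of mu + 1 integer levels (assumed mu-law companding)."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * mu))


def mean_nll(targets: Sequence[int], dists: Sequence[Sequence[float]]) -> float:
    """Estimate E{-log p(x)} in nats by averaging -log p(x_i | x_<i, h_t) over samples.

    dists[i] is the (hypothetical) model output: probabilities over all levels for sample i.
    """
    return -sum(math.log(p[t]) for t, p in zip(targets, dists)) / len(targets)


if __name__ == "__main__":
    targets = [mu_law_encode(v) for v in (0.0, 0.5, -0.5)]
    uniform = [[1.0 / 256] * 256 for _ in targets]
    print(targets, mean_nll(targets, uniform) / math.log(2), "bits/sample")
```

Dividing the result by ln 2 gives bits per sample; a uniform guess over 256 levels costs exactly 8 bits, so a trained model should score well below that.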
Coding Scheme
[Figure: Encoder: 16 kHz speech → Vocoder Analysis → quantized vocoder features → Entropy Coding → bitstream. Decoder: bitstream → Entropy Decoding → conditioning h_t → SampleRNN → 16 kHz speech.]
High-quality speech coding with SampleRNN. Janusz Klejsa, Per Hedelin, Cong Zhou, Roy Fejgin, Lars Villemoes, ICASSP 2019
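The encoder path quantizes the vocoder features and entropy codes them. The talk does not specify the quantizer or the entropy coder, so the Python sketch below is a generic illustration with hypothetical names: a uniform scalar quantizer, the empirical entropy of the resulting indices (the average rate an ideal entropy coder, such as an arithmetic coder, could approach for a memoryless source with that distribution), and a bits-per-frame to kbps conversion.

```python
import math
from collections import Counter
from typing import List, Sequence


def quantize(values: Sequence[float], step: float) -> List[int]:
    """Uniform scalar quantization to integer indices (hypothetical quantizer)."""
    return [round(v / step) for v in values]


def dequantize(indices: Sequence[int], step: float) -> List[float]:
    """Decoder-side reconstruction of the feature values."""
    return [i * step for i in indices]


def empirical_entropy_bits(symbols: Sequence[int]) -> float:
    """Average bits/symbol an ideal entropy coder would approach for this symbol distribution."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def bitrate_kbps(bits_per_frame: float, frame_ms: float) -> float:
    """Convert bits per frame to kilobits per second."""
    return bits_per_frame * (1000.0 / frame_ms) / 1000.0


if __name__ == "__main__":
    idx = quantize([0.26, -0.74, 0.1, 0.9], 0.5)
    print(idx, dequantize(idx, 0.5), empirical_entropy_bits(idx))
```

For example, if frames were 20 ms long (an assumption, not a figure from the talk), the 6.4 kbps operating point would correspond to 128 bits per frame: `bitrate_kbps(128, 20)` returns 6.4.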
[Figure: MUSHRA listening-test scores on a Bad/Poor/Fair/Good/Excellent scale, for bitrates of 23.05 kbps, 16 kbps and 6.4 kbps.]
Demo
• Reference (original)
• AMR-WB (23.05 kbps)
• SILK (16 kbps)
• SampleRNN-based coder (6.4 kbps)
Future directions
• Robustness
• Low complexity
Recent related work
"WaveNet based low rate speech coding." Kleijn, W. Bastiaan, et al. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
"LPCNet: Improving neural speech synthesis through linear prediction." Valin, Jean-Marc, and Jan Skoglund. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
"Low bit-rate speech coding with VQ-VAE and a WaveNet decoder." Gârbacea, Cristina, et al. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
"A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet." Valin, Jean-Marc, and Jan Skoglund. arXiv preprint arXiv:1903.12087 (2019).
"GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram." Juvela, Lauri, et al. arXiv preprint arXiv:1904.03976 (2019).