Cross-Modal Mutual Learning
for Audio-Visual Speech Recognition
and Manipulation
Sinhgad Logo
Name of Student(s)    Ex. No. 1
Guided by -
Computer Engineering Department
RMDSSOE, Warje
Introduction
• Humans have a remarkable auditory system that can perceive individual sound sources in a conversation even in the presence of many surrounding sounds, including background noise, crowded babbling, thumping music, and sometimes other loud voices. However, reliably separating a target speech signal for human-computer interaction (HCI) systems such as speech recognition, speaker recognition, and emotion recognition is still a challenging task because it is an ill-posed problem.
Motivation
It can be seen that, for both audio-visual speech recognition and synthesis, one needs to extract representative features from cross-modality (i.e., audio vs. visual) input data. While extracting a linguistic representation is necessary to realize the task of AVSR, modality-preserving information such as subject identity needs to be retained for data recovery/synthesis purposes. With the advent of deep learning technologies that utilize high-dimensional embeddings, it is now possible to simultaneously analyze the unique acoustic characteristics of different speakers even from mixed signals. Although these deep learning-based methods are effective compared to conventional statistical signal processing-based ones, they are prone to label permutation (or ambiguity) errors due to their frame-by-frame or short-segment-based processing paradigm (see the sketch below).
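The label permutation issue mentioned above is commonly handled with permutation-invariant training (PIT): the loss scores every possible speaker ordering and keeps the best one. The following is a minimal NumPy sketch for illustration only; it is not part of the deck's method, and the function name and toy signals are invented here.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Permutation-invariant MSE: evaluate every speaker ordering and
    keep the best one, so training is not penalized when the model
    outputs speakers in a different order than the labels."""
    n_src = len(targets)
    best = np.inf
    for perm in itertools.permutations(range(n_src)):
        loss = np.mean([np.mean((estimates[p] - targets[t]) ** 2)
                        for t, p in enumerate(perm)])
        best = min(best, loss)
    return best

# Toy example: two estimated sources that are swapped w.r.t. the labels.
t1, t2 = np.sin(np.linspace(0, 1, 100)), np.cos(np.linspace(0, 1, 100))
print(pit_mse([t2, t1], [t1, t2]))  # ~0.0, despite the swapped order
```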
Project Scope
• Audio-visual speech recognition (AVSR) is the task of performing speech recognition with the aid of observed visual information (e.g., lip motion); a toy fusion sketch follows.
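To make the inputs and outputs of AVSR concrete, the sketch below fuses an audio stream and a lip-motion stream with a simple late-fusion classifier. This is a toy PyTorch example under assumed dimensions and module names, not the architecture proposed in this work.

```python
import torch
import torch.nn as nn

class LateFusionAVSR(nn.Module):
    """Toy AVSR classifier: encode the audio and lip-motion streams
    separately, then fuse them for word-level prediction."""
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, n_words=500):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_words)

    def forward(self, audio, video):
        # audio: (B, T_a, audio_dim), video: (B, T_v, video_dim)
        _, h_a = self.audio_enc(audio)   # final hidden state per stream
        _, h_v = self.video_enc(video)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.classifier(fused)    # word logits

model = LateFusionAVSR()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512))
print(logits.shape)  # torch.Size([2, 500])
```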
Objectives & Goals
• We propose a unified learning framework which can be applied to jointly address the tasks of audio-visual speech recognition and manipulation (i.e., intra- and cross-modality synthesis), as depicted.
• We advance feature disentanglement learning strategies, followed by a linguistic module that extracts and transfers knowledge across modalities via cross-modal mutual learning (a toy sketch of one such mutual-learning loss follows).
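One common way to realize cross-modal mutual learning is mutual distillation: each modality's branch is encouraged to match the other's posterior over the linguistic targets. The PyTorch sketch below shows such a symmetric KL loss; it is illustrative only and does not reproduce the exact losses of the proposed framework.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(audio_logits, video_logits):
    """Symmetric KL between the audio and visual branches' word
    posteriors, so each modality distills knowledge into the other
    (one common realization of cross-modal mutual learning)."""
    log_p_a = F.log_softmax(audio_logits, dim=-1)
    log_p_v = F.log_softmax(video_logits, dim=-1)
    kl_av = F.kl_div(log_p_v, log_p_a.exp(), reduction="batchmean")
    kl_va = F.kl_div(log_p_a, log_p_v.exp(), reduction="batchmean")
    return 0.5 * (kl_av + kl_va)

# Toy usage: logits over a 500-word vocabulary from each branch.
audio_logits = torch.randn(4, 500)
video_logits = torch.randn(4, 500)
print(mutual_learning_loss(audio_logits, video_logits))
```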
Literature Survey
Audio-visual Speech Separation. In terms of multisensory integration, it has been shown that looking at talking faces during conversation helps speech perception. Related cross-modal tasks include audio-visual lip synchronization (Prajwal et al. 2020b), automatic voice acting (Prajwal et al. 2020a), voice conversion (Ding and Gutierrez-Osuna 2019), and audio-visual speech separation (Gao and Grauman 2021). However, most existing works focus on only one or a few selected tasks. For such cross-modality learning tasks, it would be desirable to advance multi-task learning strategies that utilize inputs across modalities to solve these diverse yet related learning tasks. To extract linguistic features from the input data, techniques such as adversarial training, vector quantization (VQ), and instance normalization (IN) (Chou, Yeh, and Lee 2019; Ding and Gutierrez-Osuna 2019; van den Oord, Vinyals, and Kavukcuoglu 2017; Zhou et al. 2019) have been proposed (an IN sketch follows below). However, previous studies (Ding and Gutierrez-Osuna 2019; Zhang, Song, and Qi 2018) suggest that such techniques might suffer from training instability or degraded synthesis quality due to the design of the information bottleneck.
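For reference, the instance-normalization bottleneck mentioned above can be sketched in a few lines: per-channel statistics over time (which tend to carry speaker/style information) are removed, leaving a more speaker-invariant content representation. This is a generic illustration, not code from any of the cited works.

```python
import torch

def instance_norm_content(features, eps=1e-5):
    """Remove per-utterance (speaker/style) statistics from a
    (B, C, T) feature map by normalizing each channel over time,
    a common information-bottleneck trick for keeping linguistic
    content while discarding speaker characteristics."""
    mean = features.mean(dim=-1, keepdim=True)
    std = features.std(dim=-1, keepdim=True)
    return (features - mean) / (std + eps)

feats = torch.randn(2, 64, 100)          # toy encoder output
content = instance_norm_content(feats)
print(content.mean(dim=-1).abs().max())  # ~0: per-channel statistics removed
```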
Problems Identified in the Existing Work
• Audio/Visual Speech Recognition: Previous studies have shown remarkable performance on audio speech recognition (ASR), while visual speech recognition (VSR) remains a more challenging task due to the variety and ambiguity of lip movements across speakers, particularly for word-level speech recognition.
Feasibility and Scope
• By extending the proposed cross-modal affinity to the complex-valued network, we further improve separation performance in the complex spectral domain (see the masking sketch below). Experimental results verify that the proposed methods outperform conventional ones on various datasets, demonstrating their advantages in real-world scenarios.
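To make "separation in the complex spectral domain" concrete, the sketch below applies a complex ratio mask to a mixture STFT, so both the magnitude and phase of the target speaker are estimated rather than magnitude alone. It is a hypothetical NumPy illustration, not the cited model.

```python
import numpy as np

def apply_complex_mask(mixture_stft, mask_real, mask_imag):
    """Apply a complex ratio mask to a mixture STFT so that both
    magnitude and phase of the target speaker are estimated, which
    is what complex-domain separation offers over magnitude masking."""
    mask = mask_real + 1j * mask_imag
    return mask * mixture_stft  # complex multiplication per T-F bin

# Toy example on a random "spectrogram" of shape (freq, frames).
mix = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
est = apply_complex_mask(mix, np.ones((257, 100)), np.zeros((257, 100)))
print(np.allclose(est, mix))  # identity mask returns the mixture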
Problem Statement
• We address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches utilize framewise matching criteria to extract shared information between co-occurring audio and video (see the affinity sketch below). Thus, their performance heavily depends on the accuracy of audio-visual synchronization and the effectiveness of their representations.
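A framewise matching criterion of the kind described above can be pictured as a cosine-similarity (affinity) matrix between per-frame audio and video embeddings; misaligned streams directly corrupt this matrix, which is why synchronization accuracy matters. The PyTorch sketch below is illustrative; the embedding dimensions and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_affinity(audio_emb, video_emb):
    """Framewise affinity between audio and video embeddings:
    a (T_a, T_v) cosine-similarity matrix, the kind of matching
    criterion whose quality depends on A/V synchronization."""
    a = F.normalize(audio_emb, dim=-1)   # (T_a, D)
    v = F.normalize(video_emb, dim=-1)   # (T_v, D)
    return a @ v.t()                     # (T_a, T_v) similarities in [-1, 1]

affinity = cross_modal_affinity(torch.randn(100, 256), torch.randn(25, 256))
print(affinity.shape)  # torch.Size([100, 25])
```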
References
[1] Simon Haykin and Zhe Chen. The cocktail party problem. Neural computation,
17(9):1875–1902, 2005.
[2] Albert S Bregman. Auditory scene analysis: The perceptual organization of sound.
MIT press, 1994.
[3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed,
Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et
al. Deep neural networks for acoustic modeling in speech recognition: The shared views
of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[4] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua
Bengio. End-to-end attention-based large vocabulary speech recognition. In: ICASSP,
2016.
[5] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and
spell: A neural network for large vocabulary conversational speech recognition. In:
ICASSP, 2016.
[6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier
Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker
verification. In: ICASSP, 2014.
[7] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev
Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In: ICASSP,
2018.
Base Paper
