This paper proposes a unified learning framework that jointly addresses audio-visual speech recognition and manipulation tasks through cross-modal mutual learning. The framework disentangles linguistic content from identity information in both audio and visual inputs, and a shared linguistic module transfers knowledge across modalities via cross-modal learning. The goal is to recognize speech with the aid of visual cues such as lip movements, while preserving identity information for data recovery and synthesis tasks.
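To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of such a disentangled audio-visual model: each modality encoder splits its features into a linguistic (content) branch and an identity branch, and both content streams pass through a shared linguistic module before recognition. All module names, dimensions, and the simple additive fusion are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the disentangled audio-visual framework described above.
# Names, dimensions, and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality and splits it into content and identity features."""

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.content_head = nn.Linear(hidden_dim, hidden_dim)    # linguistic features
        self.identity_head = nn.Linear(hidden_dim, hidden_dim)   # identity features

    def forward(self, x: torch.Tensor):
        feats, _ = self.backbone(x)                    # (B, T, H)
        content = self.content_head(feats)             # frame-level linguistic features
        identity = self.identity_head(feats.mean(1))   # utterance-level identity vector
        return content, identity


class CrossModalAVModel(nn.Module):
    """Audio and visual streams share a linguistic module for cross-modal learning."""

    def __init__(self, audio_dim: int = 80, visual_dim: int = 512,
                 hidden_dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.audio_encoder = ModalityEncoder(audio_dim, hidden_dim)
        self.visual_encoder = ModalityEncoder(visual_dim, hidden_dim)
        # Shared linguistic module: both modalities' content features pass through it,
        # encouraging aligned linguistic representations across modalities.
        self.linguistic_module = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.recognition_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a_content, a_identity = self.audio_encoder(audio)
        v_content, v_identity = self.visual_encoder(video)
        a_ling = self.linguistic_module(a_content)
        v_ling = self.linguistic_module(v_content)
        # Recognition uses fused linguistic features; identity vectors are kept
        # separate so they remain available for recovery/synthesis tasks.
        logits = self.recognition_head(a_ling + v_ling)
        return logits, (a_identity, v_identity)


if __name__ == "__main__":
    model = CrossModalAVModel()
    audio = torch.randn(2, 100, 80)    # e.g. log-mel frames
    video = torch.randn(2, 100, 512)   # e.g. lip-region embeddings per frame
    logits, identities = model(audio, video)
    print(logits.shape)  # (2, 100, 1000)
```

In this sketch, keeping the identity vectors outside the shared linguistic module is what lets a single framework serve both recognition (content-only) and manipulation or synthesis (content plus identity), in the spirit of the disentanglement the paper describes.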