This paper proposes a unified learning framework that jointly addresses audio-visual speech recognition and manipulation tasks through cross-modal mutual learning. The framework disentangles linguistic content from identity information in both audio and visual inputs, and a shared linguistic module transfers knowledge across modalities via cross-modal learning. The goal is to recognize speech with the aid of visual cues such as lip movements, while preserving identity information for data recovery and synthesis tasks.
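To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of such a disentangled audio-visual model: each modality encoder splits its features into a linguistic (content) branch and an identity branch, and both content streams pass through a shared linguistic module before recognition. All module names, dimensions, and the simple additive fusion are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the disentangled audio-visual framework described above.
# Names, dimensions, and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality and splits it into content and identity features."""

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.content_head = nn.Linear(hidden_dim, hidden_dim)    # linguistic features
        self.identity_head = nn.Linear(hidden_dim, hidden_dim)   # identity features

    def forward(self, x: torch.Tensor):
        feats, _ = self.backbone(x)                    # (B, T, H)
        content = self.content_head(feats)             # frame-level linguistic features
        identity = self.identity_head(feats.mean(1))   # utterance-level identity vector
        return content, identity


class CrossModalAVModel(nn.Module):
    """Audio and visual streams share a linguistic module for cross-modal learning."""

    def __init__(self, audio_dim: int = 80, visual_dim: int = 512,
                 hidden_dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.audio_encoder = ModalityEncoder(audio_dim, hidden_dim)
        self.visual_encoder = ModalityEncoder(visual_dim, hidden_dim)
        # Shared linguistic module: both modalities' content features pass through it,
        # encouraging aligned linguistic representations across modalities.
        self.linguistic_module = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.recognition_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a_content, a_identity = self.audio_encoder(audio)
        v_content, v_identity = self.visual_encoder(video)
        a_ling = self.linguistic_module(a_content)
        v_ling = self.linguistic_module(v_content)
        # Recognition uses fused linguistic features; identity vectors are kept
        # separate so they remain available for recovery/synthesis tasks.
        logits = self.recognition_head(a_ling + v_ling)
        return logits, (a_identity, v_identity)


if __name__ == "__main__":
    model = CrossModalAVModel()
    audio = torch.randn(2, 100, 80)    # e.g. log-mel frames
    video = torch.randn(2, 100, 512)   # e.g. lip-region embeddings per frame
    logits, identities = model(audio, video)
    print(logits.shape)  # (2, 100, 1000)
```

In this sketch, keeping the identity vectors outside the shared linguistic module is what lets a single framework serve both recognition (content-only) and manipulation or synthesis (content plus identity), in the spirit of the disentanglement the paper describes.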