The document describes a multilingual speech-to-text system, built with Hugging Face tooling, that is intended to assist deaf individuals. The system uses a Transformer-based encoder-decoder model trained on audio datasets. Preprocessing consists of feature extraction on the audio inputs and tokenization. The model is fine-tuned with Hugging Face and evaluated by word error rate (WER). A web application lets users record speech through a microphone and view the resulting transcription. The implemented model achieved a WER of 32.42% on a Hindi-language dataset. The goal is to enable seamless communication across multiple languages for people with hearing impairments.
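Since the system is evaluated by word error rate, a small sketch of how that metric is commonly defined may help in reading the 32.42% figure. WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of words in the reference. The function name `word_error_rate` and the sample sentences below are illustrative assumptions. The source does not say which implementation or text normalization was used, so this should not be read as the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only: it splits on whitespace
    and applies no normalization (casing, punctuation), which real
    evaluation pipelines often add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the")
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # prints 33.33%
```

Under this definition, a WER of 32.42% means roughly one word-level error for every three reference words. Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.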