Five years before the release of ChatGPT, the field of Machine Translation (MT) was dominated by unimodal AI systems, generally bilingual or multilingual models limited to the text modality. The era of Large Language Models (LLMs) brought various multimodal translation initiatives combining text and image modalities, built on custom data engineering techniques and raising expectations of improvement in MT through multimodality. In our work, we introduced a first-of-its-kind AI multimodal translation system with four modalities (text, image, audio, and video), from English to a low-resource language and vice versa. Our results confirmed that multimodal translation generalizes better, consistently improves on unimodal text translation, and delivers superior performance as the number of unseen samples increases. Moreover, this initiative offers hope for low-resource languages worldwide, for which non-text modalities provide a valuable solution to data scarcity in the field.