This document summarizes a paper on multimodal emotion recognition from speech, text, and video. The paper argues that combining multiple modalities provides richer information than any single modality alone, describes the IEMOCAP and CMU-MOSEI datasets, and compares the modalities each one offers. It reviews early and late fusion as the two standard strategies for combining modalities (sketched below). The proposed solution filters out ineffective data, regenerates proxy features in its place, and applies multiplicative fusion so that stronger, more reliable modalities carry more weight in the prediction (see the second sketch below). The approach is evaluated on CMU-MOSEI using speech, text, and video features, and the paper discusses its limitations in distinguishing certain emotions.
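
The summary does not spell out the fusion architectures, so the following is a minimal PyTorch sketch of the two standard strategies, assuming fixed-size utterance-level feature vectors per modality; the feature dimensions, hidden size, and emotion count are illustrative assumptions, not the paper's actual configuration. Early fusion concatenates the raw features before a single classifier; late fusion trains one classifier per modality and combines their outputs.

```python
# Minimal sketch of early vs. late fusion; all sizes below are assumptions
# chosen for illustration, not taken from the paper.
import torch
import torch.nn as nn

SPEECH_DIM, TEXT_DIM, VIDEO_DIM, NUM_EMOTIONS = 74, 300, 35, 6

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then classify jointly."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(SPEECH_DIM + TEXT_DIM + VIDEO_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_EMOTIONS),
        )

    def forward(self, speech, text, video):
        # Fusion happens at the feature level, before any prediction.
        return self.classifier(torch.cat([speech, text, video], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then average the logits."""
    def __init__(self):
        super().__init__()
        self.speech_head = nn.Linear(SPEECH_DIM, NUM_EMOTIONS)
        self.text_head = nn.Linear(TEXT_DIM, NUM_EMOTIONS)
        self.video_head = nn.Linear(VIDEO_DIM, NUM_EMOTIONS)

    def forward(self, speech, text, video):
        # Fusion happens at the decision level, after per-modality predictions.
        return (self.speech_head(speech)
                + self.text_head(text)
                + self.video_head(video)) / 3
```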
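
Multiplicative fusion can take several forms; one common form combines per-modality class probabilities as a product, so confident, mutually agreeing modalities dominate the result while an uncertain modality contributes little. The sketch below implements that product form in log space; the exponent `beta` is a hypothetical knob for softening each modality's vote, and nothing here should be read as the paper's exact formulation.

```python
# Hedged sketch of multiplicative fusion over per-modality logits.
# The product form and the `beta` exponent are assumptions for illustration.
import torch

def multiplicative_fusion(logits_per_modality, beta=0.5):
    """Fuse per-modality logits as a product of probabilities, p ~ prod_m p_m^beta.

    Working in log space, the product becomes a sum of log-probabilities.
    A near-uniform (uncertain) modality contributes an almost-constant factor,
    so confident modalities dominate the fused prediction.
    """
    log_probs = [torch.log_softmax(logits, dim=-1) for logits in logits_per_modality]
    fused = beta * torch.stack(log_probs).sum(dim=0)
    return torch.softmax(fused, dim=-1)

# Example: fuse predictions for a batch of 4 utterances over 6 emotion classes.
speech_logits = torch.randn(4, 6)
text_logits = torch.randn(4, 6)
video_logits = torch.randn(4, 6)
fused_probs = multiplicative_fusion([speech_logits, text_logits, video_logits])
```

Unlike the logit averaging in the late-fusion sketch, which weighs all modalities equally, the product form rewards agreement and confidence, which is consistent with the summary's claim that multiplicative fusion boosts the influence of stronger modalities.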