This document proposes a multimodal ensemble model for detecting unreliable information on Vietnamese social media. Text, image, and metadata features are fed to three deep learning models: a BERT+CNN model and two variants with additional CNN layers. An attention mechanism learns which parts of the image to focus on for each text. The ensemble averages the prediction probabilities of the three models. On a private test set, the ensemble achieves an AUC of 0.945, outperforming each of the individual models. Future work could compare posts against external sources to find evidence that a post is fake.
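
The averaging step and the AUC metric mentioned above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `average_ensemble` and `pairwise_auc`, the probability values, and the labels are all hypothetical, and it assumes each model outputs one probability per post that the post is unreliable.

```python
from typing import List, Sequence


def average_ensemble(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-post probabilities across models (unweighted mean)."""
    if not model_probs:
        raise ValueError("need at least one model")
    n_posts = len(model_probs[0])
    if any(len(p) != n_posts for p in model_probs):
        raise ValueError("all models must score the same posts")
    n_models = len(model_probs)
    return [sum(p[i] for p in model_probs) / n_models for i in range(n_posts)]


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of posts,
    which is fine for an illustration but not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical outputs of three models on three posts (not data from the paper).
probs_a = [0.9, 0.2, 0.6]
probs_b = [0.8, 0.4, 0.3]
probs_c = [0.7, 0.3, 0.6]
labels = [1, 0, 1]  # hypothetical ground truth: 1 = unreliable, 0 = reliable

ensemble = average_ensemble([probs_a, probs_b, probs_c])
ensemble_auc = pairwise_auc(labels, ensemble)
```

An unweighted mean treats the three models as equally trustworthy. A weighted average or a learned combiner would be alternative design choices, but the summary describes only simple averaging.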