This document discusses combining long short-term memory (LSTM) networks and convolutional neural networks (CNNs) for visual question answering (VQA). Image features are extracted with a pre-trained CNN, question semantics are encoded with an LSTM over word embeddings, and a multilayer perceptron combines the two representations to predict an answer. By grounding predictions in the relevant image content rather than the question text alone, the methodology aims to reduce the effect of statistical biases in VQA datasets. The approach was implemented in Keras with TensorFlow, using pre-trained CNNs for images and word embeddings for questions, and it analyzes local image features together with question semantics to improve VQA classification accuracy over methods that rely solely on language.
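The following is a minimal sketch of such a CNN + LSTM fusion model in Keras with TensorFlow. It assumes pre-extracted fixed-length image features (e.g., from a VGG-style network), a question-word vocabulary of 10,000 tokens, 1,000 candidate answers treated as classes, and simple concatenation as the fusion step; these sizes and choices are illustrative assumptions, not the exact configuration described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000   # question-word vocabulary size (assumed)
MAX_Q_LEN = 26        # maximum question length in tokens (assumed)
IMG_FEAT_DIM = 4096   # dimensionality of pre-extracted CNN features (assumed)
NUM_ANSWERS = 1000    # most frequent answers treated as classes (assumed)

# Question branch: word embeddings followed by an LSTM encoder.
q_in = layers.Input(shape=(MAX_Q_LEN,), dtype="int32", name="question_tokens")
q = layers.Embedding(VOCAB_SIZE, 300, mask_zero=True)(q_in)
q = layers.LSTM(512)(q)

# Image branch: pre-extracted CNN features projected to the same width.
img_in = layers.Input(shape=(IMG_FEAT_DIM,), name="image_features")
v = layers.Dense(512, activation="tanh")(img_in)

# Fuse the two modalities and classify the answer with a small MLP.
fused = layers.Concatenate()([q, v])
h = layers.Dense(1024, activation="relu")(fused)
h = layers.Dropout(0.5)(h)
out = layers.Dense(NUM_ANSWERS, activation="softmax", name="answer")(h)

model = Model(inputs=[q_in, img_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In this sketch the answer prediction is framed as classification over a fixed answer vocabulary, which matches the document's framing of VQA as a classification task; a richer fusion (e.g., attention over spatial CNN feature maps) could replace the simple concatenation if region-level attention is required.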