This document presents a multimodal emotion recognition system that fuses audio and visual cues with machine learning. From each video clip it extracts audio features such as mel-frequency cepstral coefficients (MFCCs) and visual features such as facial landmarks. Modality-specific classifiers, including CNNs, are trained on these features, and their per-class confidence outputs are combined via stacking to predict the final emotion label. The reported experiments show state-of-the-art performance on several benchmark databases. The system improves on previous work by combining audio features, visual features, and classifier-level fusion in a single multimodal pipeline; illustrative sketches of each stage follow.
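To make the audio front end concrete, the following is a minimal sketch of MFCC extraction using librosa. The file path, the 13-coefficient setting, and the mean/std clip-level summary are illustrative assumptions; the document does not specify the exact configuration.

```python
# A minimal sketch of MFCC extraction with librosa. The file name and
# n_mfcc=13 are illustrative assumptions, not values from the document.
import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an audio track and return a clip-level MFCC feature vector."""
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Summarize the clip with per-coefficient mean and std, a common
    # fixed-length representation when frame counts vary across clips.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Example usage (the path is hypothetical):
# features = extract_mfcc("clip_001.wav")
# print(features.shape)  # (26,) for 13 means + 13 stds
```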
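For the visual stream, the sketch below extracts facial landmarks from a video frame. The document states only that facial landmarks are used; the choice of MediaPipe Face Mesh here is an assumption for illustration, as is the flattened (x, y, z) representation.

```python
# A minimal sketch of facial-landmark extraction with MediaPipe Face Mesh.
# The library choice is an assumption; the document does not say how
# landmarks are obtained.
import numpy as np
import cv2
import mediapipe as mp

def extract_landmarks(frame_bgr: np.ndarray) -> np.ndarray | None:
    """Return a flat (x, y, z) landmark vector for the first detected face."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as face_mesh:
        results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None  # no face detected in this frame
    lm = results.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm]).ravel()

# Example usage (the video path is hypothetical):
# ok, frame = cv2.VideoCapture("clip_001.mp4").read()
# vec = extract_landmarks(frame) if ok else None
```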
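Finally, a sketch of the confidence-level fusion step, assuming one base classifier per modality. The random-forest base models, logistic-regression meta-learner, synthetic data, and six-class label set are stand-ins for illustration; in the system described here, the CNN classifiers would supply the per-class confidences that are stacked.

```python
# A minimal sketch of stacked confidence fusion. Base models, meta-learner,
# and data below are illustrative assumptions, not the document's setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 26))   # e.g., clip-level MFCC statistics
X_visual = rng.normal(size=(200, 40))  # e.g., flattened landmark features
y = rng.integers(0, 6, size=200)       # hypothetical six emotion classes

# One base classifier per modality (stand-ins for the CNNs).
audio_clf = RandomForestClassifier(random_state=0).fit(X_audio, y)
visual_clf = RandomForestClassifier(random_state=0).fit(X_visual, y)

# Stack the per-class confidence outputs into one meta-feature vector.
conf = np.hstack([audio_clf.predict_proba(X_audio),
                  visual_clf.predict_proba(X_visual)])

# A meta-classifier maps stacked confidences to the final emotion label.
# In practice the meta-learner should be trained on held-out or
# cross-validated confidences to avoid leakage from the base models.
meta = LogisticRegression(max_iter=1000).fit(conf, y)
pred = meta.predict(conf)
```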