This document summarizes an approach to no-audio multimodal speech detection using CNNs and Fisher Vectors. Dense trajectories and Fisher Vectors were used to extract features from video, with 1-second windows labeled as speaking or not speaking. A 1D CNN was applied to acceleration data using 3-second windows, also labeled at 1-second intervals. Preliminary experiments showed that the CNN outperformed logistic regression and was competitive with LSTMs. While the CNN benefited from the added temporal context, the 1-second windows proved too short for accurate video predictions. Opportunities for future work included exploring optimal context sizes and detecting more subtle speech cues.
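The windowing scheme described above (3-second accelerometer windows producing one label per second) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the 20 Hz sampling rate, the 3-axis signal, and the choice to label each window by its centre second are all assumptions made for the example.

```python
import numpy as np

def make_windows(signal, labels_per_sec, fs=20, win_sec=3, hop_sec=1):
    """Slice a 1D accelerometer stream into overlapping windows.

    signal: array of shape (n_samples, n_axes).
    labels_per_sec: one binary speaking/not-speaking label per second.
    Each 3-second window inherits the label of its centre second
    (an assumption for illustration), so predictions land on a
    1-second grid as in the summary above.
    """
    win = win_sec * fs   # samples per window
    hop = hop_sec * fs   # samples between consecutive window starts
    windows, targets = [], []
    for start in range(0, len(signal) - win + 1, hop):
        windows.append(signal[start:start + win])
        centre_sec = (start + win // 2) // fs
        targets.append(labels_per_sec[centre_sec])
    return np.stack(windows), np.array(targets)

# Hypothetical data: 10 s of 3-axis acceleration at 20 Hz,
# with one speaking/not-speaking label per second.
fs = 20
signal = np.random.randn(10 * fs, 3)
labels = np.random.randint(0, 2, size=10)
X, y = make_windows(signal, labels, fs=fs)
print(X.shape)  # (8, 60, 3): 8 overlapping windows of 60 samples x 3 axes
```

The resulting `X` can be fed directly to a 1D CNN, with `y` as the per-second targets; the overlap between consecutive windows is what gives the network its extra temporal context relative to the 1-second video windows.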