This master's thesis proposes a recurrent architecture for instance segmentation in images and videos guided by linguistic referring expressions. The model encodes referring expressions with BERT and visual features with an RVOS encoder, then decodes segmentation masks through a recurrent fusion of the language and vision representations. Experiments on the RefCOCO dataset show that the model outperforms its baselines when referring expressions are incorporated. The architecture is extended to video by adding temporal recurrence to an MAttNet+RVOS baseline, with promising initial results on the DAVIS dataset.
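The recurrent fusion of language and vision described above can be illustrated with a toy GRU-style update in which each step combines a visual feature vector with a language embedding to refresh a hidden state. This is only a minimal sketch of the general idea, not the thesis model: the `fuse_step` function, the flat (non-spatial) features, and all dimensions and weights are hypothetical stand-ins for the actual BERT and RVOS encoder outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_step(h_prev, visual_feat, lang_emb, params):
    # Hypothetical GRU-style fusion step: the hidden state is updated
    # from the concatenation of visual features and a language embedding.
    x = np.concatenate([visual_feat, lang_emb])
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)          # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)          # reset gate
    h_tilde = np.tanh(params["Wh"] @ x + params["Uh"] @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

# Toy dimensions and random weights (illustrative only).
rng = np.random.default_rng(0)
d_vis, d_lang, d_h = 8, 4, 6
params = {
    "Wz": rng.normal(size=(d_h, d_vis + d_lang)), "Uz": rng.normal(size=(d_h, d_h)),
    "Wr": rng.normal(size=(d_h, d_vis + d_lang)), "Ur": rng.normal(size=(d_h, d_h)),
    "Wh": rng.normal(size=(d_h, d_vis + d_lang)), "Uh": rng.normal(size=(d_h, d_h)),
}

h = np.zeros(d_h)
for t in range(3):                        # e.g. three video frames
    visual = rng.normal(size=d_vis)       # stand-in for frame-t encoder features
    lang = rng.normal(size=d_lang)        # stand-in for a BERT sentence embedding
    h = fuse_step(h, visual, lang, params)
print(h.shape)
```

In the actual model the hidden state would be spatial (a feature map decoded into a mask per frame), but the same gated-update pattern lets information from earlier frames and the referring expression persist across the sequence.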