This document summarizes an approach to end-to-end convolutional semantic embeddings for relating images and text. The model pairs a textual network of convolutional and recurrent layers, which produces sentence embeddings, with a visual network of convolutional layers, which produces image embeddings. The two networks are trained end-to-end using both global and local losses, so that matched image-sentence pairs lie close together in a joint embedding space. The model achieves state-of-the-art results on bidirectional retrieval, i.e. retrieving sentences given an image and retrieving images given a sentence, on the MS-COCO and Flickr30K datasets.
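To make the two-branch architecture concrete, here is a minimal PyTorch sketch of the idea: a textual encoder (1-D convolution followed by a GRU over word embeddings), a visual encoder (a small CNN standing in for a deep backbone such as ResNet), and a global bidirectional ranking loss over the joint embedding space. All names, layer sizes, and the margin value are illustrative assumptions, not the paper's exact configuration, and the paper's additional local losses on intermediate features are omitted; only the global loss is shown.

```python
# A minimal sketch of a two-branch image-text embedding model with a
# global ranking loss. Architecture details here are assumptions for
# illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Convolutional + recurrent sentence encoder -> joint embedding."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden=512, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.conv = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, tokens):                        # tokens: (B, T) int64
        x = self.embed(tokens).transpose(1, 2)        # (B, word_dim, T)
        x = F.relu(self.conv(x)).transpose(1, 2)      # (B, T, hidden)
        _, h = self.rnn(x)                            # h: (1, B, hidden)
        return F.normalize(self.fc(h.squeeze(0)), dim=1)  # unit-norm embedding

class ImageEncoder(nn.Module):
    """Tiny convolutional image encoder -> joint embedding.
    (A real model would use a deep pretrained backbone.)"""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.conv(images).flatten(1)              # (B, 128)
        return F.normalize(self.fc(x), dim=1)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge loss over cosine similarities: matched
    image-sentence pairs must score higher than mismatched ones."""
    scores = img_emb @ txt_emb.t()                    # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)                   # matched-pair scores
    cost_s = (margin + scores - pos).clamp(min=0)     # image -> sentence
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # sentence -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s[mask], cost_im[mask] = 0, 0                # ignore the diagonal
    return cost_s.sum() + cost_im.sum()

# Toy end-to-end forward/backward pass on random data.
imgs = torch.randn(8, 3, 64, 64)
toks = torch.randint(0, 10000, (8, 20))
loss = triplet_ranking_loss(ImageEncoder()(imgs), TextEncoder()(toks))
loss.backward()
```

At retrieval time, one would rank all candidate sentences (or images) by their cosine similarity to the query embedding; because both branches map into the same normalized space, the matrix product above doubles as the retrieval score table.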