The document summarizes several papers on image captioning using neural networks. It describes Karpathy and Fei-Fei (2015), which obtains image and sentence embeddings, trains them to align image regions with words, and uses an RNN to map image fragments to phrases. It also discusses Vinyals et al. (2015) and Mao et al. (2015), noting common themes of joint embedding spaces and differences in image processing, inputs to the RNNs, and the types of recurrence used. Ranking-based evaluation metrics such as R@K (recall at K) and Med r (median rank of the ground-truth match) are also introduced.
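As a concrete illustration of those ranking metrics, the following is a minimal sketch (not taken from any of the papers) that computes R@K and Med r for image-to-sentence retrieval. It assumes a hypothetical score matrix `scores` where `scores[i, j]` is the model's similarity between image `i` and sentence `j`, and that sentence `i` is the ground-truth caption for image `i`.

```python
import numpy as np

def retrieval_metrics(scores: np.ndarray, ks=(1, 5, 10)):
    """Compute Recall@K (R@K) and median rank (Med r) for image-to-sentence retrieval.

    scores[i, j] is the similarity of image i to sentence j; sentence i is
    assumed to be the correct match for image i.
    """
    n = scores.shape[0]
    ranks = np.empty(n, dtype=int)
    for i in range(n):
        # Sort candidate sentences for image i by descending similarity.
        order = np.argsort(-scores[i])
        # 1-indexed rank at which the ground-truth sentence i appears.
        ranks[i] = int(np.where(order == i)[0][0]) + 1
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    med_r = float(np.median(ranks))
    return recall_at_k, med_r

# Usage example with random scores for 100 image-sentence pairs.
rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 100))
print(retrieval_metrics(scores))
```

Higher R@K and lower Med r indicate better retrieval; the same computation applies to sentence-to-image retrieval by transposing the score matrix.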