This document summarizes the CPGAN model for text-to-image synthesis. CPGAN uses a coarse-to-fine generative framework with a memory-attended text encoder to parse the content of both the text and image modalities. It also employs a fine-grained conditional discriminator that matches words in the caption to the corresponding sub-regions of the image. Experimental results show that CPGAN outperforms comparable models on quantitative metrics while using a lighter-weight network. However, the quality of the generated images still leaves room for improvement.
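To make the word/sub-region matching idea concrete, here is a minimal, hypothetical sketch of an attention-based matching score between word embeddings and image region features. This is an illustration of the general technique, not CPGAN's actual discriminator; the function name, shapes, and scoring rule are assumptions.

```python
import numpy as np

def word_region_matching(word_feats, region_feats):
    """Hypothetical word/sub-region matching score (illustrative only,
    not CPGAN's actual fine-grained conditional discriminator).

    word_feats:   (T, D) array of word embeddings
    region_feats: (R, D) array of image sub-region features
    Returns a scalar matching score in [-1, 1].
    """
    # Normalize both feature sets to unit length.
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    # Cosine similarity between every word and every region: shape (T, R).
    sim = w @ r.T
    # Softmax over regions: each word gets an attention distribution
    # over image sub-regions.
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    # Attention-weighted similarity per word, averaged over all words.
    per_word = (attn * sim).sum(axis=1)
    return float(per_word.mean())

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))    # 5 words, 16-dim embeddings
regions = rng.normal(size=(8, 16))  # 8 sub-regions, 16-dim features
score = word_region_matching(words, regions)
print(score)
```

A discriminator built on this idea would push the score up for matched text-image pairs and down for mismatched ones, which is what lets it supervise word-level alignment rather than only sentence-level alignment.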