Image captioning is the task of generating a textual description of an image. It combines Natural Language Processing and Computer Vision to produce the captions. As in the proverbial “finger pointing to the moon”, automated image captioning requires the ability to discern what is really going on in a scene and to generate a fluent description of the action taking place. In this talk we present the mechanics underlying object detection and language generation using Convolutional and Recurrent Neural Networks.
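
To make the encoder-decoder idea concrete, here is a minimal sketch in plain Python, not the implementation presented in the talk. It assumes a single convolutional layer with global average pooling as the image encoder (a stand-in for feature extraction, not a full object detector) and an Elman-style recurrent cell with greedy decoding as the language generator. All names (`init_params`, `encode`, `greedy_caption`), the toy vocabulary, and the random untrained weights are hypothetical and chosen only for illustration.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def encode(image, kernels):
    """Convolutional encoder: conv, ReLU, then global average pooling per kernel."""
    features = []
    for kernel in kernels:
        fmap = conv2d_valid(image, kernel)
        activations = [max(0.0, v) for row in fmap for v in row]
        features.append(sum(activations) / len(activations))
    return features


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(h, x, W_hh, W_xh):
    """Elman recurrent update: h' = tanh(W_hh h + W_xh x)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_hh, h), matvec(W_xh, x))]


def init_params(vocab_size, n_kernels=3, hidden=4, embed_dim=4, seed=0):
    """Random, untrained weights (assumption: a real model would learn these)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    return {
        "kernels": [rand_matrix(3, 3) for _ in range(n_kernels)],
        "W_init": rand_matrix(hidden, n_kernels),   # image features -> initial state
        "embed": rand_matrix(vocab_size, embed_dim),  # token id -> embedding
        "W_xh": rand_matrix(hidden, embed_dim),
        "W_hh": rand_matrix(hidden, hidden),
        "W_out": rand_matrix(vocab_size, hidden),   # state -> vocabulary scores
    }


def greedy_caption(image, params, vocab, max_len=10):
    """Initialise the recurrent state from the image, then pick the top word each step."""
    h = [math.tanh(v) for v in matvec(params["W_init"], encode(image, params["kernels"]))]
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        h = rnn_step(h, params["embed"][token], params["W_hh"], params["W_xh"])
        scores = matvec(params["W_out"], h)
        token = max(range(len(vocab)), key=scores.__getitem__)
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return words


if __name__ == "__main__":
    vocab = ["<start>", "<end>", "a", "dog", "runs"]
    params = init_params(vocab_size=len(vocab))
    image = [[((i * j) % 7) / 6.0 for j in range(8)] for i in range(8)]
    print(greedy_caption(image, params, vocab, max_len=5))
```

Because the weights are random, the printed words are meaningless; the point is only the data flow: the image is reduced to a feature vector, that vector seeds the recurrent state, and each generated word is fed back in to produce the next one. In practice the encoder would be a deep pretrained network and the decoder would typically be an LSTM or GRU trained on image-caption pairs.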