This document proposes a fluency-guided approach to cross-lingual image captioning that aims to generate relevant and fluent captions in a target language with minimal human effort. It trains a captioning model on machine-translated captions while also estimating the fluency of each translation. Three fluency-guided training strategies are explored: training only on fluent translations, rejection sampling that discards less fluent translations probabilistically, and weighting the loss by estimated fluency. Experiments on English-Chinese datasets show that the rejection sampling approach achieves the best balance of relevance and fluency in human evaluations, without requiring manually written captions.
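The three strategies above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the sample data, fluency scores, and threshold are hypothetical, and fluency is assumed to be a score in [0, 1] produced by a separate fluency-estimation model.

```python
import random

# Hypothetical (caption, estimated fluency) pairs from a fluency model;
# the captions and scores below are illustrative only.
samples = [
    ("一只狗在草地上奔跑", 0.92),
    ("猫 坐 在 的 椅子", 0.35),
    ("孩子们在海边玩耍", 0.88),
]

def fluent_only(samples, threshold=0.8):
    """Strategy 1: keep only translations judged sufficiently fluent."""
    return [s for s in samples if s[1] >= threshold]

def rejection_sample(samples, rng=random.random):
    """Strategy 2: keep each sample with probability equal to its
    estimated fluency, so less fluent captions appear less often."""
    return [s for s in samples if rng() < s[1]]

def weighted_loss(per_sample_losses, fluencies):
    """Strategy 3: scale each caption's training loss by its fluency."""
    return sum(loss * f for loss, f in zip(per_sample_losses, fluencies))
```

Rejection sampling sits between the other two: unlike hard filtering it still exposes the model to imperfect translations occasionally, and unlike loss weighting it keeps the per-batch loss computation unchanged.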