This paper proposes a system that scores the compatibility between an image and a sentence in either direction. It represents both images and sentences as triplets of object, action, and scene in a shared meaning space. Features from object detectors, image classifiers, and distributional semantics supply the potentials of a Markov random field defined over this space. The model is trained discriminatively so that ground-truth image-sentence pairs score higher than mismatched ones. Evaluation on a novel dataset shows the system can annotate images with sentences and illustrate sentences with images reasonably accurately, though failure cases remain.
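To make the scoring pipeline concrete, the following is a minimal sketch, not the authors' implementation, of matching through a shared object-action-scene meaning space. The label sets, feature dictionaries, co-occurrence table, and weights are all hypothetical placeholders standing in for the outputs of trained detectors, classifiers, and distributional word similarity; the triplet space here is small enough that MAP inference over the MRF can be done by brute-force enumeration.

```python
# Hypothetical sketch of image-sentence scoring via a shared
# <object, action, scene> meaning space. All values below are
# illustrative placeholders, not the paper's learned quantities.
import itertools

OBJECTS = ["dog", "horse", "bike"]
ACTIONS = ["run", "ride", "stand"]
SCENES = ["field", "street", "beach"]


def node_potential(label, image_feat, sentence_feat, w_img=1.0, w_txt=1.0):
    """Combine an image-side confidence (detector/classifier output) with a
    sentence-side similarity (distributional semantics) for one label."""
    return w_img * image_feat.get(label, 0.0) + w_txt * sentence_feat.get(label, 0.0)


def edge_potential(a, b, cooccur):
    """Pairwise compatibility of two meaning-space labels, e.g. from
    co-occurrence statistics (hypothetical values here)."""
    return cooccur.get((a, b), 0.0)


def score_pair(image_feats, sentence_feats, cooccur):
    """Score an image-sentence pair as the energy of the best joint triplet
    (MAP inference by exhaustive search over the tiny label space)."""
    best, best_triplet = float("-inf"), None
    for o, a, s in itertools.product(OBJECTS, ACTIONS, SCENES):
        energy = (
            node_potential(o, image_feats["object"], sentence_feats["object"])
            + node_potential(a, image_feats["action"], sentence_feats["action"])
            + node_potential(s, image_feats["scene"], sentence_feats["scene"])
            + edge_potential(o, a, cooccur)  # object-action link
            + edge_potential(a, s, cooccur)  # action-scene link
        )
        if energy > best:
            best, best_triplet = energy, (o, a, s)
    return best, best_triplet
```

In the paper's framing, discriminative training would fit the weights so that ground-truth image-sentence pairs outscore mismatched ones; the fixed weights above stand in for that learned model. Ranking candidate sentences (or images) by this score yields annotation in one direction and illustration in the other.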