This document discusses how large image datasets supply the context needed to understand scenes and objects. It proposes mining millions of internet images to generate proposals for image completion and labeling from a query's nearest visual neighbors. Location metadata from geotagged images can provide geographic context even when no object labels are available. Event prediction and video synthesis are demonstrated by retrieving relevant images from large collections and assembling them into new videos from a text query. Overall, it argues that internet-scale image collections provide rich context that computer vision tasks can leverage through data-driven retrieval rather than explicit modeling.
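To make the retrieval step concrete, here is a minimal sketch of nearest-visual-neighbor lookup over an image collection. The directory name `image_collection/`, the query file `query.jpg`, and the simplistic "tiny image" descriptor are all illustrative assumptions, not details from the document; systems operating at internet scale would use richer scene descriptors and approximate indexing rather than the exact brute-force search shown here.

```python
# A minimal sketch of data-driven nearest-neighbor retrieval.
# Hypothetical inputs: a directory "image_collection/" of JPEGs and a
# query image "query.jpg"; the descriptor is a stand-in, not the
# document's actual feature representation.
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.neighbors import NearestNeighbors


def tiny_image_descriptor(path, size=(16, 16)):
    """Downsample an image to a small grayscale patch and flatten it."""
    img = Image.open(path).convert("L").resize(size)
    vec = np.asarray(img, dtype=np.float32).ravel()
    # Normalize so retrieval is insensitive to global brightness/contrast.
    vec -= vec.mean()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Index the collection: one descriptor per image.
paths = sorted(Path("image_collection").glob("*.jpg"))
descriptors = np.stack([tiny_image_descriptor(p) for p in paths])
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(descriptors)

# Retrieve the visually closest images for a query. Their pixels (or
# attached metadata such as geotags) then serve as proposals for
# completion, labeling, or localization.
query = tiny_image_descriptor("query.jpg")
distances, neighbor_ids = index.kneighbors(query[None, :])
for d, i in zip(distances[0], neighbor_ids[0]):
    print(f"{paths[i]}  (distance {d:.3f})")
```

The same retrieval pattern covers each task in the summary: for completion, neighbor pixels fill the missing region; for labeling or geolocation, the neighbors' annotations or geotags are transferred to the query.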