This document proposes ViSiL, a method for fine-grained video similarity learning that respects both the spatial structure of individual frames and the temporal structure of videos. ViSiL learns a video-to-video similarity function by applying a 4-layer CNN to the frame-to-frame similarity matrix of a video pair, capturing temporal similarity patterns. Experimental results show that ViSiL accurately retrieves near-duplicate, same-incident, same-action, and same-event videos from video databases.
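The overall idea can be illustrated with a minimal sketch. This is not the authors' implementation: it simplifies ViSiL to frame-level descriptors (the actual method uses region-level features), uses a single-channel hand-rolled convolution instead of a trained multi-channel CNN, and the kernel weights, function names, and the mean-of-row-maxima (Chamfer-style) pooling at the end are illustrative assumptions.

```python
import numpy as np

def frame_similarity_matrix(A, B):
    """Cosine similarity between every frame pair of two videos.
    A: (Na, d) frame descriptors, B: (Nb, d). Returns an (Na, Nb) matrix."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def conv2d_valid(x, w):
    """Naive 'valid' 2-D convolution of a single-channel map x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def video_similarity(A, B, kernels):
    """Toy stand-in for ViSiL's similarity CNN: stack small convolutions
    (with ReLU) on the frame-to-frame similarity matrix, then pool the
    result to a single score via the mean of row-wise maxima.
    `kernels` is a list of 2-D arrays; four entries mimic the 4-layer CNN."""
    s = frame_similarity_matrix(A, B)
    for w in kernels:
        s = np.maximum(conv2d_valid(s, w), 0.0)  # conv + ReLU
    return float(np.mean(np.max(s, axis=1)))

# Example: random 12-frame "videos" with 8-dim descriptors, and four
# untrained 3x3 averaging kernels standing in for learned filters.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 8))
B = rng.normal(size=(12, 8))
kernels = [np.full((3, 3), 1.0 / 9.0) for _ in range(4)]
score = video_similarity(A, B, kernels)
```

In the real method the convolutional filters are trained so that temporally consistent high-similarity diagonals in the matrix (matching segments) yield a high video-level score, while scattered spurious matches are suppressed.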