The problem of Spatio-Temporal Invariant Points in Videos
Multimedia Systems - Class Project
Spatio-Temporal Invariant Points in Videos
Priyatham Bollimpalli – 10010148
Pydi Peddigari Venkat Sai – 10010149
PVS Dileep – 10010180
The objective here is to find the spatio-temporal invariant points in a given input video. We
implement the following models on a set of contiguous frames of a video, called a scene.
We divide the problem into three cases: first, the background is fixed and the scene as a
whole is not dynamic; second, the background is fixed and the scene is reasonably
dynamic; and third, the background and the objects are both moving. We examine these
cases below:
Case1: When the Background is fixed and the entire scene is not dynamic
In this case, the background in the scene is fixed across several frames, while the foreground
objects may move throughout the video, but their movements do not cover the entire frame,
i.e., only some parts of the frame contain movement, while a sizeable part of the frame
remains static. The following procedure is followed to
detect the spatio-temporal interest points in the scene:
Every scene is a collection of several frames. In this instance, we consider a
scene of a video in which the background is constant and, in the foreground, a
ball moves quickly underneath a wooden block that stays fixed throughout the
video. These are just four of the several frames present in the video:
The difference of the first two frames of the scene is computed. The
resulting difference image can contain a wide range of pixel values.
Hence, we choose a grey threshold of, say, 0.2, and set every location of the
difference image whose value is above that threshold to white. We
also maintain an image that is written to across the iterations of
this method (we refer to that image as Rframe). Now, in the thresholded
difference frame, we find all the locations where the pixel value is white, and set
the corresponding locations on the Rframe to black.
We keep repeating this process of finding the difference between two successive
frames of the scene, thresholding the difference to get the pixels which are
white, and setting every location on Rframe where there is a white pixel to black.
For example, the differences of the frames at several instants are as follows:
Then, the final Rframe produced will be as follows:
The portions which are black in this image would depict those points where a
temporal invariance is not possible, as the objects keep moving in those areas. The
portions which are white would indicate those points which are constant throughout
the period of the video. These would be our points of interest for application of SIFT
on those points.
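The Rframe construction described above can be sketched as follows. This is only a minimal illustration under our own assumptions, not the exact code used for the results: frames are taken to be grey-level arrays scaled to [0, 1], the difference is taken as an absolute difference, the Rframe starts out fully white, and the name build_rframe is hypothetical.

```python
import numpy as np

def build_rframe(frames, grey_threshold=0.2):
    """Accumulate a static-region mask (the Rframe) over a scene.

    frames: sequence of 2-D arrays with grey levels in [0, 1] (assumed).
    Returns a uint8 image: 255 (white) where no motion was seen,
    0 (black) where any pair of successive frames differed noticeably.
    """
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    rframe = np.full(frames[0].shape, 255, dtype=np.uint8)  # start all white
    for prev, curr in zip(frames, frames[1:]):
        diff = np.abs(curr - prev)       # difference of two successive frames
        moving = diff > grey_threshold   # "white" pixels of the thresholded difference
        rframe[moving] = 0               # black out those locations on the Rframe
    return rframe


# Toy scene: a bright 2x2 "ball" sliding right over a dark, static background.
scene = []
for step in range(3):
    frame = np.zeros((6, 8))
    frame[2:4, step:step + 2] = 1.0
    scene.append(frame)

rframe = build_rframe(scene)
print(rframe)
```

In the toy scene only the pixels swept by the ball end up black, while the rest of the frame stays white and remains available for interest point detection.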
Finally, we apply SIFT on the original frames of the scene and collect all
the points returned by SIFT. Such a point is taken as an
interest point in this case only if it lies within the white portions of the
Rframe, i.e., we keep only the interest points which coincide with the white
portions of the Rframe.
The resulting interest points form our spatio-temporal invariant
points in this case.
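The filtering step can be illustrated with a short sketch. It assumes that the keypoints are already available as (x, y) positions from some detector such as SIFT, and that the Rframe stores 255 for white and 0 for black; the function name keep_static_keypoints is hypothetical.

```python
import numpy as np

def keep_static_keypoints(keypoints, rframe):
    """Keep only keypoints that fall on white (static) pixels of the Rframe.

    keypoints: iterable of (x, y) positions, with x = column and y = row.
    rframe: 2-D array, 255 for static pixels and 0 for moving regions.
    """
    height, width = rframe.shape
    kept = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if inside and rframe[row, col] == 255:
            kept.append((x, y))
    return kept


rframe = np.full((4, 5), 255, dtype=np.uint8)
rframe[1:3, 1:3] = 0   # region where motion was observed
candidates = [(0.0, 0.0), (1.2, 1.9), (4.0, 3.0), (2.0, 2.0)]
kept = keep_static_keypoints(candidates, rframe)
print(kept)
```

Rounding to the nearest pixel is one possible choice here; keypoints with sub-pixel positions near a black region could also be rejected more conservatively.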
The above algorithm is also run on other videos as follows:
Some of the several frames in the original video:
Some of the resulting differences between frames at several instants are:
The resulting Rframe is as follows:
Now, we find the interest points which lie within the white
portions of the Rframe, since only those are the temporally invariant parts of
the image. The resulting interest points of the scene at some of the several
instants are as follows:
So, the points marked in green above represent the spatio-temporal
interest points of the scene at some of the instants.
It is run on another video as follows:
The resulting Rframe is:
The resulting interest points at several frames are:
Case2: When the Background is fixed and the entire scene is reasonably dynamic
In this case, the background in the scene is fixed across several frames, while the foreground
objects may move throughout the video, and their movements do cover most of the frame.
In this case, the following procedure is followed:
Consider a scene as follows:
Now, if case-1 were used here, the resulting Rframe would be as follows:
Hence, if case-1 is used here, we can observe that most of the region is blacked out
since the objects' motion is present over almost the entire image, and hence we lose
some of the possible interest points.
Hence, we adopt the following method now. This method uses automatic
detection and motion-based tracking of moving objects in a video. This problem can
be split into two parts:
o detecting moving objects in each frame
o associating the detections corresponding to the same object over time
The association of detections to the same object is based solely on motion. The
motion of each track is estimated by a Kalman filter. The filter is used to predict the
track's location in each frame, and determine the likelihood of each detection being
assigned to each track.
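As an illustration of the prediction step, the following is a minimal sketch of a constant-velocity Kalman filter over an object centroid. The state layout [x, y, vx, vy], the noise variances, and the class name ConstantVelocityKalman are our own assumptions for this sketch and are not taken from the tracker used for the results.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter over the state [x, y, vx, vy] with position-only measurements."""

    def __init__(self, x, y, process_var=1e-2, measurement_var=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                        # initial uncertainty (assumed)
        self.F = np.array([[1, 0, 1, 0],                 # one frame per time step
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only the centroid is observed
        self.Q = np.eye(4) * process_var                 # process noise (assumed)
        self.R = np.eye(2) * measurement_var             # measurement noise (assumed)

    def predict(self):
        """Project the state one frame ahead and return the predicted centroid."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measurement):
        """Correct the state with the centroid of the assigned detection."""
        z = np.asarray(measurement, dtype=float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]


kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 21):
    kf.predict()
    kf.update((2.0 * t, 1.0 * t))   # centroid moving 2 px right, 1 px down per frame
predicted = kf.predict()
print(predicted)
```

The distance between a predicted centroid and a detection is one simple quantity from which the cost of assigning that detection to the track could be derived.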
In any given frame, some detections may be assigned to tracks, while other
detections and tracks may remain unassigned. The assigned tracks are updated using
the corresponding detections. The unassigned tracks are marked invisible. Each track
keeps count of the number of consecutive frames in which it remained unassigned. If
the count exceeds a specified threshold, the method assumes that the object has left
the field of view and deletes the track.
So in the process, a frame is read, and objects are detected, along with their centroids and
bounding boxes, by motion segmentation using the foreground detector. Next, the
detections are assigned to tracks. Then, the assigned tracks are updated, the
unassigned tracks are marked invisible, and the lost tracks
are deleted.
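The per-frame bookkeeping of tracks can be sketched as below. To keep the sketch short, it swaps in a greedy nearest-centroid assignment instead of a Kalman-based likelihood, and it starts a new track for every unassigned detection; both choices, the thresholds MAX_INVISIBLE and MAX_DISTANCE, and the function name step_tracks are assumptions made for illustration.

```python
import numpy as np

MAX_INVISIBLE = 3      # assumed deletion threshold, in consecutive frames
MAX_DISTANCE = 20.0    # assumed gating distance, in pixels

def step_tracks(tracks, detections, next_id):
    """Advance the track list by one frame.

    tracks: list of dicts with keys 'id', 'pos' (x, y) and 'invisible'.
    detections: list of (x, y) centroids found in the current frame.
    Returns the updated track list and the next free track id.
    """
    unassigned = list(range(len(detections)))
    for track in tracks:
        best, best_dist = None, MAX_DISTANCE
        for d in unassigned:
            dist = float(np.hypot(detections[d][0] - track['pos'][0],
                                  detections[d][1] - track['pos'][1]))
            if dist < best_dist:
                best, best_dist = d, dist
        if best is None:
            track['invisible'] += 1           # unassigned track: mark invisible
        else:
            track['pos'] = detections[best]   # assigned track: update from detection
            track['invisible'] = 0
            unassigned.remove(best)
    # Delete tracks that stayed unassigned for too many consecutive frames.
    tracks = [t for t in tracks if t['invisible'] <= MAX_INVISIBLE]
    # Each leftover detection starts a new track (assumption of this sketch).
    for d in unassigned:
        tracks.append({'id': next_id, 'pos': detections[d], 'invisible': 0})
        next_id += 1
    return tracks, next_id


tracks, next_id = [], 0
frames_of_detections = [[(10, 10)], [(12, 11)], [], [], [], [], [(80, 40)]]
for detections in frames_of_detections:
    tracks, next_id = step_tracks(tracks, detections, next_id)
    print([(t['id'], t['pos'], t['invisible']) for t in tracks])
```

In this toy run the first object is followed for two frames, stays invisible for four frames, and is then deleted, after which a far-away detection opens a new track.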
The following are the results of the tracked objects at some of the instants of the video:
Now, we apply SIFT on the tracked objects at all instants of the video. Then, the
following would be the interest points produced at several instants of the video:
Now, to gather still more interest points from the video, we can also combine the
interest points generated from case-1, and hence the resulting interest points would
be as follows:
So this would capture all the interest points possible, combining case-1 and case-2.
Case3: When the background is moving and the objects are also moving
When the camera as well as the objects are moving, tracking the objects is a challenging
issue. There will rarely be anything invariant even in one particular scene. This is still an
open research problem, although some heuristic methods are successful. The video has a moving
car filmed with a moving camera. Some of the frames are given below.
One heuristic method which we found on the web to solve it is given below.
Note that segmenting the car by frame differencing (background subtraction) won’t work
because the camera is also moving and hence the background is also moving. Hence normal
prediction algorithms would fail in this case. To tackle this issue, optical flow is used, since
the scene and the car have different directions of flow. In the figures below, the red points
denote the optical flow of the background and the green points denote the points having an
optical flow opposite to the red points. Note that the green points are able to track the car
throughout the video, and hence we locate the green points, which are spatio-temporally
invariant.
Now, combining the points obtained here with those obtained in case-1 and case-2 (it is
less likely that there would be any such points) gives the total points possible.
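The separation of red and green points by flow direction can be sketched as follows. The sketch assumes that per-point flow vectors have already been computed by some optical flow method, and that background points outnumber object points so that the median flow represents the background; the function name split_by_flow_direction is hypothetical.

```python
import numpy as np

def split_by_flow_direction(flow_vectors):
    """Label each tracked point as background (False) or object (True).

    flow_vectors: (N, 2) array of per-point displacements (dx, dy).
    The dominant flow is taken to be the median vector, assuming that
    background points outnumber object points.
    """
    flow = np.asarray(flow_vectors, dtype=float)
    dominant = np.median(flow, axis=0)
    # A negative dot product means the point moves against the dominant flow.
    return flow @ dominant < 0


flows = np.array([[-3.0, 0.1], [-2.8, -0.2], [-3.1, 0.0],
                  [-2.9, 0.2], [1.5, 0.1], [1.7, -0.1]])
is_object = split_by_flow_direction(flows)
print(is_object)
```

Here the four points drifting left would be drawn in red as background, and the two points drifting right would be drawn in green as belonging to the car.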
Conclusion & Future Work:
In this report we have tried three kinds of techniques on a given scene in a
video: when the background is almost stationary and the scene is not dynamic, when the
background is stationary and the scene is dynamic, and when the background and the scene
are both dynamic. Combining the points obtained from all three methods gives the
maximum possible set of spatio-temporal invariant points.
There is a lot of scope for future work in this area and we wish to pursue it further. The
following are the issues involved.
In case-1, the level at which thresholding is done defines the extent to which
motion of the object is considered. The lower the threshold, the greater the impact of
motion. This depends entirely on the video, i.e., if a video has a lot of illumination and
contrast changes between the frames, then the difference of the frames would give
many false contours. In this case a higher thresholding value is desirable.
In some other cases the frame rate would be so high that an almost negligible
amount of motion would be captured between the frames. In this case a lower value
for thresholding is desirable. Hence automatically developing an optimal threshold value
by taking the video quality and frame rate into account is one area of
future work.
In case-3, the above example gives a possible approach to a solution. It works because
the camera is continuously moving along with the car, so taking the points
with opposite optical flow works. But if the camera remains stationary for some time
and then suddenly moves again, we need a separate system to first track the background
movement, which in combination with the foreground motion can be used to find
the points on the object. Many object tracking methods exist and many are still
being pursued, since this is a very active area of research. It is likely that exploring this
area would give a generic algorithm for obtaining an object and the points which are
consistent on it throughout the scene.
Case-2 considers the background to be stationary and the motion of the objects to
be uniform. The parameters for the detection, assignment, and deletion
steps of the tracker may be modified according to the video. The tracking here
was based solely on motion, with the assumption that all objects move in a
straight line with constant speed. When the motion of an object deviates significantly
from this model, the method may produce tracking errors.
The likelihood of tracking errors can be reduced by using a more complex motion
model, such as constant acceleration, or by using multiple Kalman filters for every
object. Other cues, such as size, shape, and colour, can also be incorporated for
associating detections over time.
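To make the difference between the two motion models concrete, the sketch below builds the state transition matrix for one image axis under a constant-velocity and a constant-acceleration assumption. The state layouts [p, v] and [p, v, a], the unit time step, and the function name transition_matrix are assumptions made for this illustration.

```python
import numpy as np

def transition_matrix(order, dt=1.0):
    """State transition for one image axis.

    order=1: constant velocity, state [p, v].
    order=2: constant acceleration, state [p, v, a].
    """
    if order == 1:
        return np.array([[1.0, dt],
                         [0.0, 1.0]])
    if order == 2:
        return np.array([[1.0, dt, 0.5 * dt ** 2],
                         [0.0, 1.0, dt],
                         [0.0, 0.0, 1.0]])
    raise ValueError("order must be 1 or 2")


# An object starting at rest that accelerates by 2 px/frame^2.
state = np.array([0.0, 0.0, 2.0])
F = transition_matrix(order=2)
for _ in range(3):
    state = F @ state
print(state)
```

A constant-velocity model cannot represent the growing speed in this toy case, which is the kind of deviation that could lead to tracking errors.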