Linear Recurrent Convolutional Networks for Segment-Based
Multiple Object Tracking
Erick Lin, Amirreza Shaban, Dr. Byron Boots
Robot Learning Lab, Institute for Robotics and Intelligent Machines
Introduction
Automatic object tracking in moving images has remained a long-standing problem in the
domain of computer vision, yet it is of paramount practical importance in many day-to-
day scenarios, especially those involving surveillance or human-computer interaction. The
application of eye-tracking technology, for example, has enabled insights into how humans
process visual information such as text, which have led to the development of more effective
methods of diagnostics as well as accessible digital interfaces [2].
Regarded as a problem complementary to object recognition, or the process of identifying
objects in still images by the pixels that compose them, object tracking focuses on the sub-
sequent task of matching these objects in one image to the same objects in another image,
where their appearance may differ slightly in position or lighting, or be altered by events
such as occlusion by objects closer to the foreground. Objects that can be tracked include any tangible
items, people, landmarks, or even parts of other objects that are considered by humans
visually to be separate entities [8].
Motivation
Object recognition and object tracking have both been framed in the context of machine
learning. For object recognition, learning models traditionally take the form of convolutional
neural networks (CNNs), which have been highly favored for their comparatively low training
times and their ability to take advantage of object locality, the property that the pixels
making up an object share the same neighborhood. In object recognition
for a single image, convolutional neural networks are often used to output the set of all
the superpixels, or groups of pixels that are similar in location and color, in that image.
Afterward, one of a variety of robust methods such as the POISE [4] algorithm, which has
been successful at addressing the problem of recognizing objects located far from image
boundaries, is used to merge superpixels into segments, which are intended to represent
whole objects.
Learning models that serve the purpose of object tracking have seen especially swift
progress in recent years, with breakthroughs in computational efficiency made through
the use of linear regression and greedy matching techniques [6]. We will utilize
a variation of recurrent neural networks (RNNs), which are characterized by one or more
direct feedback loops from outputs to inputs, to build a fast learning model for tracking all
the visible objects in an image over time. While models based on a form of RNN known
as the long short-term memory (LSTM) network have performed successfully on tasks such
as annotating individual images with English-language descriptions [1], a shortcoming of
LSTM networks is that they are composed of nonlinear transformations on input data, so
these models require larger quantities of training data to avoid the statistical problem of
overfitting, and are hence also more time-consuming to train. Thus, LSTM networks are
infeasible for the large dataset sizes typically associated with moving images; on the other
hand, by being composed exclusively of linear transformations, our RNN architecture for
multiple object tracking is intended to circumvent this issue.
Objective
We have thus far prototyped our current design of the linear recurrent convolutional neu-
ral network model using the open-source deep learning framework Caffe. For classifying
superpixels, we will use a “deep”, or many-layered, convolutional neural network implementation
such as AlexNet [5], which performs well on high-resolution images. We will then
apply the image processing technique of average pooling to the superpixels, that is,
average the characteristics of the pixels within each superpixel to render it uniform and thereby
sharpen its contrast with neighboring superpixels for subsequent processing. Next, we will run a
segmentation procedure such as the previously mentioned POISE algorithm, and follow this
with average pooling of the segments by their superpixels.
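Both pooling steps, first over superpixels and then over segments, reduce to the same operation: averaging feature vectors within each region of a label map. The following is a minimal NumPy sketch of this operation, not our Caffe implementation; the function name and array conventions are our own.

```python
import numpy as np

def average_pool_regions(features, labels):
    """Average-pool per-pixel features over regions (superpixels or segments).

    features: (H, W, D) array of per-pixel descriptors (e.g., CNN activations)
    labels:   (H, W) integer map assigning each pixel to a region id
    Returns an (R, D) array whose row r is the mean feature of region r.
    """
    flat_feats = features.reshape(-1, features.shape[-1])
    flat_labels = labels.ravel()
    n_regions = int(flat_labels.max()) + 1
    sums = np.zeros((n_regions, features.shape[-1]))
    np.add.at(sums, flat_labels, flat_feats)  # scatter-add features per region
    counts = np.bincount(flat_labels, minlength=n_regions)
    return sums / np.maximum(counts, 1)[:, None]
```

The same routine pools segments by their superpixels: after a POISE-style merge, one simply passes the segment label map in place of the superpixel one.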
Next, a sequence of fully connected neural networks (NNs), a more classic architecture
which serves a wider variety of purposes in machine learning, will learn and then perform
the nonlinear mappings of the segmentation data, making it possible for our object tracking
computations to retain their linearity. This step is justifiable in spite of the relative expense
of NNs and the aforementioned shortcomings of nonlinear models, because the segmenta-
tion data is much smaller in size than the original image data. Our entire architecture is
summarized below for convenience.
Input → CNN → Pooling + POISE → Nonlinear NN → Tracker → Output
The centerpiece of our architecture, the newly proposed linear recurrent neural network, is
referred to as the Tracker layer. Our prototype of the Tracker layer so far is governed by
the following equations, which are also visualized in a network diagram.
[Network diagram: visualizes the flow of information among Eqs. (1)-(6).]
H_t = H_{t-1} + X_t^T M_t X_t                          (1)
C_t = C_{t-1} + \tilde{V}_t^T M_t X_t                  (2)
\tilde{V}_t = \phi_1(V_t)                              (3)
V_t = X_t W_{t-1}^T                                    (4)
W_{t-1} = C_{t-1} (H_{t-1} + \lambda s_{t-1} I)^{-1}   (5)
M_t = \delta(\phi_2(V_t))                              (6)
s_t = s_{t-1} + \sigma(M_t)                            (7)
In these equations, t indexes the current frame and can be seen as a time
parameter, X_t is the primary input matrix whose rows represent segments, λ is a regularization
constant commonly used in machine learning to prevent overfitting, and H_t and C_t
are hidden and memory cell units, respectively, which accumulate information from
previously seen frames. φ_1 is the operation that keeps only the maximum value
in each row of a matrix while zeroing out the rest, δ converts an n-dimensional vector into
an n-by-n diagonal matrix, and σ sums all the elements of a matrix. The primary output
is \tilde{V}_t, which encodes the best matchings between the existing segments and the segments in
the current frame.
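One step of this recurrence can be sketched in NumPy. This is an illustration rather than the actual prototype: the dimensions (X_t taken as an n-by-d matrix of segment feature rows, W as a k-by-d weight matrix so that each row of V_t scores one segment against the k tracked targets) and the form of φ_2, which is assumed here to supply one confidence value per segment, are our assumptions.

```python
import numpy as np

def phi1(V):
    """phi_1: keep only the maximum value in each row of V, zero the rest."""
    out = np.zeros_like(V)
    rows = np.arange(V.shape[0])
    cols = V.argmax(axis=1)
    out[rows, cols] = V[rows, cols]
    return out

def tracker_step(X, H, C, s, lam, phi2):
    """One frame of the Tracker recurrence, Eqs. (1)-(7).

    X:    (n, d) matrix whose rows are segment features for frame t
    H, C: hidden and memory cell units carried over from frame t-1
    s:    running scalar from Eq. (7); lam: regularization constant
    phi2: placeholder returning one confidence value per segment, shape (n,)
    """
    d = H.shape[0]
    W = C @ np.linalg.inv(H + lam * s * np.eye(d))  # Eq. (5)
    V = X @ W.T                                     # Eq. (4)
    V_tilde = phi1(V)                               # Eq. (3)
    M = np.diag(phi2(V))                            # Eq. (6): delta(phi2(V_t))
    H = H + X.T @ M @ X                             # Eq. (1)
    C = C + V_tilde.T @ M @ X                       # Eq. (2)
    s = s + M.sum()                                 # Eq. (7): sigma(M_t)
    return V_tilde, H, C, s
```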
As of now, φ_2 and its parametrization remain to be determined, and equation (5) still involves
a matrix inverse operation, which is known to be computationally expensive. Thus, one of
my primary objectives will be to work out the remaining details and modify the design of
the Tracker layer in order to further improve its efficiency and accuracy; in the case of the
matrix inverse, I will need to consider faster approximation methods that are feasible given
our knowledge of the structure of the X_t matrix.
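One immediate improvement is available regardless of the structure of X_t: the matrix being inverted in equation (5) is symmetric positive definite (a sum of Gram-type terms plus a positive multiple of the identity), so W can be obtained from a linear solve, which LAPACK backs with a factorization, rather than an explicit inverse. A sketch:

```python
import numpy as np

def tracker_weights(C, H, lam, s):
    """Compute W = C (H + lam*s*I)^{-1} without forming the inverse.

    H + lam*s*I is symmetric positive definite, so a factorization-backed
    solve is faster and numerically safer than explicit inversion.
    """
    d = H.shape[0]
    A = H + lam * s * np.eye(d)
    # Solve A @ W.T = C.T, which gives W = C @ inv(A) without computing inv(A).
    return np.linalg.solve(A, C.T).T
```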
In addition, the previous parts of the pipeline currently require training in a supervised
manner for a performance boost. This involves using a set of input image sequences paired
with ground truth labels, the reference segmentations for each image that are known to
be correct; I will obtain this data by applying the POISE segmentation proposal algorithm
to the publicly available Sintel dataset, which contains a collection of video sequences
originating from the open-source computer animated film of the same name. The Sintel
dataset also includes the ground truth optical flow for each image, which describes pixel-
wise movement from the current image to the next. In order to match ground truth segments
from each frame to the next, I will need to write algorithms that combine the segmentation
and optical flow data for any frame to produce the predicted superpixels and segments for
the next frame with occlusion handling, then match the predicted with the ground truth
superpixels and segments of the actual subsequent frame by their overlap, or the size of their
pairwise intersection divided by the size of their pairwise union.
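Given two label maps, all pairwise overlaps can be computed at once from a joint histogram of label co-occurrence. The following is a minimal sketch of this matching step under our own conventions; the greedy assignment shown is one simple choice, not necessarily the final matching scheme.

```python
import numpy as np

def overlap_matrix(pred_labels, gt_labels):
    """IoU (intersection over union) between every predicted segment and
    every ground-truth segment, given two (H, W) integer label maps."""
    n_pred = int(pred_labels.max()) + 1
    n_gt = int(gt_labels.max()) + 1
    # Joint histogram of label co-occurrence = all pairwise intersections.
    inter = np.zeros((n_pred, n_gt), dtype=np.int64)
    np.add.at(inter, (pred_labels.ravel(), gt_labels.ravel()), 1)
    pred_area = inter.sum(axis=1, keepdims=True)
    gt_area = inter.sum(axis=0, keepdims=True)
    union = pred_area + gt_area - inter
    return inter / np.maximum(union, 1)  # guard against unused label ids

def greedy_match(iou):
    """Assign each predicted segment its best-overlapping ground-truth segment."""
    return iou.argmax(axis=1), iou.max(axis=1)
```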
We may consider adding an additional phase following the Tracker layer which further
improves segmentation results by using known refinement techniques such as composite
statistical inference (CSI) [6]. Finally, we will compare the performance of our linear
recurrent convolutional network against established video segmentation benchmarks [3]
on their standard evaluation metrics.
Conclusion
In this proposal, I have described a linear RNN-based model which may outperform the
state-of-the-art approaches in object tracking and mark the first appearance of such a class
of models for this specific application. Our ideal end goal is a multiple object tracking
system that works in real time on incoming video streams up to a given resolution,
a tool that would prove beneficial in many critical as well as everyday settings.
References
[1] Donahue, J., Hendricks, L. A., Guadarrama, S., and Rohrbach, M. Long-
term recurrent convolutional networks for visual recognition and description. In Computer
Vision and Pattern Recognition (2015).
[2] Duchowski, A. T. A breadth-first survey of eye tracking applications. Behavior
Research Methods, Instruments, and Computers (2002).
[3] Galasso, F., Nagaraja, N. S., Cárdenas, T. J., Brox, T., and Schiele, B. A
unified video segmentation benchmark: Annotation, metrics, and analysis. In Computer
Vision and Pattern Recognition (2013).
[4] Humayun, A., Li, F., and Rehg, J. M. The middle child problem: Revisiting para-
metric min-cut and seeds for object proposals. In International Conference on Computer
Vision (2015).
[5] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with
deep convolutional neural networks. In Neural Information Processing Systems Confer-
ence (2012).
[6] Li, F., Kim, T., Humayun, A., Tsai, D., and Rehg, J. M. Video segmentation by
tracking many figure-ground segments. In International Conference on Computer Vision
(2013).
[7] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for
semantic segmentation. In Computer Vision and Pattern Recognition (2015).
[8] Luo, W., Xing, J., Zhang, X., Zhao, X., and Kim, T.-K. Multiple object tracking:
A literature review. ACM Computing Surveys (2015).