Tracking emerges by colorizing videos

Tracking Emerges by Colorizing Videos
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). arXiv preprint arXiv:1806.09594.
Presenter: 오유진

Key Point
✓ Visual tracking of objects emerges naturally from learning to colorize black-and-white (gray-scale) video
[Abstract]
• Teaching a machine to visually track objects is challenging
– It requires large, labeled tracking datasets for training, which are impractical to annotate at scale
– Preparing such annotated datasets is difficult
• Proposes learning to colorize gray-scale video by copying colors from a reference frame
• The network learns to track objects automatically, without supervision
Figure: example from the academic dataset DAVIS 2017

Related Work
• Self-supervised Learning
– Training visual models without human supervision
– Training labels are derived from the input data itself
– Typically, the network sees part of the data and predicts the rest
• Tracking without labels
– Pose a self-supervised learning problem that causes the model to learn tracking on its own
– The same trained model is used for both tracking and colorization, without fine-tuning or re-training
• Colorization
– Colorizing gray-scale images has been the subject of significant study in the computer vision community
– Use video colorization as a proxy task for learning to track

Self-supervised tracking; Model
✓ Convert all frames except the first to gray-scale and train a convolutional network to predict their original colors
• Given a gray-scale frame, the model computes a low-dimensional embedding for each location
• It points from each location in the target frame into the reference frame embeddings (solid yellow arrow)
• It then copies the color back into the predicted frame (dashed yellow arrow)
• After training, the pointing mechanism is used as a visual tracker

Self-supervised tracking; Model
• $c_i \in \mathbb{R}^d$ is the true color for pixel $i$ in the reference frame
• $c_j \in \mathbb{R}^d$ is the true color for pixel $j$ in the target frame
• $y_j \in \mathbb{R}^d$ is the model's prediction for $c_j$
• The model predicts $y_j$ as a linear combination of colors in the reference frame: $y_j = \sum_i A_{ij} c_i$
• $A$ is a similarity matrix between the target frame and the reference frame:
$$A_{ij} = \frac{\exp(f_i^T f_j)}{\sum_k \exp(f_k^T f_j)}$$
• $f_i \in \mathbb{R}^D$ is a low-dimensional embedding for pixel $i$, estimated by a CNN
• If there are two objects with the same color, the model does not constrain them to have the same embedding
Figure: video from the DAVIS 2017 dataset
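The pointing-and-copying step defined on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the embeddings are random stand-ins for the CNN outputs, and the function names `similarity_matrix` and `copy_colors` are hypothetical, not taken from the paper's code.

```python
import numpy as np

def similarity_matrix(f_ref, f_tgt):
    """A[i, j]: softmax over reference pixels i of the inner product f_ref[i] . f_tgt[j]."""
    logits = f_ref @ f_tgt.T                               # shape (N_ref, N_tgt)
    logits = logits - logits.max(axis=0, keepdims=True)    # stabilize exp()
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)    # each column sums to 1

def copy_colors(A, c_ref):
    """y[j] = sum_i A[i, j] * c_ref[i], a weighted copy of reference colors."""
    return A.T @ c_ref                                     # shape (N_tgt, d)

# Toy data (assumption): 6 reference pixels, 5 target pixels,
# 4-dimensional embeddings, 2 color channels.
rng = np.random.default_rng(0)
f_ref = rng.normal(size=(6, 4))
f_tgt = rng.normal(size=(5, 4))
c_ref = rng.uniform(size=(6, 2))

A = similarity_matrix(f_ref, f_tgt)
y = copy_colors(A, c_ref)
```

Because every column of `A` is a softmax, each predicted color is a convex combination of reference colors, so the model can only reproduce colors that are present in the reference frame.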

Self-supervised tracking; Learning
• Training assumes that color is generally temporally stable
• The figure visualizes frames one second apart from the Kinetics training set
– The first row shows the original frames
– The second row shows the ab color channels from Lab space
– The third row quantizes the color space into discrete bins and perturbs the colors to make the effect more pronounced → k-means is used to cluster the color channels
• Loss function: $\min_\theta \sum_j \mathcal{L}(y_j, c_j)$
– Train the model parameters $\theta$ so that the predicted colors $y_j$ are close to the target colors $c_j$ across the training set
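One way to make the loss concrete, given that colors are quantized into discrete bins, is a cross-entropy over bins. The sketch below assumes that choice (the slide only states a generic loss ℒ), assumes the bin centroids are already available (for example from k-means), and uses hypothetical helper names.

```python
import numpy as np

def quantize(colors, centroids):
    """Index of the nearest centroid for each color; colors (N, d), centroids (K, d)."""
    d2 = ((colors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def colorization_loss(A, ref_bins, tgt_bins, num_bins, eps=1e-12):
    """Mean cross-entropy between copied bin distributions and true target bins.

    A is the (N_ref, N_tgt) similarity matrix whose columns sum to 1.
    The cross-entropy form is an assumption, not stated on the slide.
    """
    ref_onehot = np.eye(num_bins)[ref_bins]        # (N_ref, K)
    y = A.T @ ref_onehot                           # (N_tgt, K), rows sum to 1
    picked = y[np.arange(len(tgt_bins)), tgt_bins] # probability of the true bin
    return float(-np.log(picked + eps).mean())

# Toy example: two bins, two reference pixels, two target pixels.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
ref_bins = quantize(np.array([[0.1, 0.0], [0.9, 1.0]]), centroids)
tgt_bins = quantize(np.array([[0.0, 0.2], [1.0, 0.8]]), centroids)

loss_perfect = colorization_loss(np.eye(2), ref_bins, tgt_bins, num_bins=2)
loss_uniform = colorization_loss(np.full((2, 2), 0.5), ref_bins, tgt_bins, num_bins=2)
```

A pointer that attends to the correct reference pixel drives the loss toward zero, while a pointer that attends uniformly pays log 2 per pixel in this two-bin toy case.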

Self-supervised tracking; Learning
• Learning to copy colors from a single reference frame requires the model to learn to internally point to the right region in order to copy the right colors
• The model thereby learns an explicit pointing mechanism that we can use for tracking
Figure (panels: Input, Reference Frame, Predicted Colors): examples of predicted colors from a colorized reference frame applied to input video, using the publicly available Kinetics dataset
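Reusing the pointer for tracking amounts to copying labels instead of colors. The following is a minimal sketch of that idea, assuming per-pixel embeddings are given as arrays and that segment labels are integer class ids; `propagate_labels` is a hypothetical name.

```python
import numpy as np

def propagate_labels(f_ref, f_tgt, ref_labels, num_classes):
    """Copy segment labels from reference pixels to target pixels via the pointer."""
    logits = f_ref @ f_tgt.T                               # (N_ref, N_tgt)
    logits = logits - logits.max(axis=0, keepdims=True)    # stabilize exp()
    A = np.exp(logits)
    A = A / A.sum(axis=0, keepdims=True)                   # softmax over reference pixels
    ref_onehot = np.eye(num_classes)[ref_labels]           # (N_ref, C)
    tgt_probs = A.T @ ref_onehot                           # (N_tgt, C)
    return tgt_probs.argmax(axis=1), tgt_probs

# Toy embeddings (assumption): two reference pixels with distinct embeddings
# and labels 0 and 1, and three target pixels to be labeled.
f_ref = np.array([[5.0, 0.0], [0.0, 5.0]])
ref_labels = np.array([0, 1])
f_tgt = np.array([[4.0, 0.0], [0.0, 4.0], [3.0, 1.0]])

tgt_labels, tgt_probs = propagate_labels(f_ref, f_tgt, ref_labels, num_classes=2)
```

Each target pixel receives the label of the reference pixels it points to most strongly, so no tracking-specific training is involved in this step.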

Implementation Details
• Use a 3D convolutional network to produce 64-dimensional embeddings
• The network predicts a down-sampled feature map of 32 × 32 for each of the input frames
– Each input frame is processed by a ResNet-18 architecture, followed by a five-layer 3D convolutional network
– To give the features global spatial information, the spatial location is encoded as a two-dimensional vector in the range [−1, 1] and concatenated to the features between the ResNet-18 and the 3D convolutional network
• Model input: four gray-scale video frames down-sampled to 256 × 256
• The first three frames are used as reference frames and the fourth frame is used as the target frame
• 400,000 iterations, batch size 32, Adam optimizer
– Learning rate of 0.001 for the first 60,000 iterations, reduced to 0.0001 afterwards
– The model is randomly initialized with Gaussian noise
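Two of the details above, the spatial-location encoding and the learning-rate schedule, are easy to illustrate. The sketch below is an assumption-laden example: the channel count of 256, the (row, column) coordinate ordering, and the function names are hypothetical and not given in the slides.

```python
import numpy as np

def add_spatial_coords(features):
    """Append (row, col) coordinates in [-1, 1] to an (H, W, C) feature map."""
    h, w, _ = features.shape
    rows, cols = np.meshgrid(
        np.linspace(-1.0, 1.0, h),
        np.linspace(-1.0, 1.0, w),
        indexing="ij",
    )
    coords = np.stack([rows, cols], axis=-1)              # (H, W, 2)
    return np.concatenate([features, coords], axis=-1)    # (H, W, C + 2)

def learning_rate(step):
    """0.001 for the first 60,000 iterations, then 0.0001."""
    return 1e-3 if step < 60_000 else 1e-4

# A 32 x 32 feature map with a hypothetical 256 channels.
feats = np.zeros((32, 32, 256))
feats_with_coords = add_spatial_coords(feats)
```

Appending coordinates lets later layers distinguish visually similar regions at different positions, which plain convolutional features cannot do on their own.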

Experiments
• Train the model on the training set from Kinetics (the labels are removed before use)
– The Kinetics dataset is a diverse collection of 300,000 videos from YouTube
– Evaluate the model on the standard testing sets of other datasets depending on the task
– Compare against the following unsupervised baselines
• Optical Flow: after extracting feature points that seem important in the previous frame (and that can also be extracted in the next frame), visualize how well the same feature points are found in the current frame
• Single Image Colorization: evaluates how well similarity computed from the embeddings of a single-image colorization model works in place of our embeddings
Reference: http://hs36.tistory.com/47

Experiments
• The figure on the left shows examples of video colorization results given a reference frame (Kinetics validation set)
– This model learns to copy colors over many challenging transformations
– For example, butter spreading or people dancing
– The model adapts to a variety of difficult tracking situations

Experiments
• Video segmentation average performance versus time in the video
• The model maintains more consistent performance over longer time periods than optical flow
– Optical flow on average degrades to the identity baseline; note that the videos vary in length
• The average performance is broken down by attributes that describe the type of motion in the video
• Sort the attributes by relative gain over optical flow

Experiments
• Human Pose Tracking
• Track human poses given key-points in an initial frame
– JHMDB academic dataset
• At a strict threshold, the model tracks key-points with performance similar to optical flow
Figure: examples of using the model to track movements of the human skeleton (from ai.googleblog)
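The slide does not name the metric behind "strict threshold". One common choice for key-point tracking is a PCK-style score (the fraction of key-points that land within a distance threshold of the ground truth), and the sketch below assumes that metric. The function name, the box-size normalization, and the toy coordinates are illustrative assumptions.

```python
import numpy as np

def pck(pred, gt, box_size, alpha):
    """Fraction of key-points whose error is at most alpha * box_size."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= alpha * box_size).mean())

# Three hypothetical key-points with errors of 1, 6, and 0 pixels.
gt = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
pred = np.array([[10.0, 11.0], [20.0, 26.0], [30.0, 30.0]])

strict = pck(pred, gt, box_size=50.0, alpha=0.1)   # threshold of 5 pixels
loose = pck(pred, gt, box_size=50.0, alpha=0.2)    # threshold of 10 pixels
```

A smaller `alpha` is the stricter setting: the same predictions score lower because fewer key-points fall inside the tighter radius.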

Conclusion
• The task of video colorization is a promising signal for learning to track without requiring human supervision
• Learning to colorize video by pointing to a colorful reference frame causes a visual tracker to automatically emerge, which we leverage for video segmentation and human pose tracking
• Improving the video colorization task may translate into improvements in self-supervised tracking
