Abstract:
With the increased adoption of social media platforms, there has been rapid growth in the exchange of
data between people, whether in the form of videos, images, text, or audio. Conclusions are drawn from
the information received, so it is important to ensure that correct data is received before taking a
decision; video verification is therefore important and is attracting a lot of attention. However,
relatively little effort has been put into this area so far, which motivated us to dive deeper into it.
In this work we present an approach for automatic video manipulation detection. Video filters are used
for verification; here we have used two such forensic filters, one based on Discrete Cosine Transform
(DCT) coefficients and the other based on video requantization errors, and combined their outputs with
Convolutional Neural Networks (CNNs). We evaluate the performance of the proposed approach. Finally,
we discuss the limitations of the proposed work and suggest paths for future work in this area.
Introduction:
The amount of video content produced by non-professionals has increased rapidly, and several news
agencies depend on User-Generated Content (UGC) for news reporting. However, shared videos may not be
authentic: people may manipulate them for various reasons, whether for comedy or with malicious intent.
Such tampered information may hurt the sentiments of people. This creates a demand for new tools that
can assess the authenticity of a video. Multimedia forensics aims to address this by providing
algorithms and tools that point out traces of tampering. Filters are used to spot inconsistencies; in
our work we have used two such filters, whose outputs are used to train a number of deep learning
visual classifiers to discriminate between original and tampered videos.
Methodology:
The proposed approach is a two-step process:
1. Forensic based feature extraction
2. Classification
The feature extraction step is based on two novel filters, and the classification step is based on
CNNs.
Forensic-based filters:
We have used two filters, Q4 and Cobalt. The Q4 filter analyzes the decomposition of the image through
the DCT and is applied to each individual video frame. Each frame is split into N × N blocks and a 2-D
DCT is applied to each block. Coefficients are identified by their frequency: the first coefficient
(0, 0) represents low-frequency information, while higher coefficients represent higher frequencies.
The typical block size for DCT is 8 × 8; however, analysis in 8 × 8 blocks yields coefficient arrays
that are too small. Instead, 2 × 2 blocks are used, so that the resulting output frame is only half
the original size. A selection of the coefficients (0, 1), (1, 0), and (1, 1) generates the final
output video map of the Q4 filter.
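As a rough illustration of the Q4 filter, the following Python sketch computes such a map for a single
grayscale frame; the grayscale input, the cropping of odd-sized frames, and the normalization of the
output to [0, 255] are our own assumptions, not details specified above.

    import numpy as np
    from scipy.fft import dctn

    def q4_filter(frame_gray: np.ndarray) -> np.ndarray:
        """Compute a Q4-style map: split the frame into 2 x 2 blocks,
        apply a 2-D DCT per block, and map the (0, 1), (1, 0) and (1, 1)
        coefficients to the R, G and B channels of a half-size image."""
        h, w = frame_gray.shape
        h, w = h - h % 2, w - w % 2                   # crop to even dimensions
        blocks = frame_gray[:h, :w].reshape(h // 2, 2, w // 2, 2)
        blocks = blocks.transpose(0, 2, 1, 3)         # shape: (h/2, w/2, 2, 2)
        coeffs = dctn(blocks.astype(np.float64), axes=(-2, -1), norm="ortho")
        out = np.stack([coeffs[..., 0, 1],            # horizontal frequency -> R
                        coeffs[..., 1, 0],            # vertical frequency   -> G
                        coeffs[..., 1, 1]], axis=-1)  # diagonal frequency   -> B
        out = np.abs(out)
        return (255.0 * out / (out.max() + 1e-8)).astype(np.uint8)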
The second filter we use is the Cobalt filter, which compares the original video with a modified
version of it, re-quantized using MPEG-4 at a different quality level. We requantize the video and
calculate the per-pixel differences, creating an Error Video, i.e. a video depicting the differences
between the two. In designing Cobalt, a "compare-to-worst" strategy was investigated: if constant
quality encoding is done, the comparison is performed with the worst possible quality, and conversely,
if constant bit rate encoding is done, the comparison is performed with the worst possible bit rate.
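A minimal sketch of a Cobalt-style Error Video computation is shown below, assuming ffmpeg is available
on the system; the temporary file name and the choice of qscale 31 (the worst constant-quality setting
of ffmpeg's MPEG-4 encoder, matching the compare-to-worst strategy) are our assumptions.

    import subprocess
    import cv2

    def cobalt_error_frames(src_path, tmp_path="requantized.mp4", qscale=31):
        """Re-encode the video with MPEG-4 at a worst-case quality level and
        return the per-pixel absolute differences (the "Error Video")."""
        # Requantize with ffmpeg's MPEG-4 encoder; qscale 31 is its worst
        # constant-quality setting.
        subprocess.run(["ffmpeg", "-y", "-i", src_path, "-c:v", "mpeg4",
                        "-qscale:v", str(qscale), tmp_path], check=True)
        orig, requant = cv2.VideoCapture(src_path), cv2.VideoCapture(tmp_path)
        error_frames = []
        while True:
            ok1, f1 = orig.read()
            ok2, f2 = requant.read()
            if not (ok1 and ok2):
                break
            error_frames.append(cv2.absdiff(f1, f2))  # per-pixel difference map
        orig.release()
        requant.release()
        return error_frames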
Classification:
The filters generate outputs in the form of RGB images. Since the filter maps were originally intended
to be visually evaluated by a human expert, we decided to treat this as an image classification
problem. This allows us to combine the maps with Convolutional Neural Networks pre-trained for image
classification. Specifically, we take an instance of GoogLeNet and an instance of ResNet, both
pre-trained on the ImageNet classification task, and adapt them to the needs of our task. As the
resulting networks are designed for image classification, we feed the filter outputs to each network
one frame at a time, during both training and classification.
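As an illustrative sketch of this adaptation step, the PyTorch code below replaces the final layer of
an ImageNet-pretrained ResNet with a binary tampered/untampered head; the ResNet-50 depth, the
optimizer, and the hyperparameters are our assumptions, and the random batch merely stands in for
resized filter-map frames.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load an ImageNet-pretrained ResNet and replace its classification
    # head with a binary (untampered vs. tampered) output layer.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    # One training step: each sample is a single filter-output frame,
    # resized to the network's expected 224 x 224 input.
    frames = torch.randn(8, 3, 224, 224)    # dummy batch of filter maps
    labels = torch.randint(0, 2, (8,))      # 0 = untampered, 1 = tampered
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()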
Experimental Study:
The datasets used, named Dev1 and Dev2, were provided by NIST. The first consists of 30 video pairs
(tampered and untampered) and the second of 86 video pairs, containing approximately 44K and 134K
frames respectively. These two datasets are treated as independent sets, but since they originate from
the same source, they likely exhibit similar characteristics. The second source of videos was the
InVID Fake Video Corpus, developed over the course of the InVID project. The Fake Video Corpus (FVC)
contains 110 real and 117 fake newsworthy videos from social media sources, including not only videos
that have been tampered with but also videos that are contextually false (e.g. whose description on
YouTube contains misinformation about what is shown). From that dataset, we selected 35 real videos
consisting of single-shot, unedited footage, and 33 fake videos that include tampered UGC. For these
experiments, we treated all frames originating from untampered videos as untampered, and all frames
originating from tampered videos as tampered.