Operation-wise Attention Network for
Tampering Localization Fusion
Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis
Kompatsiaris
MeVer Team @ Information Technologies Institute (ITI) /
Centre for Research & Technology Hellas (CERTH)
Content-Based Multimedia Indexing Conference, June 28-30, 2021
WeVerify Project
● Goals
○ Address advanced content verification challenges
○ Social media and web content analysis for detection of disinformation
○ Exposure of misleading and fabricated content
○ Platform for collaborative, decentralised content verification, tracking, and debunking.
● Developed tools
○ DeepFake detection service
○ Image Verification Assistant
Image Verification Assistant
● Goal: forgery localization in images.
● Provides reports from various image forensics algorithms.
○ JPEG based methods, Noise-based methods, Deep-learning based methods
○ Focuses on splicing and copy-move manipulations
● Inspect the multiple reports in tandem.
[Figure: tampered image, its ground-truth mask, and localizations from the forensic algorithms. Source: DEFACTO dataset]
Motivation
● Observations:
○ Multiple forensics output visualizations increase the complexity of the results, especially for
non-experts.
■ Each algorithm's output requires specific knowledge for proper interpretation.
○ Some of these forensics results are complementary to each other, so their combination could
potentially lead to better results
● Solution and contributions:
○ Develop a fully automatic fusion approach that is able to combine diverse forensics signals.
○ The combined result:
■ is more robust and accurate
■ is easier to interpret and requires no specialized knowledge
■ empowers non-experts in image verification
Methodology
● For this work, we select five forensics algorithms for fusion.
● These algorithms were selected, among others, based on their performance on
forgery localization datasets:
○ ADQ1 and DCT, which both base their detection on analysis of the JPEG compression in the
transform domain
○ BLK and CAGI, which base their detection on analysis of the JPEG compression in the spatial
domain
○ Splicebuster, which is a noise-based detector
● Train a deep learning architecture to fuse the outputs of the diverse tampering
localization algorithms (see the sketch below)
○ Fully automatic: no heuristic tuning or manual intervention needed
○ The complex, diverse input signals call for automatic feature extraction
○ Availability of large-scale datasets makes training feasible
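As a concrete illustration, here is a minimal sketch of how the five localization maps could be assembled into a single multi-channel input for a fusion network. The dict keys, the 256x256 working resolution, and the resize step are our own illustrative assumptions, not details from the paper:

```python
import cv2          # used here only for resizing; any resize routine would do
import numpy as np
import torch

# The five forensics algorithms selected for fusion
ALGORITHMS = ["ADQ1", "DCT", "BLK", "CAGI", "Splicebuster"]

def build_fusion_input(maps, size=(256, 256)):
    """Stack the per-algorithm heatmaps into a (1, 5, H, W) tensor.

    `maps` is a dict {algorithm name: 2-D float array in [0, 1]}.
    The 256x256 working resolution is an illustrative assumption.
    """
    channels = [cv2.resize(maps[name].astype(np.float32), size)
                for name in ALGORITHMS]
    x = np.stack(channels, axis=0)           # (5, H, W): one channel per algorithm
    return torch.from_numpy(x).unsqueeze(0)  # (1, 5, H, W): add batch dimension
```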
Models
● We considered two different models:
○ Eff-B4-Unet: a U-Net based architecture that uses EfficientNet-B4 as the encoder
○ Operation-wise Attention Fusion network (OwAF), which is an adapted image
restoration architecture
■ Operation-wise Attention layer: parallel convolutional and pooling operations are applied to
the input, weighted by an attention branch, concatenated, fused by a 1x1 convolution, and
added back to the layer input (residual connection)
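A minimal PyTorch sketch of such an operation-wise attention layer, following the description above. The particular operations, kernel sizes, and channel count are illustrative assumptions, not the exact OwAF configuration. (For the Eff-B4-Unet baseline, a library such as segmentation_models_pytorch provides an off-the-shelf U-Net with an EfficientNet-B4 encoder, e.g. smp.Unet(encoder_name="efficientnet-b4", in_channels=5, classes=1); that is one plausible instantiation, not necessarily the authors' implementation.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OperationWiseAttentionLayer(nn.Module):
    """Sketch of an operation-wise attention layer: several parallel
    operations process the input, an attention branch weights each
    operation's output, the weighted outputs are concatenated, fused by
    a 1x1 convolution, and added back to the input (residual).
    Operations, kernel sizes, and channels are illustrative assumptions.
    """
    def __init__(self, channels=16):
        super().__init__()
        # Parallel candidate operations (convolutions of varying size + pooling)
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 1, padding=0),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                          nn.Conv2d(channels, channels, 1)),
        ])
        n_ops = len(self.ops)
        # Attention branch: one scalar weight per operation
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, n_ops, 1),
        )
        self.fuse = nn.Conv2d(n_ops * channels, channels, 1)

    def forward(self, x):
        outs = [F.relu(op(x)) for op in self.ops]        # parallel operations
        w = torch.softmax(self.attention(x), dim=1)      # (B, n_ops, 1, 1)
        weighted = [w[:, i:i + 1] * o for i, o in enumerate(outs)]
        y = self.fuse(torch.cat(weighted, dim=1))        # 1x1 conv fusion
        return x + y                                     # residual connection

# Usage: layer = OperationWiseAttentionLayer(16)
#        out = layer(torch.randn(1, 16, 256, 256))      # same shape as input
```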
Training and Evaluation process
● Training dataset:
○ DEFACTO dataset (Mahfoudi et al., 2019)
○ Contains various synthetic manipulations like splicing and copy-move
○ 15,000 tampered images / 75,000 forensics localization maps (5 per image)
● Evaluation datasets:
○ DEFACTO test dataset
■ Contains 1,000 tampered images
○ CASIA V2.0 dataset (Dong et al., 2013)
■ Contains 5,123 tampered images
○ The IFS-TC Image Forensics Challenge set
■ Contains 450 tampered images
● Compared our approach with another fusion approach (Iakovidou et al., 2020)
● Metrics: macro-F1, IoU (Intersection over Union)
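For reference, a small sketch of how the two metrics could be computed per image. Reading "macro-F1" as the average of per-class F1 over the tampered and authentic pixel classes is our assumption about the exact protocol, as is the 0.5 binarization threshold:

```python
import numpy as np

def macro_f1_and_iou(pred, gt, threshold=0.5):
    """Pixel-level macro-F1 and IoU of a localization map against a mask.

    `pred` holds scores in [0, 1]; `gt` is a binary ground-truth mask.
    Both the 0.5 threshold and the macro-averaging over the tampered and
    authentic classes are assumptions about the evaluation protocol.
    """
    p = pred >= threshold
    g = gt.astype(bool)

    def f1(pos, true):  # F1 score for one class
        tp = np.logical_and(pos, true).sum()
        denom = 2 * tp + np.logical_and(pos, ~true).sum() \
                       + np.logical_and(~pos, true).sum()
        return 2 * tp / denom if denom else 1.0

    macro_f1 = (f1(p, g) + f1(~p, ~g)) / 2   # tampered + authentic classes
    union = np.logical_or(p, g).sum()
    iou = np.logical_and(p, g).sum() / union if union else 1.0
    return macro_f1, iou
```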
Results on DEFACTO test dataset
Model Macro-F1 IoU
BLK 0.463 0.053
ADQ1 0.573 0.123
CAGI 0.479 0.072
DCT 0.509 0.101
Splicebuster 0.554 0.087
Eff-B4-Unet 0.908 0.690
OwAF 0.912 0.707
Results on CASIA v2 dataset
Model Macro-F1 IoU
BLK 0.509 0.089
ADQ1 0.573 0.130
CAGI 0.502 0.094
DCT 0.546 0.113
Splicebuster 0.576 0.093
Iakovidou et al. (2020) 0.598 0.166
OwAF 0.611 0.172
Results on IFS-TC dataset
Model Macro-F1 IoU
BLK 0.459 0.063
ADQ1 0.485 0.076
CAGI 0.506 0.091
DCT 0.467 0.065
Splicebuster 0.560 0.129
Iakovidou et al. (2020) 0.549 0.112
OwAF 0.529 0.106
Discussion and Limitations
● The reported experimental results are promising: the fusion models in many cases
outperform the individual forensics techniques.
● Our automatic approach outperforms a competing fusion approach in many
cases.
● The results of our approach are easier to interpret by non-experts.
● An important limitation of this work is the limited generalization ability of the
fusion model to manipulations unseen during training.
● Our approach's performance depends on the performance of the individual
forensics algorithms.
Future work
● To improve generalization, we will increase the size of the training dataset and
include different manipulations from other datasets.
● We will experiment with task-specific regularization techniques, like
localization map dropout (see the sketch after this list).
● We plan to experiment with multi-stream fusion architectures that, besides the
forensics localization maps, also consider the input image itself.
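As a hypothetical illustration of the localization map dropout idea mentioned above: during training, entire input channels (one per forensics algorithm) could be zeroed at random so the fusion model cannot over-rely on any single signal. A minimal sketch; the module name and drop probability are our assumptions:

```python
import torch
import torch.nn as nn

class LocalizationMapDropout(nn.Module):
    """Zero out whole localization-map channels at random during training.

    Each input channel is one algorithm's localization map; dropping full
    channels discourages over-reliance on any single forensics signal.
    The default drop probability is an illustrative assumption.
    """
    def __init__(self, p=0.2):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (B, n_maps, H, W)
        if not self.training or self.p == 0:
            return x
        # Bernoulli keep-mask per (sample, map); broadcast over H and W
        keep = torch.rand(x.size(0), x.size(1), 1, 1, device=x.device) > self.p
        return x * keep.to(x.dtype)
```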
Thank you!
Polychronis Charitidis / charitidis@iti.gr
Media Verification Team / https://mever.gr / @meverteam
WeVerify project / http://www.weverify.eu / @WeVerify

Editor's Notes

  • #2 Hello, my name is Polychronis Charitidis and I am going to present the study that I conducted together with my colleagues Giorgos Kordopatis-Zilos, Symeon Papadopoulos and Ioannis Kompatsiaris, titled “Operation-wise Attention Network for Tampering Localization Fusion”. I am a member of the Media Verification team of the Information Technologies Institute, which is part of the Centre for Research & Technology Hellas, located in Thessaloniki, Greece. My main research interests are in the field of media forensics and content verification.
  • #3 The work I am going to present was conducted in the context of the WeVerify project, an ongoing EU Horizon 2020 project. The main goals of WeVerify are to address advanced content verification challenges, to analyse social media and web content in order to detect disinformation campaigns, and finally to expose and debunk misleading and manipulated content. The outcome of the project aims to be a platform for collaborative, decentralised content verification, tracking, and debunking. Many tools were developed or enhanced during WeVerify. One example is a deepfake detection service, which detects facial manipulations in images or videos. Another is a set of improvements to an already existing tool, the Image Verification Assistant, which uses image forensics algorithms to provide reports regarding potential forgeries in images. The work presented here showcases a particular enhancement of this tool.
  • #4 As I mentioned, the main goal of the Image Verification Assistant is to localize potential forgeries in images. Due to the large number of possible forgery types and transformations that can be applied to an image, it is beneficial for a forensics report to include results from multiple forensics algorithms that cover a wide range of them. So the Image Verification Assistant provides a report that consists of localizations from JPEG-based methods, noise-based methods, and deep-learning-based methods, and focuses on manipulation types like splicing and copy-move. The process of verification is straightforward: a user submits an image for inspection to the tool. This image might be tampered, like the example image below on the left (the forgery is shown in yellow in the mask next to it), and the user gets a report from the various forensics algorithms on the right.
  • #5 Now, for an expert user it might be easy to draw a conclusion from the Image Verification Assistant, but there are some important observations. The first is that, although discovering manipulation traces is desirable, adding many forensics visualizations increases the complexity of a media verification tool, especially for non-expert users. The reason is that each algorithm has a different output that requires specific knowledge for proper interpretation. Consequently, this quickly becomes overwhelming for non-experts. Another observation is that in many cases some of these forensics results are complementary to each other, so their combination could potentially lead to better results. In this work, we aim to address these observations. The main objective is to develop a fully automatic fusion approach using deep learning that is able to leverage diverse forensics signals, so as to improve the robustness and reliability of the overall localization system. The final visualization retains the most important features of the individual algorithms, leading to more accurate results. This result is easier to interpret and requires no additional specialized knowledge. This outcome can empower non-experts, like fact-checkers and journalists, to actively contribute to image verification tasks.
  • #6 For this work, in order to simplify the process, we select a subset of the forensics algorithms that appear in the Image Verification Assistant to be considered for fusion. Based on the evaluation results of another work, we select a set of five methods as the building blocks of the fusion model. These are ADQ1 and DCT, which both base their detection on analysis of the JPEG compression in the transform domain; BLK and CAGI, which base their detection on analysis of the JPEG compression in the spatial domain; and Splicebuster, which is a noise-based detector. In this work, we adopt a deep learning-based fusion approach for the following reasons. First, we aspire to develop a fully automatic approach without the need for heuristic tuning or manual intervention. Second, the complex and diverse nature of the input signal calls for an effective approach to automatically extract the most important features, which is something that deep learning excels at. Finally, the availability of large-scale datasets, which are required by deep learning approaches, makes the training of a deep learning-based model feasible.
  • #7 For the fusion model, we consider two different deep learning architectures. The first model is a U-Net based architecture. U-Net is a convolutional neural network that was initially applied for semantic segmentation in a medical context, but nowadays it has a much broader application field. The network only uses convolutions, without any fully connected layers. The U-Net architecture has a lot of variants. For the fusion task, we use a variant of the U-Net architecture that uses EfficientNet-B4 as the encoder. The second model that we employ is a simple neural network architecture that was proposed for the problem of image restoration. This architecture is suitable for the fusion problem because it uses attention to capture important features by examining which operations are the most beneficial, depending on the input signal. Another important aspect of this architecture is that it focuses on low-level features, which is important for the fusion task, as semantic or high-level representations are often not useful for the problem. After experimenting, we adapted this architecture by reducing the number of layers, replacing the dilated convolutions of the original approach, and adding more operations to be weighted with attention. The operation-wise attention layer can be seen in this slide. In each layer, a number of convolutional and pooling operations are applied to the input features. These are weighted by an attention layer and concatenated. The resulting features are processed by a 1x1 convolution. Finally, the layer input is added to the resulting feature map, just like in residual architectures.
  • #8 For training these architectures, we use the DEFACTO dataset, which contains various synthetic manipulations like splicing and copy-move. We use 15,000 tampered images for training. For each image, we use the forensics algorithms to produce 5 tampering localization results, which means that the total input for the fusion model is 75,000 localizations. For the evaluation of our method, we use three datasets. The first is 1,000 separate images from the DEFACTO dataset. The second is the CASIA v2.0 dataset, which contains 5,123 images, and the last one is the IFS-TC dataset, which contains 450 images. In our reported results, we compare our approach with another statistical and heuristic-based fusion approach that considers the same forensics algorithms. In our experiments, we report the F1 and Intersection over Union (IoU) metrics.
  • #9 In the first experiment, we investigate the performance of the two proposed fusion models on the DEFACTO test dataset. We can see that the OwAF network outperforms the Eff-B4-Unet in all evaluation metrics. Evaluation results for the individual algorithms are very low when compared to the fusion approaches. The best-performing individual model is ADQ1. The figure in this slide shows random examples from the DEFACTO test dataset. The first column shows the input images. The next five columns show the outputs of the individual tampering localization algorithms. The final two columns show the ground truth mask, which reveals the actual location of the forgery, and the fusion result of the best-performing OwAF. It is evident from these examples that the fusion architecture learned to combine the diverse signals in order to localize the tampered region. One interesting observation is that for each input example, there are usually different algorithms that better localize the forgery. This means that the fusion model learned to detect the proper signals that contribute to a correct localization. For example, in the first row, Splicebuster and CAGI spot the tampering, but in row three, ADQ1 and DCT do so. In both cases, the fusion model has identified these signals and provides a correct result.
  • #10 To further investigate the fusion performance, we compare our best-performing approach with another fusion framework. For evaluation, we use the CASIA v2 dataset in order to examine the generalization capabilities of the fusion model that was trained on the DEFACTO dataset. We can observe slightly better performance in every metric from the individual models compared to those in the previous experiment. This means that this dataset contains images with manipulations that can be localized better by the individual algorithms. ADQ1 and DCT are the best-performing individual approaches. Regarding the fusion methods evaluation, our approach outperforms the competing fusion framework. One notable observation is that the performance of OwAF is significantly worse than the evaluation results reported on the previous slide. This is a clear indication that our trained models have overfitted to the training set manipulations. The fusion model possibly learned to localize specific forgeries, like shapes and patterns from the outputs of individual algorithms that frequently appear in the DEFACTO dataset. Yet, the proposed approach is still better than the individual algorithms and also outperforms the competing fusion framework in terms of both F1 score and IoU. The figure shows some successful examples of tampering localization outputs produced by the fusion model and the individual methods. In most examples, the ADQ1 and DCT visualizations better localize the tampering.
  • #11 For the evaluation results on the IFS-TC dataset, a significant decrease in the performance of the individual algorithms can be observed. One exception is the Splicebuster performance, which increased compared to the previous evaluations; Splicebuster even outperforms both fusion approaches. Iakovidou et al. also achieve marginally better performance than our fusion model on this dataset. One possible explanation is that our fusion model learned to focus more on the individual localization maps that achieved better performance on the training set, namely ADQ1 and DCT. On the contrary, in the IFS-TC case, the best-performing individual algorithm is Splicebuster, and this possibly explains the poor performance of the OwAF approach. To verify this, we show the best-localized results in this dataset. We can see in the figure that in each successful case the forgery has been localized by DCT and ADQ1 as well.
  • #12 To sum up, from our experiments it is evident that our approach is promising and in many cases outperforms the individual forensics techniques and other competing frameworks. Additionally, the results of our approach are easier for non-experts to interpret. On the other hand, the main challenge of the proposed approach stems from overfitting to the training data. This leads to a lack of generalization to unseen manipulations; namely, we get relatively poor predictions for datasets that have different types of manipulations compared to those that appeared in the training dataset. Additionally, the low evaluation performance of the individual algorithms is a major indication that the forgery localization problem is very difficult, and it is even more challenging to design a general fusion solution that receives noisy signals from these algorithms.
  • #13 For future steps, we plan to focus on countering the issue of overfitting. We will experiment with larger datasets and combine datasets with diverse manipulations. Also, we will experiment with task-specific regularization approaches like localization map dropout. Finally, since so far we have used only signals from forensics algorithms, we plan to experiment with multi-stream fusion architectures that, besides these signals, will also consider the input image itself.
  • #14 Thank you very much for your attention. If you are interested in experimenting with our Image Verification Assistant service, please don’t hesitate to send me an email. I will be happy to answer any questions you may have.