Social Media Verification Challenges, Approaches and Applications

Social Media Verification
Challenges, Approaches and Applications
Dr. Yiannis Kompatsiaris, ikom@iti.gr
Multimedia, Knowledge and Social Media Analytics Lab, Head
CERTH-ITI
3rd International Conference on Internet Science (INSCI 2016)

3rd International Conference on Internet Science
INSCI 2016
Social Media Verification
Overview
•  Introduction
–  Motivation – Challenges
•  Social Media in News and Journalism
•  The problem of verification
•  Approaches
–  Context extraction from Web and Social Media
–  Image Forensics
–  Computational verification
•  Demos - Resources
•  Conclusions

Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/

http://blog.tyronesystems.com/how-much-data-is-created-every-minute-by-the-social-media

Social Media aspects
Caption, Time, User, Profile, Favs, Comms, Tags

rise of the networks

Multi-modal graphs

Social Networks as Graphs

Social Networks as Real-Life Sensors
•  Social networks are a data source with an extremely dynamic nature that reflects events and the evolution of community focus (users' interests)
•  The huge penetration of smartphones and mobile devices provides real-time and location-based user feedback
•  Transform individually rare but collectively frequent media into meaningful topics, events, points of interest, emotional states and social connections
•  Present them in an efficient way for a variety of applications (news, marketing, science, health, entertainment)

Real-life Social Networks
•  Social networks have emergent properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts
•  Emotions, Health, Sexual relationships depend on our connections (e.g. number of them) and on our position - structure in the social graph
•  Central – Hub
•  Outlier
•  Transitivity (connections between friends)

Examples - Science
Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using Flickr for prediction and forecast. International Conference on Multimedia (MM '10). ACM.
“…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on Twitter before it hits you…”
Many Twitter examples at: What can Twitter tell us about the real world? Twitter and the Real World, CIKM'13 Tutorial, https://sites.google.com/site/twitterandtherealworld/home


Example – News (Boston bombing)
“Following the Boston Marathon bombings, one quarter of Americans reportedly looked to Facebook, Twitter and other social networking sites for information, according to The Pew Research Center. When the Boston Police Department posted its final “CAPTURED!!!” tweet of the manhunt, more than 140,000 people retweeted it.”
“Authorities have recognized that one of the first places people go in events like this is to social media, to see what the crowd is saying about what to do next”
“I have been following my friend's Facebook [account] who is near the scene and she is updating everyone before it even gets to the news”

Many other examples: smellymaps
Smell related words in geo-located social media
http://researchswinger.org/smellymaps/

Be careful of correlation diagrams

Conceptual Architecture
Stages: Crawling, Indexing, Mining, Ranking, Visualization
Components shown in the diagram: API Wrapper, Website Wrapper, Scheduler, Media Fetcher, Crawling Specs, Sources, Visual Indexing, Near-duplicates, Text Indexing, SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts, Relevance, Diversity, Popularity, Veracity, Interaction, Responsiveness, Aggregation, Aesthetics

Challenges – Content (Indexing - Mining)
• Multi-modality: e.g. image + tags, video, audio
• Rich social context: spatio-temporal, social connections, relations and social graph
• Specific messages: short, conversations, errors, no context
• Inconsistent quality: noise, spam, fake, propaganda
• Huge volume: Massively produced and disseminated
• Multi-source: may be generated by different applications and user communities
• Dynamic: Fast updates, real-time

Policy – Licensing – Legal challenges
•  Fragmented access to data
–  Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
–  Different data collection/crawling policies
•  Limitations imposed by API providers (“Walled Gardens”)
•  Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
•  Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact in Twitter)
•  Constant change of model and ToS of social APIs
–  No backwards compatibility, additional development costs
•  Ephemeral nature of content
•  Social search results often lead to removed content → inconsistent and unreliable referencing
•  User Privacy & Purpose of use
•  Fuzzy regulatory framework regarding mining user-contributed data

“It has changed the way we do news” (MSN)
“Social media is the key place for emerging stories – internationally, nationally, locally” (BBC)
“Social media is transforming the way we do journalism” (New York Times)
Source: picture alliance / dpa

Source: Getty Images
“It’s really hard to find the nuggets of useful stuff in an ocean of
content” (BBC)
“Things that aren’t relevant crowd out the content you are looking for” (MSN)
“The filters aren’t configurable enough” (CNN)

Verification was simpler in the past...
Source: Frank Grätz

News Requirements
Quickly surface trusted and relevant material from social media – with context.
• “quickly”: in real time
• “surface”: automatically discovers, clusters and searches
• “trusted”: automatic support in the verification process
• “relevant”: to the specific event
• “material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)
• “social media”: across all relevant social media platforms
• “with context”: location, time, sentiment, influence

Can multimedia on the Web be trusted?
Real photo captured April 2011 by WSJ, but heavily tweeted during Hurricane Sandy (29 Oct 2012)
Tweeted by multiple sources & retweeted multiple times
Original online at:
http://blogs.wsj.com/metropolis/2011/04/28/weather-journal-clouds-gathered-but-no-tornado-damage/


The Problem
•  Everyone can easily publish content on the Web
•  Content can be easily repurposed and manipulated
•  Not only for fun but also for propaganda
•  News outlets are competing for views and clicks → Pressure for airing stories very quickly leaves very little room for verification. → Very often, even well-reputed news providers fall for fake news content.
•  Multiple tools and services available for individual tasks → complex verification process
Very hard and time consuming to check the veracity of Web multimedia

Image verification: tools of the trade
•  Metadata analysis
–  E.g. do the dates/locations match? Is the image already copyrighted? By whom?
•  Context Extraction from Web and Social Networks
–  Reverse image search using e.g. Google or TinEye
–  Clustering
•  Has the image been posted elsewhere? Does it originate from a different context?
•  Supervised machine learning for automatic classification
–  Exploiting patterns of usage, content, linking of fake/real content
•  Content analysis (forensics) for tampering localization
–  Most commonly, Error Level Analysis (ELA)

Monitoring and intelligence system for Web multimedia verification

Media REVEALr
•  Developed within the REVEAL project: http://revealproject.eu/
•  Framework for collecting, indexing and browsing multimedia content from the Web and social media
•  Support for verification:
–  Near-duplicate detection against an indexed collection
–  Clustering of social media posts by visual similarity → comparative view of the same incident
–  Aggregation and visualization of Named Entities around an incident

Overview of Media REVEALr
Media collection
Media pre-processing & feature extraction
Media analysis, mining & indexing
Persistence (storage, indexing)
Access (API)
Visualization, front-end
Processing streams: TEXT, VISUAL

Named Entity Detection
•  Brevity and noisy nature of text in social media pose a serious challenge
•  Employed solution:
–  Pre-processing: tokenization, user mention resolution, text cleaning
–  Stanford NER + user mention resolution
–  Regular expressions to remove special characters and symbols (e.g., #, @, URLs, etc.)
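The pre-processing steps listed above can be illustrated with a minimal sketch. This is not the Media REVEALr code: the function names, the regular expressions and the `known_users` lookup table (a stand-in for user mention resolution) are assumptions made for illustration only.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
SYMBOL_RE = re.compile(r"[^\w\s'.,!?-]")  # anything that is not text or basic punctuation

def clean_tweet(text, known_users=None):
    """Strip URLs and symbols, and resolve @mentions to display names."""
    known_users = known_users or {}  # hypothetical handle -> display name table
    text = URL_RE.sub(" ", text)
    # Replace @handle with a known display name, else keep the bare handle.
    text = MENTION_RE.sub(
        lambda m: known_users.get(m.group(1).lower(), m.group(1)), text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the marker
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())

def tokenize(text):
    """Split cleaned text into word tokens, keeping simple apostrophes."""
    return re.findall(r"\w+(?:'\w+)?", text)

cleaned = clean_tweet("Sharks in #NYC via @bbc http://t.co/abc !!",
                      {"bbc": "BBC News"})
tokens = tokenize(cleaned)
```

The cleaned text would then be passed to the NER tagger; resolving mentions first gives the tagger a real name to recognise instead of a handle.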

Visual Indexing
•  Content-based image retrieval to solve the Near-Duplicate Search (NDS) problem
•  Based on local descriptors (SURF), aggregation (VLAD), dimensionality reduction (PCA), quantization (PQ) and indexing (IVFADC)
•  State-of-the-art visual similarity search
–  High precision/recall
–  Very efficient and scalable implementation (search many millions of images in a few msec, maintain full index in memory using ~1GB/10M images)
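To make the aggregation step concrete, here is a minimal sketch of VLAD in plain Python, assuming the local descriptors and the codebook centroids are already available. Descriptor extraction (SURF), PCA, product quantization and IVFADC indexing are left out, and the toy 2-D values are illustrative only.

```python
import math

def vlad(descriptors, centroids):
    """Aggregate local descriptors into one L2-normalised VLAD vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Assign the descriptor to its nearest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centroids[i])))
        # Accumulate the residual (descriptor minus centroid) for that centroid.
        for j in range(dim):
            residuals[k][j] += d[j] - centroids[k][j]
    flat = [x for row in residuals for x in row]
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm > 0 else flat

# Toy example: two centroids, three 2-D descriptors.
v = vlad([[1, 0], [0, 1], [10, 12]], [[0, 0], [10, 10]])
```

The output length is (number of centroids) x (descriptor dimension), which is why the slide's pipeline follows VLAD with PCA and quantization to keep the index in memory.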

Improving NDS Resilience (NDS+)
•  Often, NDS performance suffers from overlay graphics and fonts
•  To address this issue, we integrate a descriptor-level classifier that tries to remove the font/graphic descriptors from the VLAD vector

Example: Filtering Out Font Descriptors
•  Assuming that in most cases the classifier is correct, the resulting VLAD vector is of much higher quality compared to the one without filtering

Classifier Details
•  Random Forest used as base classifier
•  Cost Sensitive meta-classifier to penalize misclassification of True Positives
•  Challenge due to Class Imbalance (overlay descriptors << useful image content descriptors)
–  Cost Sensitive meta-classifier performs over-sampling of minority class to balance the training set
•  Training set created by collecting images with overlays (e.g., memes) from the Web and manually annotating them (selecting areas with fonts/overlays)
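The balancing idea can be sketched as plain random over-sampling. This is an illustrative stand-in, not the cost-sensitive meta-classifier used in the talk; the function name, the class labels and the seed are assumptions.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Draw extra samples with replacement from the smaller class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

# Six image-content descriptors vs. two overlay descriptors (toy ids).
samples = list(range(8))
labels = ["image"] * 6 + ["overlay"] * 2
bal_samples, bal_labels = oversample_minority(samples, labels)
```

Over-sampling should only ever be applied to the training set; duplicating samples before a train/test split would leak copies into the test data.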

Mining: Clustering and Aggregation
•  Visual aggregation
–  DBSCAN on the visual feature representation (PCA-reduced VLAD vectors)
–  Element (tweet) selected based on the largest number of keywords (expected to result in more information)
•  Entity aggregation
–  NER on individual items
–  Entity categorization (→ Persons, Locations, Organizations)
–  Entity ranking based on frequency of occurrence
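Two of the aggregation steps above are simple enough to sketch directly: choosing a representative tweet per cluster and ranking entities by frequency. The clustering itself (DBSCAN) and the NER step are assumed to have run already; function names and sample data are hypothetical.

```python
from collections import Counter

def pick_representative(cluster, keywords):
    """Return the tweet in a cluster that mentions the most distinct keywords."""
    def hits(tweet):
        return len(set(tweet.lower().split()) & keywords)
    return max(cluster, key=hits)

def rank_entities(items):
    """Count (entity, type) pairs over all items and sort by frequency."""
    counts = Counter(pair for entities in items for pair in entities)
    return counts.most_common()

cluster = ["flood in boston",
           "boston marathon explosion near finish line",
           "wow"]
representative = pick_representative(cluster, {"boston", "marathon", "explosion"})

ranking = rank_entities([
    [("Boston", "LOCATION")],
    [("Boston", "LOCATION"), ("FBI", "ORGANIZATION")],
])
```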

User Interface: Collections View

User Interface: Items View & Search

User Interface: Clusters View

User Interface: Entities View

Evaluation: NER
•  Manual annotation of 400 tweets from the SNOW Data Challenge dataset (Papadopoulos et al., 2014)
•  Measure: Accuracy → an instance is considered correct when both entity and type are correctly identified
•  Three competing solutions:
–  Base Stanford NER (S-NER)
–  S-NER + Extensions/Post-processing (S-NER+)
–  Ellogon library (http://www.ellogon.org)
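The accuracy measure described above can be written down as a short sketch, assuming gold and predicted annotations are aligned lists of (entity, type) pairs, with `None` for a missed entity. The alignment and the sample data are assumptions, not the evaluation script used in the talk.

```python
def strict_accuracy(gold, predicted):
    """Share of instances where both the entity text and its type match."""
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold = [("Boston", "LOCATION"), ("Obama", "PERSON"),
        ("FBI", "ORGANIZATION"), ("Sandy", "LOCATION")]
pred = [("Boston", "LOCATION"), ("Obama", "LOCATION"),  # right entity, wrong type
        ("FBI", "ORGANIZATION"), None]                  # missed entity
score = strict_accuracy(gold, pred)
```

Under this strict definition a correct entity with the wrong type counts as an error, so the score is a lower bound on plain entity-detection accuracy.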

Evaluation: NDS
•  Benchmark Datasets
–  Holidays: 1,491 images, 500 queries (Jegou et al., 2008)
–  Oxford: 5,063 images, 55 queries (Philbin et al., 2008)
–  Paris: 6,412 images, 55 queries (Philbin et al., 2008)
•  Accuracy: mean Average Precision (mAP)
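For readers unfamiliar with the metric, a minimal sketch of mean Average Precision follows. It uses the textbook definition; the official protocols of the benchmarks above add details (for example handling of junk images) that are not reproduced here, and the toy rankings are made up.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """Mean of the per-query average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

queries = [
    (["a", "x", "b"], {"a", "b"}),  # relevant images at ranks 1 and 3
    (["x", "c"], {"c"}),            # relevant image at rank 2
]
m_ap = mean_average_precision(queries)
```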

Use Cases: Real-world Datasets
sandy, boston, malaysia, ferry

NDS Use Case (boston)

Clustering Use Case (boston)
•  Visual clustering enables comparative view and analysis over time (in this case showing increasing confidence in the picture).
•  When journalists see many similar photos of the same scene, they have more confidence that it is real and not fabricated.

Entity Aggregation Use Case (snow)
LOCATIONS, PERSONS, ORGANIZATIONS

Image Forensics for Verification

Computational Verification in Social Media
•  Create a computational verification framework to classify tweets with unreliable media content.
•  Events used for experimentation
Fake images posted during the Hurricane Sandy natural disaster
Fake images posted during the Boston Marathon bombings

Goals/Contributions
•  Distinguish between fake and real content shared on Twitter using a supervised approach
•  Provide closer to reality estimates of automatic verification performance
•  Explore methodological issues with respect to evaluating classifier performance
•  Create reusable resources
–  Fake (and real) tweets (incl. images) corpus
–  Open-source implementation

Methodology
•  Corpus Creation
–  Topsy API
–  Near-duplicate image detection
•  Feature Extraction
–  Content-based features
–  User-based features
–  Link-based features
•  Classifier Building & Evaluation
–  Cross-validation
–  Independent photo sets
–  Cross-dataset training

Corpus Creation
•  Define a set of keywords K around an event of interest.
•  Use the Topsy API (keyword-based search) and keep only the tweets containing images, T.
•  Using independent online sources, define a set of fake images I_F and a set of real ones I_R.
•  Select the subset T_C ⊂ T of tweets that contain any of the images in I_F or I_R.
•  Use near-duplicate visual search (VLAD+SURF) to extend T_C with tweets that contain near-duplicate images.
•  Manually check that the returned near-duplicates indeed correspond to the images of I_F or I_R.
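The selection and extension of T_C can be sketched with plain data structures, assuming each tweet is a (tweet id, image id) pair and that the near-duplicate search has already produced a mapping from duplicate images to their reference image. All names and ids below are hypothetical; the manual check in the last step is not modelled.

```python
def select_corpus(tweets, fake_images, real_images, near_duplicates=None):
    """Label tweets whose image is, or nearly duplicates, a known fake or real image."""
    near_duplicates = near_duplicates or {}
    labelled = []
    for tweet_id, image_id in tweets:
        # Map a near-duplicate image back to its reference image, if any.
        ref = near_duplicates.get(image_id, image_id)
        if ref in fake_images:
            labelled.append((tweet_id, ref, "fake"))
        elif ref in real_images:
            labelled.append((tweet_id, ref, "real"))
    return labelled

T = [(1, "shark.jpg"), (2, "shark_crop.jpg"), (3, "street.jpg"), (4, "cat.jpg")]
T_C = select_corpus(T,
                    fake_images={"shark.jpg"},
                    real_images={"street.jpg"},
                    near_duplicates={"shark_crop.jpg": "shark.jpg"})
```

Keeping the reference image id on every labelled tweet is useful later, because the evaluation needs to know which tweets share an image.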

Features (verification handbook)

#   User Features
1   Username
2   Number of friends
3   Number of followers
4   Number of followers/number of friends
5   Number of times the user was listed
6   If the user's status contains URL
7   If the user is verified or not

#   Content Features
1   Length of the tweet
2   Number of words
3   Number of exclamation marks
4   Number of quotation marks
5   Contains emoticon (happy/sad)
6   Number of uppercase characters
7   Number of hashtags
8   Number of mentions
9   Number of pronouns
10  Number of URLs
11  Number of sentiment words
12  Number of retweets
13  Readability [1]

#   Link-based features
1   Web Of Trust score (WOT) [2]
2   In-degree and harmonic centralities [3]
3   Alexa rankings [4]

[1] Flesch reading ease method to compute a score in [0,100] range, 0 hard-to-read and 100 easy-to-read text
[2] A metric for how trustworthy a website is, based on user ratings
[3] Rankings computed based on the Web graph
[4] Alexa rankings, which evaluate the frequency of visits on various websites
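A few of the content and user features above can be computed from raw tweet text with the standard library alone. The sketch below covers only the purely lexical ones; the feature names, the regular expressions and the handling of zero friends are assumptions, and features needing external resources (sentiment lexicons, WOT, Alexa) are omitted.

```python
import re

def content_features(text):
    """Compute a subset of the content features from the tweet text."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_exclamation_marks": text.count("!"),
        "num_quotation_marks": text.count('"'),
        "num_uppercase": sum(1 for c in text if c.isupper()),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }

def follower_friend_ratio(followers, friends):
    """User feature 4; returning 0.0 for accounts with no friends is an assumption."""
    return followers / friends if friends else 0.0

features = content_features("Sharks in NYC!! #sandy @bbc http://t.co/abc")
```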

Training and Testing the Classifier
•  Care should be taken to make sure that no knowledge from the training set enters the test set.
•  This is NOT the case when using standard cross-validation.

The Problem with Cross-Validation
Training/Test tweets are randomly selected.
One of the reference fake images
Multiple tweets per reference image.

Independence of Training-Test Set
Training/Test tweets are constrained to correspond to different reference images.
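One way to enforce this constraint is to split on reference images rather than on tweets, so that all tweets sharing an image end up on the same side. The sketch below assumes each tweet is a (tweet id, reference image) pair; the function name, the test fraction and the seed are illustrative choices, not the setup used in the experiments.

```python
import random

def split_by_reference_image(tweets, test_fraction=0.3, seed=0):
    """Split (tweet_id, reference_image) pairs so no image appears on both sides."""
    images = sorted({img for _, img in tweets})
    random.Random(seed).shuffle(images)
    n_test = max(1, round(len(images) * test_fraction))
    test_images = set(images[:n_test])
    train = [t for t in tweets if t[1] not in test_images]
    test = [t for t in tweets if t[1] in test_images]
    return train, test

tweets = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c"), (6, "d")]
train, test = split_by_reference_image(tweets)
```

With a purely random tweet-level split, tweets 1 and 2 could land on opposite sides even though they show the same image, which is exactly the leakage the slide warns about.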

Cross-dataset Training-Testing
•  In the most unfavourable case, the dataset used for training should refer to a different event than the one used for testing.
•  Simulates real-world scenario of a breaking story, where no prior information is available to news professionals.
•  Variants:
–  Different event, same domain
–  Different event, different domain (very challenging!)

Evaluation
•  Datasets
–  Hurricane Sandy
–  Boston Marathon bombings
•  Evaluation of two sets of features (content/user)
•  Evaluation of different classifier settings

Dataset – Hurricane Sandy
A natural disaster that affected the USA from October 22nd to 31st, 2012. Fake images and content, such as sharks inside New York and a flooded Statue of Liberty, went viral.

Keywords          Hashtags
Hurricane Sandy   #hurricaneSandy
Hurricane         #hurricane
Sandy             #Sandy

Dataset – Boston Marathon Bombings
The bombings occurred on 15 April 2013 during the Boston Marathon, when two pressure cooker bombs exploded at 2:49 pm EDT, killing three people and injuring an estimated 264 others.

Keywords          Hashtags
Boston Marathon   #bostonMarathon
Boston bombings   #bostonbombings
Boston suspect    #bostonSuspect
manhunt           #manhunt
watertown         #watertown
Tsarnaev          #Tsarnaev
4chan             #4chan
Sunil Tripathi    #prayForBoston

Dataset Statistics

                               Hurricane Sandy   Boston Marathon
Tweets with other image URLs   343939            112449
Tweets with fake images        10758             281
Tweets with real images        3540              460

[Pie charts: share of tweets with fake images, real images and other image URLs per dataset; for Hurricane Sandy 3%, 1% and 96% respectively]

Prediction accuracy (1)
•  10-fold cross validation results using different classifiers: ~80%
[Bar charts: accuracy (0 to 100) of the J48 decision tree, KStar and Random Forest classifiers with Content, User and Total features, on the Boston Marathon and Hurricane Sandy datasets]

Prediction accuracy (2)
•  Results using different training and testing sets from the Hurricane Sandy dataset: ~75%
•  Results using Hurricane Sandy for training and Boston Marathon for testing: ~58%
[Bar charts: accuracy (0 to 100) of the Random Forest, KStar and J48 decision tree classifiers with Content, User and Total features]
Separate classifiers might be built for certain types of incidents

Sample Results
•  Real tweet
My friend's sister's Trampolene in Long Island. #HurricaneSandy
Classified as real
•  Real tweet
23rd street repost from @wendybarton #hurricanesandy #nyc
Classified as fake
•  Fake tweet
Sharks in people's front yard #hurricane #sandy #bringing #sharks #newyork #crazy http://t.co/PVewUIE1
Classified as fake
•  Fake tweet
Statue of Liberty + crushing waves. http://t.co/7F93HuHV #hurricaneparty #sandy
Classified as real

Sample fake and real images in Sandy
•  Fake pictures shared on social media

•  Real pictures shared on social media

Reusable results
•  Computational verification
–  Dataset: https://github.com/MKLab-ITI/image-verification-corpus
–  Code: https://github.com/socialsensor/computational-verification
•  The Wild Web Tampered Image Dataset
–  80 confirmed digital forgeries, 10,870 images, Ground truth binary masks
–  Dataset: https://mklab.iti.gr/project/wild-web-tampered-image-dataset
•  The Deutsche Welle Tampered Image Dataset
–  6 original images, 3 image sources, 7 different modified versions
–  Surprisingly tough to crack using the state-of-the-art
–  Dataset: https://revealproject.eu/the-deutsche-welle-image-forensics-dataset/
•  Open-source projects (Apache License v2): https://github.com/socialsensor
–  Data collection (stream-manager, storm-focused-crawler)
–  Indexing (framework-client, multimedia-indexing)
–  Mining (topic-detection, multimedia-analysis, community-evolution-analysis, social-event-detection)

Contributions
•  Dr. Symeon Papadopoulos
–  Social network analysis, social media content mining and multimedia indexing and retrieval
–  http://mklab.iti.gr/people/papadop
–  Twitter: @sympap
•  Dr. Zampoglou Markos
–  Web multimedia verification, image forensics for verification
–  markzampoglou@iti.gr
•  Boididou Christina
–  computational approaches for verification
–  boididou@iti.gr

Support
EU funded projects
Tools and services for Social Media verification from a journalistic and enterprise perspective.
Knowledge verification platform to detect emerging stories and assess the reliability of newsworthy video files and content spread via social media

Conclusions
•  Social media data useful in many applications
–  From confirming existing and known correlations to prediction and decision-making
•  Many challenges exist
–  Data availability (infrastructure, policies)
–  Personal data value (legal, ethical)
–  Real-time and scalable approaches
–  Fusion of various modalities (Content, social, temporal, location)
•  Verification requires contribution from various disciplines
–  Content Analytics
–  Machine Learning
–  Network Analysis
–  Psychology – Social Sciences (patterns of presentation, sharing)
–  Visualization

Thank you for your attention!
ikom@iti.gr
http://mklab.iti.gr
