As grassroots and social media-based journalism becomes more widespread, verifying the information that comes from such channels becomes imperative. This talk explores the challenges of computational verification of social media, that is, automatically classifying unreliable media content as fake or real. After presenting a generic conceptual architecture, the talk focuses on tweets around big events that link to images (fake or real) whose reliability could be verified by independent online sources. The talk also demonstrates the REVEALr platform, a scalable and efficient content-based media crawling and indexing framework featuring a novel and resilient near-duplicate detection approach and intelligent content- and context-based aggregation capabilities (e.g. clustering, named entity extraction).
3rd International Conference on Internet Science
INSCI 2016
Social Media Verification
Policy – Licensing – Legal challenges
• Fragmented access to data
– Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
– Different data collection/crawling policies
• Limitations imposed by API providers (“Walled Gardens”)
• Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
• Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact at Twitter)
• Constant change of model and ToS of social APIs
– No backwards compatibility, additional development costs
• Ephemeral nature of content
• Social search results often lead to removed content → inconsistent and unreliable referencing
• User Privacy & Purpose of use
• Fuzzy regulatory framework regarding mining user-contributed data
Features (Verification Handbook)
# User Features
1 Username
2 Number of friends
3 Number of followers
4 Number of followers/number of friends
5 Number of times the user was listed
6 Whether the user’s status contains a URL
7 If the user is verified or not
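The user features above can be sketched as a small extraction function. This is a minimal illustration, not the method used in the talk: the input field names (`screen_name`, `friends_count`, `followers_count`, `listed_count`, `verified`) are assumed to follow the style of a Twitter user record, `status_text` is a hypothetical field holding the user's latest status, and returning 0.0 for the ratio when an account has no friends is an arbitrary choice made here.

```python
from typing import Any, Dict


def user_features(user: Dict[str, Any]) -> Dict[str, Any]:
    """Map a user record to the seven user features listed above.

    All input keys are assumptions for illustration; adapt them to the
    actual schema of the data source.
    """
    friends = int(user.get("friends_count", 0))
    followers = int(user.get("followers_count", 0))
    # "status_text" is a hypothetical field with the user's latest status.
    status = user.get("status_text", "") or ""
    # The ratio is undefined for accounts with no friends; 0.0 is a placeholder.
    ratio = followers / friends if friends > 0 else 0.0
    return {
        "username": user.get("screen_name", ""),
        "num_friends": friends,
        "num_followers": followers,
        "follower_friend_ratio": ratio,
        "times_listed": int(user.get("listed_count", 0)),
        "status_has_url": "http://" in status or "https://" in status,
        "is_verified": bool(user.get("verified", False)),
    }
```

A classifier would typically consume these values as one part of a larger feature vector, alongside the content and link-based features.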
# Content Features
1 Length of the tweet
2 Number of words
3 Number of exclamation marks
4 Number of quotation marks
5 Contains emoticon (happy/sad)
6 Number of uppercase characters
7 Number of hashtags
8 Number of mentions
9 Number of pronouns
10 Number of URLs
11 Number of sentiment words
12 Number of retweets
13 Readability [1]
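Several of the content features above are simple counts over the tweet text and can be sketched with regular expressions. The following is an illustrative sketch only: the regular expressions and the short pronoun list are assumptions made here, and the features that need external resources or metadata (emoticons, sentiment words, retweet count, readability) are left out.

```python
import re
from typing import Dict

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Small illustrative pronoun list (an assumption); a fuller lexicon or a
# part-of-speech tagger would be more accurate.
PRONOUNS = {
    "i", "you", "he", "she", "it", "we", "they", "me", "him", "her",
    "us", "them", "my", "your", "his", "its", "our", "their",
}


def content_features(text: str) -> Dict[str, int]:
    """Compute count-based content features for one tweet text."""
    # Lowercased alphabetic tokens with URLs removed, used for pronoun lookup.
    tokens = re.findall(r"[a-z']+", URL_RE.sub(" ", text).lower())
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_exclamation_marks": text.count("!"),
        "num_quotation_marks": text.count('"'),
        "num_uppercase": sum(1 for c in text if c.isupper()),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_mentions": len(MENTION_RE.findall(text)),
        "num_pronouns": sum(1 for t in tokens if t in PRONOUNS),
        "num_urls": len(URL_RE.findall(text)),
    }
```

Whitespace splitting is used for the word count, so hashtags, mentions and URLs each count as one word; a different tokenizer would give different counts.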
# Link-based features
1 Web of Trust (WOT) score [2]
2 In-degree and harmonic centralities [3]
3 Alexa rankings [4]
[1] Flesch reading ease method, which computes a score in the range [0, 100], where 0 is hard-to-read and 100 is easy-to-read text
[2] A metric for how trustworthy a website is, based on user ratings
[3] Rankings computed based on the Web graph
[4] Alexa rankings, which evaluate the frequency of visits to various websites
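The Flesch reading ease score mentioned in the readability footnote can be sketched as follows. It uses the standard formula 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is a rough vowel-group heuristic chosen here for illustration, and the clamping to [0, 100] is an assumption made to match the range stated on the slide, since the raw formula can fall outside it.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease, clamped to [0, 100] (0 = hard, 100 = easy)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # No scorable text; 0.0 is an arbitrary fallback.
    syllables = sum(count_syllables(w) for w in words)
    score = (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
    return max(0.0, min(100.0, score))
```

Tweets are short and often lack sentence punctuation, so the sentence split may treat a whole tweet as a single sentence; the score is best read as a coarse signal.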