MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims


EMNLP 2019 paper: https://arxiv.org/abs/1909.03242

We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Further, we present results for automatic veracity prediction, both with established baselines and with a novel method for joint ranking of evidence pages and predicting veracity that outperforms all baselines. Significant performance increases are achieved by encoding evidence, and by modelling metadata.
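The idea of jointly ranking evidence pages and predicting veracity can be illustrated with a small sketch. This is not the paper's architecture: the dot-product relevance score, the softmax weighting, the linear label scorer, and all names and vectors below are assumptions chosen only to show how one set of ranking weights can feed the veracity decision.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_evidence(claim_vec, evidence_vecs):
    """Return one ranking weight per evidence page (weights sum to 1)."""
    # Assumption: relevance is the claim-evidence dot product.
    return softmax([dot(claim_vec, ev) for ev in evidence_vecs])


def predict_veracity(claim_vec, evidence_vecs, label_weights):
    """Return (predicted label, evidence ranking weights)."""
    weights = rank_evidence(claim_vec, evidence_vecs)
    # Fuse the evidence pages into one vector, weighted by ranking score.
    fused = [
        sum(w * ev[i] for w, ev in zip(weights, evidence_vecs))
        for i in range(len(claim_vec))
    ]
    # Concatenate claim and fused evidence, then score each label linearly.
    features = list(claim_vec) + fused
    logits = {label: dot(w, features) for label, w in label_weights.items()}
    return max(logits, key=logits.get), weights


# Toy inputs (hypothetical): one claim, three evidence pages, two labels.
claim = [1.0, 0.0]
evidence = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
label_weights = {"true": [0.0, 0.0, 1.0, 0.0], "false": [0.0, 0.0, -1.0, 0.0]}
label, weights = predict_veracity(claim, evidence, label_weights)
print(label, [round(w, 3) for w in weights])
```

Because the same weights both order the evidence pages and shape the fused vector, a training signal on the veracity label would also influence the ranking, which is the sense of "joint" intended in this sketch.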


Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, Jakob Grue Simonsen
University of Copenhagen
{augenstein | c.lioma | wang | lcl | c.hansen | chrh | simonsen}@di.ku.dk

Contributions
• Novel fact checking dataset
    • Largest such with naturally occurring claims
    • 34,918 claims from 26 English fact checking portals
    • Rich additional meta-data
    • 10 evidence pages per claim
• Joint veracity prediction and evidence ranking model
    • Treats claims from different portals as different tasks / domains
    • Encodes disparate labels with label embeddings

Poster panels (figures and tables not reproduced here): Joint Veracity Prediction & Evidence Ranking; Claims in MultiFC; Entities in Claims; Fact Checking Portals; Overall Results; Confusion Matrix

Dataset Download
https://copenlu.github.io/publication/2019_emnlp_augenstein/

Error Analysis
• Meta-data: topic tags most important, entities least important
• Correctly predicting ‘true’ claims is much easier than ‘false’ ones
• Most confusions happen over close labels
• Longer claims are harder to classify correctly
• High token overlap of claims & evidence → high evidence ranking
• General topic tags frequently co-occur with incorrect predictions; more specific tags often co-occur with correct predictions
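The poster's point about treating portals as separate tasks and encoding their disparate labels with label embeddings can be sketched as follows. This is a minimal illustration under stated assumptions: the portal names, label names, and embedding values are all hypothetical (not taken from the dataset), and scoring by dot product against only the current portal's labels is one plausible reading, not a confirmed description of the authors' model.

```python
def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical shared embedding space: every label from every portal
# gets a vector, so related labels from different portals can sit close together.
LABEL_EMBEDDINGS = {
    "true": [1.0, 0.0],
    "half true": [0.7, 0.3],
    "false": [-1.0, 0.0],
    "fabricated": [-0.9, -0.4],
}

# Hypothetical per-portal label inventories (each portal is its own task).
PORTAL_LABELS = {
    "portal_a": ["true", "half true", "false"],
    "portal_b": ["true", "false", "fabricated"],
}


def predict_label(instance_vec, portal):
    """Score an instance against the labels its portal actually uses."""
    candidates = PORTAL_LABELS[portal]
    scores = {
        label: dot(instance_vec, LABEL_EMBEDDINGS[label]) for label in candidates
    }
    return max(scores, key=scores.get)


# The same instance vector maps to different labels depending on the portal,
# because each portal offers a different label set.
print(predict_label([-1.0, -1.0], "portal_a"))
print(predict_label([-1.0, -1.0], "portal_b"))
```

The design choice illustrated here is that labels are compared in one shared vector space rather than through a separate output layer per portal, so the model does not need the label sets to line up across portals.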

