Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, which places high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it is frequently vandalized, exposing all information systems that use it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015, where it achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art, represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
http://www.heindorf.me/wdvd
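For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen regular revision. The following is a minimal sketch of that definition, not the evaluation code used in the paper; the function name and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = vandalism, 0 = regular revision.
toy_labels = [0, 0, 1, 0, 1, 0]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.90, 0.05]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic and only meant to make the definition visible; on a corpus with millions of revisions a sort-based implementation would be needed.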
18. Why is it a problem?
Patrolling, Reverting, Warning, Blocking, Protecting
• Over 2 million manual edits per month
• A lot of tedious work
• Vandalism is not detected in time
Stefan Heindorf, Vandalism Detection in Wikidata
19. Research Question
How to detect damaging changes to crowdsourced knowledge bases?
24. Our Approach
1. Label dataset: Vandalism Corpus [SIGIR ’15]
2. Study vandalism characteristics: 47 features
3. Experiment with ML: multiple-instance learning
4. Compare with state of the art: 2 baselines
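The slides name multiple-instance learning but do not spell out how it is applied. As a rough sketch of the general idea only, the code below groups revisions into bags and replaces each revision's score with the mean score of its bag. The grouping key (same user, same item), the mean aggregation, and all names are assumptions made for illustration, not the configuration reported in the paper.

```python
from collections import defaultdict


def session_scores(revisions, revision_score):
    """Average per-revision scores within each (user, item) bag.

    Assumption: a bag is every revision by one user on one item. A real
    sessionization would likely also require the revisions to be consecutive.
    """
    bags = defaultdict(list)
    for rev in revisions:
        bags[(rev["user"], rev["item"])].append(revision_score(rev))
    bag_mean = {key: sum(vals) / len(vals) for key, vals in bags.items()}
    return [bag_mean[(rev["user"], rev["item"])] for rev in revisions]


# Hypothetical revisions with precomputed single-revision scores.
toy_revisions = [
    {"user": "a", "item": "Q1", "score": 0.9},
    {"user": "a", "item": "Q1", "score": 0.1},
    {"user": "b", "item": "Q2", "score": 0.2},
]
smoothed = session_scores(toy_revisions, lambda rev: rev["score"])
```

Sharing evidence within a bag lets a clearly suspicious revision raise the score of its less conspicuous neighbors in the same editing session.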
35. Corpus [SIGIR ’15]
[Figure: Revisions over time, by month, shown separately for item head (1.3% vandalism) and item body (0.2% vandalism); the timeline is divided into training, validation, and test periods]
103,000 vandalism revisions
24 million manual revisions
0.4% vandalism
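The figure divides the timeline into training, validation, and test periods. A chronological split of this kind could be sketched as follows; the cutoff dates, field names, and toy revisions are hypothetical placeholders, since the slides do not state the actual boundaries.

```python
from datetime import date


def chronological_split(revisions, validation_start, test_start):
    """Assign each revision to train, validation, or test by its date."""
    train, validation, test = [], [], []
    for rev in revisions:
        if rev["date"] < validation_start:
            train.append(rev)
        elif rev["date"] < test_start:
            validation.append(rev)
        else:
            test.append(rev)
    return train, validation, test


# Hypothetical revisions and cutoffs, for illustration only.
toy_timeline = [
    {"id": 1, "date": date(2014, 3, 1)},
    {"id": 2, "date": date(2015, 1, 15)},
    {"id": 3, "date": date(2015, 7, 2)},
    {"id": 4, "date": date(2015, 9, 20)},
]
train_set, validation_set, test_set = chronological_split(
    toy_timeline,
    validation_start=date(2015, 7, 1),
    test_start=date(2015, 9, 1),
)
```

Splitting by time rather than at random keeps later revisions out of the training data, which mirrors how a detector would be deployed.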
36. Features (47 in total)
Content Features
11 Character features (e.g., lowerCaseRatio, digitRatio)
9 Word features (e.g., badWordRatio)
4 Sentence features (e.g., commentSitelinkSimilarity)
3 Statement features (e.g., propertyFrequency)
Context Features
10 User features (e.g., userCountry)
2 Item features (e.g., logItemFrequency)
8 Revision features (e.g., revisionTag, revisionLanguage)
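To give a feel for the content features, here is a minimal sketch of how three of the named ones might be computed from the text of an edit. The slides give only the feature names, so the exact definitions below, the tokenization, and the placeholder bad-word list are assumptions.

```python
import re

# Hypothetical placeholder list; a real detector would use a curated lexicon.
BAD_WORDS = {"stupid", "idiot"}


def lower_case_ratio(text):
    """Share of characters that are lowercase letters."""
    return sum(c.islower() for c in text) / len(text) if text else 0.0


def digit_ratio(text):
    """Share of characters that are digits."""
    return sum(c.isdigit() for c in text) / len(text) if text else 0.0


def bad_word_ratio(text):
    """Share of word tokens that appear in the bad-word list."""
    words = re.findall(r"\w+", text.lower())
    return sum(w in BAD_WORDS for w in words) / len(words) if words else 0.0


# Example inputs (made up): a plausible label and an abusive one.
label_features = (lower_case_ratio("Berlin 2015"), digit_ratio("Berlin 2015"))
abusive_ratio = bad_word_ratio("You stupid idiot")
```

Each function returns 0.0 for empty input so that revisions without text still yield a well-defined feature value.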
41. Features (47 in total)
Feature: revisionTag
revisionTag        | Vandalism (thousands) | Total (thousands) | Vandalism probability
Rev. with tags     | 52                    | 8,619             | 0.60%
  By abuse filter  | 49                    | 122               | 39.90%
  By editing tools | 3                     | 8,496             | 0.03%
Rev. w/o tags      | 52                    | 15,386            | 0.34%
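The probability column is the share of vandalism among all revisions in each row. The sketch below recomputes it from the rounded counts on the slide; because those counts are rounded to thousands, the results can differ slightly from the slide's percentages, which were presumably derived from exact counts. The variable names are illustrative.

```python
# (vandalism revisions, total revisions), both in thousands, as shown on the slide.
COUNTS_THOUSANDS = {
    "Rev. with tags": (52, 8619),
    "By abuse filter": (49, 122),
    "By editing tools": (3, 8496),
    "Rev. w/o tags": (52, 15386),
}

# Empirical vandalism probability per row: vandalism / total.
tag_probabilities = {
    tag: vandalism / total
    for tag, (vandalism, total) in COUNTS_THOUSANDS.items()
}

for tag, probability in tag_probabilities.items():
    print(f"{tag:<18} {probability:.2%}")
```

Revisions tagged by the abuse filter are far more likely to be vandalism than any other group, which is why the tag is informative as a feature.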
67. Conclusion and Outlook
Conclusion
• Vandalism: Concentration on item heads (currently)
• Features: Content & Context
• Model: Multiple-instance learning
• PR-AUC: 0.491
• ROC-AUC: 0.991
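PR-AUC summarizes the precision-recall trade-off. When positives are rare, even a small false-positive rate produces many false alarms relative to true hits, so PR-AUC can be far lower than ROC-AUC. One common estimator is average precision, sketched below; the slides do not say which estimator was used, and the toy data and names are made up.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical toy data: 1 = vandalism, 0 = regular revision.
pr_labels = [0, 0, 1, 0, 1, 0]
pr_scores = [0.10, 0.40, 0.35, 0.20, 0.90, 0.05]
toy_ap = average_precision(pr_labels, pr_scores)
```

In this toy example a single regular revision ranked above a vandalism revision already pulls the precision at the second hit down to 2/3.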
Outlook
• Goal: Better detection (on item bodies)
• Idea: Double-check with other sources
Acknowledgement
• German Research Foundation (DFG)
• SIGIR Student Travel Grant
Code + Data: http://www.heindorf.me/wdvd.html
Join the competition:
Vandalism Detection @WSDM Cup 2017
http://www.wsdm-cup-2017.org/
Thank you!