Kick-off seminar of the largest Wikimedia IEG (Individual Engagement Grant) in the 2015 round 2 call.
In conjunction with Wikipedia's 15th birthday.
Project page: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
9. WHY
THE CRITICAL ISSUE
▸ Reliability of content across Wikimedia projects
▸ Trust needed on the content addition process
▸ Mature in Wikipedia, but what about Wikidata?
10. WHY
THE CRITICAL ISSUE
▸ StrepHit = novel, automatic process
▸ Generates trust and reliability over Wikidata content
▸ Alleviates the burden of manual curation
12. WHY
THE TECHNICAL PROBLEM
▸ Content should be validated against third-party resources
▸ References to external authoritative sources
▸ Ensure at least one reference for each piece of data
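A minimal sketch of the "at least one reference for each piece of data" check. The `Statement` class, its fields, and the sample values are hypothetical and only illustrate the idea; they are neither the Wikidata data model nor StrepHit code.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical, simplified piece of data: subject, property, value."""
    subject: str
    prop: str
    value: str
    # URLs of external sources backing the statement (may be empty).
    references: list = field(default_factory=list)


def missing_references(statements):
    """Return the statements that have no reference at all."""
    return [s for s in statements if not s.references]


# Illustrative, made-up data.
statements = [
    Statement("Jane Doe", "place of birth", "Trento",
              references=["https://example.org/bio/jane-doe"]),
    Statement("Jane Doe", "occupation", "librarian"),
]

to_validate = missing_references(statements)
for s in to_validate:
    print(f"No reference: {s.subject} / {s.prop} / {s.value}")
```

Statements returned by `missing_references` are the ones that would still need an external authoritative source.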
13. HOW?
‣ INPUT = PRIMARY SOURCES CORPUS
‣ OUTPUT = DATASET FOR WIKIDATA
‣ AUTHENTICATE EXISTING CONTENT
‣ PROPOSE NOVEL CONTENT
‣ VIA REFERENCES TO SUCH SOURCES
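The input/output flow above can be sketched as follows, assuming claims have already been extracted from the corpus together with the URL of the source they came from. The `Claim` class, the `build_dataset` function, and all sample values are hypothetical illustrations, not the project's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim: subject, property, value."""
    subject: str
    prop: str
    value: str


def build_dataset(extracted, existing):
    """Split extracted claims into authenticated and proposed ones.

    extracted: iterable of (Claim, source_url) pairs from the corpus.
    existing:  set of Claim objects already present in the knowledge base.
    Every output row carries the source URL as its reference.
    """
    authenticated, proposed = [], []
    for claim, source_url in extracted:
        row = {"claim": claim, "reference": source_url}
        if claim in existing:
            authenticated.append(row)  # existing content, now referenced
        else:
            proposed.append(row)       # novel content, with its reference
    return authenticated, proposed


# Illustrative, made-up data.
existing = {Claim("Jane Doe", "place of birth", "Trento")}
extracted = [
    (Claim("Jane Doe", "place of birth", "Trento"),
     "https://example.org/bio/jane-doe"),
    (Claim("Jane Doe", "occupation", "librarian"),
     "https://example.org/bio/jane-doe"),
]

authenticated, proposed = build_dataset(extracted, existing)
print(len(authenticated), "authenticated,", len(proposed), "proposed")
```

Exact matching against `existing` is a simplification made for this sketch; real data would likely need entity and value reconciliation before the comparison.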
15. HOW
MAIN TASKS
1. Sources selection
2. Corpus harvesting
3. Corpus analysis
4. Frame repository selection
5. Training set construction
6. Frame extraction
7. Dataset production
18. FIRST STEP
BIOGRAPHIES
▸ plenty of existing data
▸ broad coverage
▸ potentially easy to find valuable primary sources
LIBRARIANS, WHAT DO YOU THINK?
19. FIRST STEP
COMPANIES
▸ relatively biased domain
▸ ad-prone content
▸ companies edit the pages about themselves
▸ low-quality data
20. FIRST STEP
BIOMEDICAL
▸ great primary source
▸ PubMed: scientific papers
▸ proof of usage for an Open Access corpus
21. OPEN DISCUSSION
DOMAIN + SOURCES SELECTION
MARCO FOSSATI - HJFOCS - FOSSATI@FBK.EU
TRENTO, 15TH JANUARY 2016
THIS WORK IS LICENSED UNDER A CC BY-SA 4.0 LICENSE
https://pad.okfn.org/p/strephit