StrepHit IEG Kick-off Seminar


Kick-off seminar of the largest Wikimedia IEG, 2015 round 2 call.
Held in conjunction with Wikipedia's 15th birthday.
Project page: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References


  1. STREPHIT: A WIKIMEDIA FOUNDATION IEG PROJECT. MARCO FOSSATI - HJFOCS - FOSSATI@FBK.EU. TRENTO, 15TH JANUARY 2016
  2. PREAMBLE: HAPPY BIRTHDAY, WIKIPEDIA!
  3. PREAMBLE: INDIVIDUAL ENGAGEMENT GRANTS
  4. PREAMBLE: THE FREE KNOWLEDGE BASE THAT ANYONE CAN EDIT
  5. WHO? MARCO FOSSATI, EMILIO DORIGATTI
  6. WHO? ‣ ADVISOR: CLAUDIO GIULIANO ‣ VOLUNTEERS: AUVA87, BOLIOLIANDREA, DANROK, NISPRATEEK, PROJEKT ANA, VLADIMIR ALEXIEV
  7. WHAT? ‣ IS AN NLP PIPELINE ‣ HARVESTS STRUCTURED DATA FROM RAW TEXT ‣ PRODUCES WIKIDATA CONTENT WITH REFERENCES
  8. WHY? 1. THE CRITICAL ISSUE 2. THE VISION 3. THE TECHNICAL PROBLEM
  9. WHY: THE CRITICAL ISSUE ▸ Reliability of content across Wikimedia projects ▸ Trust needed in the content addition process ▸ Mature in Wikipedia, but what about Wikidata?
  10. WHY: THE CRITICAL ISSUE ▸ StrepHit = novel, automatic process ▸ Generates trust and reliability over Wikidata content ▸ Alleviates the burden of manual curation
  11. WHY: THE VISION ▸ Wikidata as a central Open Data hub
  12. WHY: THE TECHNICAL PROBLEM ▸ Content should be validated against third-party resources ▸ References to external authoritative sources ▸ Ensure at least one reference for each piece of data
  13. HOW? ‣ INPUT = PRIMARY SOURCES CORPUS ‣ OUTPUT = DATASET FOR WIKIDATA ‣ AUTHENTICATE EXISTING CONTENT ‣ PROPOSE NOVEL CONTENT ‣ VIA REFERENCES TO SUCH SOURCES
  14. HOW? ‣ LEXICOGRAPHICAL ANALYSIS ‣ RELATION EXTRACTION ‣ FRAME SEMANTICS ‣ MACHINE LEARNING
  15. HOW: MAIN TASKS 1. Source selection 2. Corpus harvesting 3. Corpus analysis 4. Frame repository selection 5. Training set construction 6. Frame extraction 7. Dataset production
  16. WHERE? PRIMARY SOURCES TOOL
  17. FIRST STEP: which domain? A. BIOGRAPHIES B. COMPANIES C. BIOMEDICAL. THANKS NEMO FOR OUR PRECIOUS CONVERSATION
  18. FIRST STEP: BIOGRAPHIES ▸ plenty of existing data ▸ broad coverage ▸ potentially easy to find valuable primary sources. LIBRARIANS, WHAT DO YOU THINK?
  19. FIRST STEP: COMPANIES ▸ relatively biased domain ▸ ad-prone content ▸ the company edits the page on the company itself ▸ low-quality data
  20. FIRST STEP: BIOMEDICAL ▸ great primary source ▸ PubMed: scientific papers ▸ proof of usage for an Open Access corpus
  21. OPEN DISCUSSION: DOMAIN + SOURCES SELECTION. MARCO FOSSATI - HJFOCS - FOSSATI@FBK.EU. TRENTO, 15TH JANUARY 2016. THIS WORK IS LICENSED UNDER A CC BY SA 4.0 LICENSE. https://pad.okfn.org/p/strephit
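
The rule on slide 12, at least one reference for each piece of data, can be pictured with a small sketch. This is not StrepHit code: the `Statement` class, its field names, and the `split_by_reference` helper are hypothetical, and the reference URL is a placeholder. It only shows how statements backed by a source could be separated from those that still need one.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical stand-in for one piece of Wikidata content."""
    subject: str  # item ID, e.g. "Q42"
    prop: str     # property ID, e.g. "P569" (date of birth)
    value: str
    references: list = field(default_factory=list)  # source URLs (placeholders here)


def split_by_reference(statements):
    """Separate statements with at least one reference from those with none."""
    referenced = [s for s in statements if s.references]
    unreferenced = [s for s in statements if not s.references]
    return referenced, unreferenced


statements = [
    Statement("Q42", "P569", "1952-03-11", ["https://example.org/biography"]),
    Statement("Q42", "P19", "Cambridge"),  # no source attached yet
]
referenced, unreferenced = split_by_reference(statements)
print(len(referenced), "referenced,", len(unreferenced), "still need a source")
```

In this picture, the unreferenced group is what a pipeline like the one described on slide 13 would try to authenticate against a corpus of primary sources.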