STREPHIT
A WIKIMEDIA FOUNDATION
IEG PROJECT
MARCO FOSSATI - HJFOCS - FOSSATI@FBK.EU
TRENTO, 15TH JANUARY 2016
HAPPY BIRTHDAY,
WIKIPEDIA!
Preamble
INDIVIDUAL
ENGAGEMENT
GRANTS
THE FREE
KNOWLEDGE BASE
THAT ANYONE CAN EDIT
MARCO FOSSATI
EMILIO DORIGATTI
WHO?
WHO?
‣ ADVISOR: CLAUDIO GIULIANO
‣ VOLUNTEERS:
‣ AUVA87, BOLIOLIANDREA, DANROK,
NISPRATEEK, PROJEKT ANA,
VLADIMIR ALEXIEV
WHAT?
‣ IS AN NLP PIPELINE
‣ HARVESTS STRUCTURED DATA FROM
RAW TEXT
‣ PRODUCES WIKIDATA CONTENT WITH
REFERENCES
WHY?
1. THE CRITICAL ISSUE
2. THE VISION
3. THE TECHNICAL PROBLEM
▸ Reliability of content across Wikimedia
projects
▸ Trust is needed in the content addition
process
▸ Mature in Wikipedia, but what about
Wikidata?
WHY
THE CRITICAL ISSUE
WHY
THE CRITICAL ISSUE
▸ StrepHit = novel, automatic process
▸ Builds trust in, and reliability of,
Wikidata content
▸ Alleviates the burden of manual
curation
WHY
THE VISION
▸ Wikidata as a
central Open
Data hub
WHY
THE TECHNICAL PROBLEM
▸ Content should be validated against
third-party resources
▸ References to external authoritative
sources
▸ Ensure at least one reference for each
piece of data
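The last bullet can be read as a simple check over a dataset. The following is a minimal sketch, assuming a hypothetical `Statement` record that carries a list of reference URLs. The identifiers and the URL are placeholders, not real Wikidata IDs, and none of this is taken from the StrepHit code base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """Hypothetical record for one piece of data (not a Wikidata API type)."""
    subject: str
    prop: str
    value: str
    references: List[str] = field(default_factory=list)


def unreferenced(statements: List[Statement]) -> List[Statement]:
    """Return the statements that have no reference at all."""
    return [s for s in statements if not s.references]


# Placeholder data: one statement with a reference, one without.
statements = [
    Statement("Q_EXAMPLE_PERSON", "P_PLACE_OF_BIRTH", "Q_EXAMPLE_CITY",
              ["https://example.org/biography/1"]),
    Statement("Q_EXAMPLE_PERSON", "P_OCCUPATION", "Q_EXAMPLE_JOB"),
]

missing = unreferenced(statements)
for s in missing:
    print(f"needs a reference: {s.subject} {s.prop} {s.value}")
```

Any statement returned by `unreferenced` would be a candidate for validation against a third-party source.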
HOW?
‣ INPUT = PRIMARY SOURCES CORPUS
‣ OUTPUT = DATASET FOR WIKIDATA
‣ AUTHENTICATE EXISTING CONTENT
‣ PROPOSE NOVEL CONTENT
‣ VIA REFERENCES TO SUCH SOURCES
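The two goals above, authenticating existing content and proposing novel content, can be sketched as a split over extracted claims. This is an illustrative sketch only: the tuple layout, the function name `split_claims`, and all identifiers are assumptions, not the project's actual data model.

```python
from typing import Iterable, List, Set, Tuple

Claim = Tuple[str, str, str]             # (subject, property, value)
SourcedClaim = Tuple[str, str, str, str]  # claim plus the source URL it came from


def split_claims(
    extracted: Iterable[SourcedClaim],
    existing: Set[Claim],
) -> Tuple[List[SourcedClaim], List[SourcedClaim]]:
    """Separate claims already present (to be authenticated with a
    reference) from claims not yet present (to be proposed as new)."""
    authenticated: List[SourcedClaim] = []
    novel: List[SourcedClaim] = []
    for subject, prop, value, source in extracted:
        bucket = authenticated if (subject, prop, value) in existing else novel
        bucket.append((subject, prop, value, source))
    return authenticated, novel


# Placeholder claims extracted from a hypothetical primary source.
extracted = [
    ("Q_PERSON", "P_PLACE_OF_BIRTH", "Q_CITY", "https://example.org/bio/1"),
    ("Q_PERSON", "P_EMPLOYER", "Q_ORG", "https://example.org/bio/1"),
]
# Placeholder set of claims assumed to be in the knowledge base already.
existing = {("Q_PERSON", "P_PLACE_OF_BIRTH", "Q_CITY")}

authenticated, novel = split_claims(extracted, existing)
print(len(authenticated), "to authenticate;", len(novel), "to propose")
```

In both buckets every claim keeps its source URL, so each one reaches the output dataset with a reference attached.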
HOW?
‣ LEXICOGRAPHICAL ANALYSIS
‣ RELATION EXTRACTION
‣ FRAME SEMANTICS
‣ MACHINE LEARNING
HOW
MAIN TASKS
1. Source selection
2. Corpus harvesting
3. Corpus analysis
4. Frame repository selection
5. Training set construction
6. Frame extraction
7. Dataset production
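To make task 6 more concrete, here is a toy illustration of what a frame extraction step produces. It is a sketch under strong assumptions: a single hand-written regular expression stands in for a trained classifier, and the frame label and element names (`Birth`, `Person`, `Place`) are illustrative rather than taken from any frame repository.

```python
import re
from typing import Dict, Optional

# One hand-written pattern for a single lexical unit ("born").
# A real system would learn this from a training set instead.
BORN_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w ]+?) was born in (?P<place>[A-Z][\w ]+)"
)


def extract_frame(sentence: str) -> Optional[Dict]:
    """Return a frame with its elements, or None if nothing matches."""
    match = BORN_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "frame": "Birth",        # illustrative label
        "lexical_unit": "born",
        "elements": {
            "Person": match.group("person").strip(),
            "Place": match.group("place").strip(),
        },
    }


frame = extract_frame("Ada Example was born in Trento.")
print(frame)
```

Each extracted frame element could then be mapped to a property and value in the output dataset. The pattern is deliberately naive and would mis-segment sentences that start with other capitalised words.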
WHERE?
PRIMARY SOURCES TOOL
A. BIOGRAPHIES
B. COMPANIES
C. BIOMEDICAL
which domain?
FIRST STEP
THANKS, NEMO, FOR OUR VALUABLE CONVERSATION
FIRST STEP
BIOGRAPHIES
▸ plenty of existing data
▸ broad coverage
▸ potentially easy to find valuable primary sources
LIBRARIANS,
WHAT DO YOU THINK?
FIRST STEP
COMPANIES
▸ relatively biased domain
▸ ad-prone content
▸ companies edit the pages about themselves
▸ low-quality data
FIRST STEP
BIOMEDICAL
▸ great primary source
▸ PubMed: scientific papers
▸ proof of usage for an Open Access corpus
OPEN DISCUSSION
DOMAIN + SOURCE SELECTION
THIS WORK IS LICENSED UNDER A CC BY-SA 4.0 LICENSE
https://pad.okfn.org/p/strephit

StrepHit IEG Kick-off Seminar