14th DBpedia Community meeting
12 September 2019, Karlsruhe
Reference extraction from Wikipedia infoboxes
Włodzimierz Lewoniewski, Krzysztof Węcel
Introduction
• Wikipedia infoboxes may contain references, which can be useful for checking the reliability of the provided data
• References (sources) in Wikipedia are represented in various formats
• It is possible to extract source metadata: authors, title, URL, DOI, ISBN, etc.
2019.09.12
2
Considered data
• We used dumps from September 2019 for 10 Wikipedia languages:

Language          Articles     Articles with infobox   Share
English (en)      5 921 047    3 893 781               65.8%
Swedish (sv)      3 749 566    3 068 186               81.8%
German (de)       2 338 219    773 217                 33.1%
French (fr)       2 136 118    1 337 170               62.6%
Dutch (nl)        1 977 370    1 568 009               79.3%
Russian (ru)      1 565 802    1 055 754               67.4%
Italian (it)      1 550 407    1 090 419               70.3%
Spanish (es)      1 542 071    1 060 434               68.8%
Polish (pl)       1 356 252    987 240                 72.8%
Portuguese (pt)   1 012 953    539 999                 53.3%
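The "Share" column is the number of articles with an infobox divided by the total number of articles. A minimal sketch of that arithmetic for three of the rows above (the function and variable names are illustrative, not taken from PIRE):

```python
# Article counts copied from the table: (all articles, articles with infobox).
counts = {
    "en": (5_921_047, 3_893_781),
    "sv": (3_749_566, 3_068_186),
    "de": (2_338_219, 773_217),
}


def infobox_share(articles, with_infobox):
    """Return the percentage of articles that contain an infobox."""
    return 100.0 * with_infobox / articles


# Round to one decimal place, as in the table.
shares = {lang: round(infobox_share(a, w), 1) for lang, (a, w) in counts.items()}
```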
Infobox extraction
• We used our own Python Infobox Reference Extractor (PIRE) to provide statistics about citations in 10 language versions.
• PIRE input: infobox names and Wikipedia XML dumps for each language.
• PIRE output: Wikipedia article name, infobox name, infobox parameter name, reference code, citation metadata and others.
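The input/output contract above can be illustrated with a small sketch. This is not PIRE itself: it assumes the wikitext of one article is already available as a string, and the function names (`find_template`, `split_top_level`, `extract_references`) and the sample article are hypothetical. It returns (infobox name, parameter name, reference code) rows for references found in infobox parameters.

```python
import re

# Matches <ref ...>body</ref>; self-closing <ref ... /> tags are skipped.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def find_template(text, start):
    """Return the index just past the '}}' closing the '{{' at `start`, or -1."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # unbalanced braces


def split_top_level(body):
    """Split on '|' characters that are not nested inside {{..}} or [[..]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def extract_references(wikitext, infobox_names):
    """Collect references from infobox parameters.

    `infobox_names` must contain lower-case template names.
    """
    rows = []
    pos = wikitext.find("{{")
    while pos != -1:
        end = find_template(wikitext, pos)
        if end == -1:
            break
        parts = split_top_level(wikitext[pos + 2:end - 2])
        name = parts[0].strip()
        if name.lower() in infobox_names:
            for part in parts[1:]:
                key, sep, value = part.partition("=")
                if not sep:
                    continue  # positional parameter, no name to report
                for ref in REF_RE.findall(value):
                    rows.append((name, key.strip(), ref.strip()))
            pos = wikitext.find("{{", end)
        else:
            pos = wikitext.find("{{", pos + 2)
    return rows


# Hypothetical article text used only for illustration.
sample = """{{Infobox settlement
| name = Exampleville
| population_total = 12345
| population_footnotes = <ref>{{cite web |url=http://example.org/census |title=Census 2011}}</ref>
}}
Some article text.<ref>Not in the infobox</ref>"""

rows = extract_references(sample, {"infobox settlement"})
```

The reference in the article body is ignored because it lies outside the infobox template. A real extractor would also have to handle named references, comments and `<nowiki>` sections, which this sketch leaves out.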
Infobox extraction - result
• General statistics about extraction in September 2019

Language          Parameters with value   Parameters with reference   Share
English (en)      59 959 916              2 084 768                   3.48%
Swedish (sv)      54 789 889              816 569                     1.49%
Dutch (nl)        17 889 993              80 741                      0.45%
French (fr)       17 437 104              1 157 619                   6.64%
Italian (it)      16 166 909              230 494                     1.43%
Russian (ru)      15 966 329              232 023                     1.45%
Spanish (es)      15 394 821              344 649                     2.24%
Polish (pl)       12 770 997              490 368                     3.84%
German (de)       11 426 462              347 691                     3.04%
Portuguese (pt)   7 249 771               152 380                     2.10%
Infobox parameters with references

English Wikipedia
Parameter name            Refs
refnum                    61614
population_footnotes      49797
status_ref                42887
area_footnotes            40574
synonyms_ref              39110
blank_info                37907
footnotes                 36165
authority                 35741
birth_date                35648
genre                     31259

German Wikipedia
Parameter name            Refs
Einwohner-Quelle          11913
NACHWEIS-LÄNGE            9397
Löslichkeit               8928
Quellen Alben             8821
HubbleRef                 8063
RekDekRef                 7862
Mitarbeiterzahl           7672
Schmelzpunkt              7661
NACHWEIS-EINZUGSGEBIET    7554
Beschreibung              7548

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
Infoboxes with references

English Wikipedia
Infobox name                                      Refs
Infobox settlement                                272152
Speciesbox                                        96764
Video game reviews                                85487
Taxobox                                           83279
Infobox NRHP                                      79235
Infobox football biography                        71031
Infobox film                                      65913
Infobox planet                                    53536
Infobox person                                    46782
Infobox company                                   43595

German Wikipedia
Infobox name                                      Refs
Infobox Chemikalie                                59976
Infobox Galaxie                                   58135
Infobox Fluss                                     34890
Infobox Unternehmen                               19506
Infobox Chartplatzierungen                        15932
Infobox Ortsteil einer Gemeinde in Deutschland    14550
Infobox Mineral                                   13244
Infobox Stern                                     7450
Infobox See                                       6035
Infobox Hochschule                                4407

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
API
• After combining PIRE with the DBpedia Extraction Framework, it is possible to get RDF triples with source metadata through an API:
• URL: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article={Wikipedia_article_URL}&format=json&dbpedia
• User script for Wikipedia: https://meta.wikimedia.org/wiki/User:JohannesFre/global.js
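A request URL following the pattern above can be assembled as sketched below. The query string is copied from the slide as-is, including the trailing `&dbpedia`. Percent-encoding the article URL is an assumption on our side (the slide does not say whether the service expects it), and `build_reference_url` is a hypothetical helper name. No request is sent here.

```python
from urllib.parse import quote

# Endpoint as given on the slide.
API_BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def build_reference_url(article_url):
    """Fill the {Wikipedia_article_URL} placeholder of the URL pattern."""
    # Assumption: the article URL is percent-encoded so that it is passed
    # as a single query value.
    encoded = quote(article_url, safe="")
    return f"{API_BASE}?article={encoded}&format=json&dbpedia"


url = build_reference_url("https://en.wikipedia.org/wiki/Berlin")
```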
API – User script for Wikipedia
Challenges: infobox names
• Each Wikipedia language version has a large number of differently named templates.
• Only a small part of these names refers to infoboxes.
• Infobox names can be listed in special categories, e.g. https://en.wikipedia.org/wiki/Category:Infobox_templates
• Problems:
• Depending on the language, infobox titles can be placed in subcategories at different levels
• Not all titles point to infoboxes
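The subcategory problem can be illustrated with a depth-limited breadth-first walk over a category tree. The sketch below uses a small in-memory stand-in for the category graph. The subcategory names, the members and the function `collect_templates` are hypothetical and only show how the chosen depth changes the set of collected titles.

```python
from collections import deque


def collect_templates(subcats, members, root, max_depth):
    """Collect page titles from `root` and its subcategories up to `max_depth`."""
    seen = {root}
    queue = deque([(root, 0)])
    found = set()
    while queue:
        category, depth = queue.popleft()
        found.update(members.get(category, ()))
        if depth == max_depth:
            continue  # do not descend any further
        for child in subcats.get(category, ()):
            if child not in seen:  # guard against category cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return found


# Hypothetical category graph, for illustration only.
subcats = {
    "Infobox templates": ["Geography infobox templates"],
    "Geography infobox templates": ["Settlement infobox templates"],
}
members = {
    "Infobox templates": ["Infobox person"],
    "Geography infobox templates": ["Infobox river"],
    "Settlement infobox templates": ["Infobox settlement"],
}

shallow = collect_templates(subcats, members, "Infobox templates", 1)
deep = collect_templates(subcats, members, "Infobox templates", 2)
```

With depth 1 the settlement infobox is missed; with depth 2 it is found. A further filtering step would still be needed because not every title in such categories is an infobox.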
Challenges: templates in parameters
• Source metadata can contain other templates:
• {{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} | title=...}}
• {{cite web | url= {{NRHP url|id=79000934}} |title=…}}
• {{cite web | url = {{BillboardURLbyName | artist=garth brooks|bio=true}} | last = Erlewine | first = Stephen Thomas | title = …}}
• …
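Nested templates mean that a citation cannot simply be split on every `|`. A minimal sketch, assuming a well-formed template string, splits only on pipes at nesting depth zero so that the inner template stays intact as the value of `url`. The helper names are illustrative and this is not the PIRE implementation. Turning the inner template into an actual URL would additionally require expanding it, which is not attempted here.

```python
def split_top_level(body):
    """Split on '|' characters that are not nested inside {{..}} or [[..]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(text):
    """Return (template name, parameters) for a string of the form '{{...}}'."""
    parts = split_top_level(text.strip()[2:-2])
    name = parts[0].strip()
    params, position = {}, 0
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep and "{{" not in key:
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameter: numbered by position, as in wikitext.
            position += 1
            params[str(position)] = part.strip()
    return name, params


# First example from the slide, copied verbatim (the title is elided there).
citation = "{{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} | title=...}}"
name, params = parse_template(citation)
inner_name, inner_params = parse_template(params["url"])
```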
Challenges: citation templates
• Apart from general citation templates, there are specific ones that refer to a concrete source (website, encyclopedia, book, article, etc.) and are placed within the <ref> tag
• Examples in English Wikipedia:
• {{London Gazette |issue= |date= |page= }}
• {{NRISref | 2013a | dateform=mdy | accessdate=September 10, 2019 | refnum=66000030 | name=Lincoln Memorial}}
• {{Iran Census 2006 | 07}}
• {{GEOnet3 | -3064853}}
• …
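One way to separate the two groups is to read the template name at the start of a reference body and compare it with a list of general citation templates. The sketch below assumes such a list is maintained by hand. The set shown is a small, incomplete sample, and `classify_reference` is a hypothetical helper, not part of PIRE.

```python
import re

# A few general-purpose citation templates of English Wikipedia (incomplete).
GENERAL_TEMPLATES = {"cite web", "cite book", "cite journal", "cite news", "citation"}

# Captures the template name between '{{' and the first '|' or '}}'.
NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def classify_reference(ref_body):
    """Return ('general' | 'specific' | 'none', template name or None)."""
    match = NAME_RE.search(ref_body)
    if match is None:
        return ("none", None)  # plain-text reference without a template
    name = match.group(1)
    kind = "general" if name.lower() in GENERAL_TEMPLATES else "specific"
    return (kind, name)


examples = [
    classify_reference("{{cite web | url=http://example.org | title=Example}}"),
    classify_reference("{{London Gazette |issue= |date= |page= }}"),
    classify_reference("{{GEOnet3 | -3064853}}"),
    classify_reference("Smith 2005, p. 3"),
]
```

Source-specific templates recognised this way still need their own mapping to metadata fields, because their parameters differ from template to template.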
Templates in infobox parameters

English Wikipedia
Template name         Number
Coord                 795314
Convert               669388
Cite web              614827
Birth date and age    550629
Flag                  304386
Death date and age    256338
Birth date            248617
Flagicon              239973
Start date            201211
URL                   177383

German Wikipedia
Template name         Number
Team-Station          384153
AB                    125220
0                     115069
RSIGN                 70032
Charts                62709
USA                   27711
DEU                   24978
Medaillenspiegel      22391
Internetquelle        21693
Single                20833

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
Challenges: footnote templates
• Some citation templates are not placed within a <ref> tag; they generate one when the page is rendered.
• Examples:
• {{sfn|Solomon|1989|p=24}}
• {{sfnm | 1a1=Perramon | 1y=1986 | 1p=242 | 2a1=Clendinnen | 2y=2003 | 2pp=3–4 }}
• {{sfnp|Smith|Jones|Brown|2005|p=25}}
• …
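Because these templates never appear inside a `<ref>` tag, a search for `<ref>` alone misses them. A minimal sketch of a separate pass that finds templates whose name starts with `sfn` and splits their arguments into positional and named ones is shown below. The regular expression and the function name are illustrative, and the pattern assumes that the arguments contain no nested templates.

```python
import re

# Matches {{sfn...|args}} where the arguments contain no nested braces.
SFN_RE = re.compile(r"\{\{\s*(sfn[a-z]*)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def find_short_footnotes(wikitext):
    """Return (template name, positional args, named args) for each match."""
    results = []
    for match in SFN_RE.finditer(wikitext):
        positional, named = [], {}
        for part in match.group(2).split("|"):
            key, sep, value = part.partition("=")
            if sep:
                named[key.strip()] = value.strip()
            else:
                positional.append(part.strip())
        results.append((match.group(1).lower(), positional, named))
    return results


# Illustrative wikitext using two of the examples above.
text = "Early life.{{sfn|Solomon|1989|p=24}} Later work.{{sfnp|Smith|Jones|Brown|2005|p=25}}"
footnotes = find_short_footnotes(text)
```

The extracted author names and year only identify a short citation. Resolving it to full source metadata requires matching it against the full citation elsewhere in the article.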
Challenges: references not aligned to concrete infobox parameters
Challenges: infobox parameter not represented in wikitext
• “Infobox German location” automatically transcludes population data from {{Population Germany}}
Future work
• Improving the extraction algorithm
• Unification of the parameters of source metadata
• Assessing the quality of the references
• Quality/popularity measures of Wikipedia articles (WikiRank.net)
• Appearance of the source in different databases
• Assessment of the domain reputation/popularity
• Finding the best source for specific data, such as population for cities, revenue for companies, etc.
• Integrating PIRE into the DBpedia Extraction Framework
• Integrating citation metadata and measures into the GFS Data Browser
Related publications
• Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics (2019)
• Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia (2019)
• Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources (2018)
• Completeness and Reliability of Wikipedia Infoboxes in Various Languages (2018)
• Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles (2017)
• Analysis of References across Wikipedia Languages (2017)
• Quality and Importance of Wikipedia Articles in Different Languages (2016)
• Modelling the Quality of Attributes in Wikipedia Infoboxes (2015)
