Some facts in Wikipedia infoboxes carry references, which can be useful for checking the reliability of the data provided. We present challenges and methods connected with extracting the metadata of Wikipedia's sources. We used the DBpedia Extraction Framework along with our own extensions in Python to provide statistics about citations in 10 language versions. The presented methods can be used to verify and synchronize facts depending on the quality assessment of sources.
Presented at the 14th DBpedia Community Meeting during SEMANTiCS 2019 in Karlsruhe
Reference Extraction from Wikipedia Infoboxes
1. 14th DBpedia Community meeting
12 September 2019, Karlsruhe
Reference extraction from Wikipedia infoboxes
Włodzimierz Lewoniewski, Krzysztof Węcel
2. Introduction
• Wikipedia infoboxes may contain references, which can be useful for checking the reliability of the provided data
• References (sources) in Wikipedia are represented in various formats
• It is possible to extract source metadata
  • Authors, title, URL, DOI, ISBN etc.
2019.09.12
3. Considered data
We used dumps from September 2019 for 10 Wikipedia languages:

Language           Articles     Articles with infobox
                                Number       Share
English (en)       5,921,047    3,893,781    65.8%
Swedish (sv)       3,749,566    3,068,186    81.8%
German (de)        2,338,219      773,217    33.1%
French (fr)        2,136,118    1,337,170    62.6%
Dutch (nl)         1,977,370    1,568,009    79.3%
Russian (ru)       1,565,802    1,055,754    67.4%
Italian (it)       1,550,407    1,090,419    70.3%
Spanish (es)       1,542,071    1,060,434    68.8%
Polish (pl)        1,356,252      987,240    72.8%
Portuguese (pt)    1,012,953      539,999    53.3%
4. Infobox extraction
We used our own Python Infobox Reference Extractor (PIRE) to provide statistics about citations in 10 language versions.
PIRE input: infobox names and Wikipedia XML dumps for each language.
PIRE output: Wikipedia article name, infobox name, infobox parameter name, reference code, citation metadata and others.
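The input/output description above can be illustrated with a minimal sketch. This is not PIRE itself: the function names, the regular expression and the sample wikitext (including the article name and population value) are assumptions made for illustration. It finds one infobox call in an article's wikitext, splits it into parameters while respecting nested `{{...}}` and `[[...]]`, and emits one row per `<ref>` found in a parameter value.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing <ref ... />.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def split_top_level(text, sep="|"):
    """Split on `sep` only outside nested {{...}} and [[...]]."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
            i += 1
        else:
            i += 1
    parts.append(text[start:])
    return parts


def find_template(wikitext, name):
    """Return the inner text of the first {{name ...}} call, or None."""
    pattern = r"\{\{\s*" + re.escape(name) + r"\s*[|\n]"
    match = re.search(pattern, wikitext, re.IGNORECASE)
    if match is None:
        return None
    start, depth, i = match.start(), 0, match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def extract_references(article, infobox_name, wikitext):
    """Return (article, infobox, parameter, reference code) rows."""
    body = find_template(wikitext, infobox_name)
    rows = []
    if body is None:
        return rows
    for part in split_top_level(body)[1:]:  # parts[0] is the infobox name
        if "=" not in part:
            continue
        param, value = part.split("=", 1)
        for ref in REF_RE.findall(value):
            rows.append((article, infobox_name, param.strip(), ref.strip()))
    return rows


# Made-up sample wikitext, for illustration only.
SAMPLE = """{{Infobox settlement
| name = Example City
| population_total = 12345
| population_footnotes = <ref>{{cite web |url=https://example.org/stat |title=Population}}</ref>
}}"""

rows = extract_references("Example City", "Infobox settlement", SAMPLE)
```

A real extractor would additionally have to read the XML dump page by page and handle named references that are defined elsewhere in the article; the sketch only covers references written directly inside the infobox.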
5. Infobox extraction - result
General statistics about extraction in September 2019

Language           Parameters     Parameters with reference
                   with value     Number       Share
English (en)       59,959,916     2,084,768    3.48%
Swedish (sv)       54,789,889       816,569    1.49%
Dutch (nl)         17,889,993        80,741    0.45%
French (fr)        17,437,104     1,157,619    6.64%
Italian (it)       16,166,909       230,494    1.43%
Russian (ru)       15,966,329       232,023    1.45%
Spanish (es)       15,394,821       344,649    2.24%
Polish (pl)        12,770,997       490,368    3.84%
German (de)        11,426,462       347,691    3.04%
Portuguese (pt)     7,249,771       152,380    2.10%
6. Infobox parameters with references
English Wikipedia

Parameter name            Refs
refnum                    61614
population_footnotes      49797
status_ref                42887
area_footnotes            40574
synonyms_ref              39110
blank_info                37907
footnotes                 36165
authority                 35741
birth_date                35648
genre                     31259

German Wikipedia

Parameter name            Refs
Einwohner-Quelle          11913
NACHWEIS-LÄNGE            9397
Löslichkeit               8928
Quellen Alben             8821
HubbleRef                 8063
RekDekRef                 7862
Mitarbeiterzahl           7672
Schmelzpunkt              7661
NACHWEIS-EINZUGSGEBIET    7554
Beschreibung              7548

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
7. Infoboxes with references
English Wikipedia

Infobox name                  Refs
Infobox settlement            272152
Speciesbox                    96764
Video game reviews            85487
Taxobox                       83279
Infobox NRHP                  79235
Infobox football biography    71031
Infobox film                  65913
Infobox planet                53536
Infobox person                46782
Infobox company               43595

German Wikipedia

Infobox name                                         Refs
Infobox Chemikalie                                   59976
Infobox Galaxie                                      58135
Infobox Fluss                                        34890
Infobox Unternehmen                                  19506
Infobox Chartplatzierungen                           15932
Infobox Ortsteil einer Gemeinde in Deutschland       14550
Infobox Mineral                                      13244
Infobox Stern                                        7450
Infobox See                                          6035
Infobox Hochschule                                   4407

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
8. API
• After combining PIRE with the DBpedia Extraction Framework, it is possible to get RDF triples with source metadata through an API:
  • URL: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article={Wikipedia_article_URL}&format=json&dbpedia
  • User script for Wikipedia: https://meta.wikimedia.org/wiki/User:JohannesFre/global.js
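As a hedged illustration of how the URL pattern above might be filled in, the following sketch builds a request URL for one article. The base URL and the `format=json&dbpedia` suffix are copied from the slide; the function name is hypothetical, and whether the service expects the article URL to be percent-encoded (or is still reachable at this address) is an assumption, so the sketch only constructs the string and does not send a request.

```python
from urllib.parse import quote

# Endpoint as shown on the slide; availability is not guaranteed.
API_BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def build_request_url(article_url):
    """Fill the {Wikipedia_article_URL} placeholder of the slide's URL pattern.

    Assumption: the article URL is passed percent-encoded as a query value.
    """
    encoded = quote(article_url, safe="")
    # The trailing "&dbpedia" is kept verbatim from the slide.
    return f"{API_BASE}?article={encoded}&format=json&dbpedia"


url = build_request_url("https://en.wikipedia.org/wiki/Berlin")
```

The resulting string could then be fetched with any HTTP client and the JSON response inspected for the returned triples.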
10. Challenges: infobox names
• There are plenty of different template names in each Wikipedia language.
• Only a small part of these names refers to infoboxes.
• Infobox names can be listed in special categories
  • https://en.wikipedia.org/wiki/Category:Infobox_templates
• Problems:
  • Depending on the language, infobox titles can be placed in subcategories at different levels
  • Not all titles point to infoboxes
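The two problems above can be sketched as a depth-limited walk over a category tree followed by a title filter. The sketch works on an in-memory dictionary instead of a live wiki; apart from `Category:Infobox templates` and `Infobox German location`, the category and template titles are hypothetical, and the prefix-based filter is a naive assumption that would not hold for every language version.

```python
from collections import deque


def collect_titles(category_members, root, max_depth):
    """Breadth-first walk collecting page titles down to `max_depth` levels.

    `category_members` maps a category title to (subcategories, pages).
    """
    seen = {root}
    queue = deque([(root, 0)])
    titles = []
    while queue:
        category, depth = queue.popleft()
        subcategories, pages = category_members.get(category, ([], []))
        titles.extend(pages)
        if depth < max_depth:
            for sub in subcategories:
                if sub not in seen:  # guard against category loops
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return titles


def looks_like_infobox(title):
    """Naive filter (assumption): keep 'Template:Infobox...' but not doc pages."""
    return title.startswith("Template:Infobox") and not title.endswith("/doc")


# Hypothetical category graph for illustration.
GRAPH = {
    "Category:Infobox templates": (
        ["Category:Geography infobox templates"],
        ["Template:Infobox person", "Template:Infobox/doc"],
    ),
    "Category:Geography infobox templates": (
        ["Category:Settlement infobox templates"],
        ["Template:Infobox settlement"],
    ),
    "Category:Settlement infobox templates": (
        [],
        ["Template:Infobox German location"],
    ),
}

shallow = [t for t in collect_titles(GRAPH, "Category:Infobox templates", 1)
           if looks_like_infobox(t)]
deep = [t for t in collect_titles(GRAPH, "Category:Infobox templates", 2)
        if looks_like_infobox(t)]
```

Comparing `shallow` and `deep` shows why the depth limit matters: a template that sits two levels down is missed when only the first subcategory level is visited.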
11. Challenges: templates in parameters
• Metadata of the source can contain other templates
  • {{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} | title=...}}
  • {{cite web | url= {{NRHP url|id=79000934}} |title=…}}
  • {{cite web | url = {{BillboardURLbyName | artist=garth brooks|bio=true}} | last = Erlewine | first = Stephen Thomas | title = …}}
  • …
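A minimal sketch of why nested templates complicate parsing: splitting a citation on every `|` would cut through the inner template, so the split has to track brace depth. The function names are hypothetical and the `title=Example` value is a placeholder, since the slide elides the real title. The sketch only detects and parses the inner call; turning it into an actual URL would require expanding the template as MediaWiki does, which is not attempted here.

```python
def split_top_level(text, sep="|"):
    """Split on `sep` only outside nested {{...}} and [[...]]."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
            i += 1
        else:
            i += 1
    parts.append(text[start:])
    return parts


def parse_template(call):
    """Parse '{{name | k=v | ...}}' into (name, params)."""
    inner = call.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    parts = split_top_level(inner[2:-2])
    name, params, positional = parts[0].strip(), {}, 0
    for part in parts[1:]:
        key, eq, value = part.partition("=")
        # An '=' that appears only inside a nested call does not name a parameter.
        if eq and "{{" not in key and "[[" not in key:
            params[key.strip()] = value.strip()
        else:
            positional += 1
            params[str(positional)] = part.strip()
    return name, params


def nested_templates(params):
    """Parse values that consist of one template call (assumed, not verified)."""
    return {
        key: parse_template(value)
        for key, value in params.items()
        if value.startswith("{{") and value.endswith("}}")
    }


# Placeholder title; the slide elides the original value.
CITATION = ("{{cite web |url= {{Allmusic | class=artist | id=p44722 "
            "| pure_url=yes}} | title=Example}}")

name, params = parse_template(CITATION)
inner = nested_templates(params)
```

Here `params["url"]` holds the raw inner call rather than a URL, which is exactly the situation the slide describes.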
12. Challenges: citation templates
• Apart from general citation templates, there are specific templates that refer to one particular source (website, encyclopedia, book, article etc.) and are placed within the <ref> tag
• Examples in English Wikipedia:
  • {{London Gazette |issue= |date= |page= }}
  • {{NRISref | 2013a | dateform=mdy | accessdate=September 10, 2019 | refnum=66000030 | name=Lincoln Memorial}}
  • {{Iran Census 2006 | 07}}
  • {{GEOnet3 | -3064853}}
  • …
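One way to separate the two kinds of templates is to read the template name from the reference body and compare it with a list of general citation templates. The sketch below assumes a short, deliberately incomplete list for English Wikipedia; the list, the category labels and the function names are illustrative and are not taken from PIRE.

```python
import re

# Illustrative and incomplete list of general-purpose citation templates.
GENERAL_TEMPLATES = {"Cite web", "Cite book", "Cite journal", "Cite news", "Citation"}

TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def normalise(name):
    """Collapse whitespace/underscores and upper-case the first letter."""
    name = " ".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def first_template_name(ref_body):
    """Return the normalised name of the first template in a <ref> body."""
    match = TEMPLATE_NAME_RE.search(ref_body)
    return normalise(match.group(1)) if match else None


def classify(ref_body):
    """Label a reference body as 'general', 'source-specific' or 'plain text'."""
    name = first_template_name(ref_body)
    if name is None:
        return "plain text"
    return "general" if name in GENERAL_TEMPLATES else "source-specific"


labels = [
    classify("{{cite web |url=https://example.org |title=Example}}"),
    classify("{{GEOnet3 | -3064853}}"),
    classify("{{London Gazette |issue= |date= |page= }}"),
    classify("Census report, 2006."),
]
```

For a source-specific template the parameters alone (for example a bare identifier) do not carry the title or URL, so its metadata would have to come from the template's own definition.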
13. Templates in infobox parameters
English Wikipedia

Template name         Number
Coord                 795314
Convert               669388
Cite web              614827
Birth date and age    550629
Flag                  304386
Death date and age    256338
Birth date            248617
Flagicon              239973
Start date            201211
URL                   177383

German Wikipedia

Template name         Number
Team-Station          384153
AB                    125220
0                     115069
RSIGN                 70032
Charts                62709
USA                   27711
DEU                   24978
Medaillenspiegel      22391
Internetquelle        21693
Single                20833

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
14. Challenges: footnote templates
• Some citation templates are not placed within a <ref> tag; they generate one when the page is rendered.
• Examples:
  • {{sfn|Solomon|1989|p=24}}
  • {{sfnm | 1a1=Perramon | 1y=1986 | 1p=242 | 2a1=Clendinnen | 2y=2003 | 2pp=3–4 }}
  • {{sfnp|Smith|Jones|Brown|2005|p=25}}
  • …
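A short sketch of how the simpler footnote templates could be read, assuming the usual layout of `sfn` and `sfnp` calls: author surnames first, the year as the last positional argument, and the page given as `p=` or `pp=`. The function name and the output dictionary are hypothetical, and `sfnm`, which uses numbered parameters such as `1a1` and `1y`, is deliberately left out.

```python
def parse_short_footnote(call):
    """Parse an {{sfn|...}} or {{sfnp|...}} call into authors, year and pages."""
    inner = call.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    parts = [part.strip() for part in inner[2:-2].split("|")]
    name, args = parts[0], parts[1:]
    if name.lower() not in ("sfn", "sfnp"):
        raise ValueError(f"unsupported template: {name}")
    positional = [arg for arg in args if "=" not in arg]
    named = {}
    for arg in args:
        if "=" in arg:
            key, value = arg.split("=", 1)
            named[key.strip()] = value.strip()
    if not positional:
        raise ValueError("missing author/year arguments")
    return {
        "template": name,
        "authors": positional[:-1],  # everything before the year
        "year": positional[-1],      # assumed to be the last positional argument
        "page": named.get("p"),
        "pages": named.get("pp"),
    }


single = parse_short_footnote("{{sfn|Solomon|1989|p=24}}")
several = parse_short_footnote("{{sfnp|Smith|Jones|Brown|2005|p=25}}")
```

The parsed author and year only identify a short footnote; the full citation it points to is defined elsewhere in the article and would need a second lookup to recover title, URL or ISBN.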
16. Challenges: infobox parameter not represented in wikitext
“Infobox German location” automatically transcludes population data from {{Population Germany}}
17. Future work
• Improving the extraction algorithm
• Unifying the parameters of source metadata
• Assessing the quality of the references
  • Quality/popularity measures of Wikipedia articles (WikiRank.net)
  • Appearance of the source in different databases
  • Assessment of the domain reputation/popularity
• Finding the best source for specific data
  • Such as population for cities, revenue for companies etc.
• Integrating PIRE into the DBpedia Extraction Framework
• Integrating citation metadata and measures into the GFS Data Browser
18. Related publications
• Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics (2019)
• Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia (2019)
• Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources (2018)
• Completeness and Reliability of Wikipedia Infoboxes in Various Languages (2018)
• Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles (2017)
• Analysis of References across Wikipedia Languages (2017)
• Quality and Importance of Wikipedia Articles in Different Languages (2016)
• Modelling the Quality of Attributes in Wikipedia Infoboxes (2015)