Reference Extraction from Wikipedia Infoboxes

Włodzimierz Lewoniewski, Assistant professor at Poznań University of Economics and Business
14th DBpedia Community meeting
12 September 2019, Karlsruhe
References extraction
from Wikipedia infoboxes
Włodzimierz Lewoniewski, Krzysztof Węcel
Introduction
• Wikipedia infoboxes may contain references, which can be useful for checking the reliability of the provided data.
• References (sources) in Wikipedia are represented in various formats.
• Source metadata can be extracted: authors, title, URL, DOI, ISBN, etc.
2019.09.12
Considered data
• We used dumps from September 2019 for 10 Wikipedia languages:

Language          Articles    Articles with infobox   Share
English (en)      5 921 047   3 893 781               65.8%
Swedish (sv)      3 749 566   3 068 186               81.8%
German (de)       2 338 219   773 217                 33.1%
French (fr)       2 136 118   1 337 170               62.6%
Dutch (nl)        1 977 370   1 568 009               79.3%
Russian (ru)      1 565 802   1 055 754               67.4%
Italian (it)      1 550 407   1 090 419               70.3%
Spanish (es)      1 542 071   1 060 434               68.8%
Polish (pl)       1 356 252   987 240                 72.8%
Portuguese (pt)   1 012 953   539 999                 53.3%
Infobox extraction
• We used our own Python Infobox Reference Extractor (PIRE) to compute statistics about citations in 10 language versions.
• PIRE input: infobox names and Wikipedia XML dumps for each language.
• PIRE output: Wikipedia article name, infobox name, infobox parameter name, reference code, citation metadata, and others.
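The output fields listed above can be pictured as one record per reference. The following is a minimal sketch, not PIRE's actual code: the `InfoboxReference` class, the `references_in_infobox` function, and the sample parameter values are hypothetical, and the infobox is assumed to be already split into a parameter dictionary.

```python
import re
from dataclasses import dataclass

# Matches <ref ...>body</ref>. Self-closing tags such as <ref name="x" />
# are skipped because they carry no citation body of their own.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


@dataclass
class InfoboxReference:
    article: str    # Wikipedia article name
    infobox: str    # infobox name
    parameter: str  # infobox parameter name
    ref_code: str   # raw wikitext between <ref> and </ref>


def references_in_infobox(article, infobox, params):
    """Yield one record per <ref> found in the already-split parameters."""
    for name, value in params.items():
        for match in REF_RE.finditer(value):
            yield InfoboxReference(article, infobox, name, match.group(1).strip())


# Hypothetical sample data, for illustration only.
params = {
    "population_total": "12345",
    "population_footnotes": "<ref>{{cite web |url=https://example.org |title=Census}}</ref>",
}
records = list(references_in_infobox("Example Town", "Infobox settlement", params))
```

Citation metadata (authors, title, URL, and so on) would then be parsed from `ref_code` in a separate step.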
Infobox extraction - result
• General statistics about the extraction (September 2019):

Language          Parameters with value   Parameters with reference   Share
English (en)      59 959 916              2 084 768                   3.48%
Swedish (sv)      54 789 889              816 569                     1.49%
Dutch (nl)        17 889 993              80 741                      0.45%
French (fr)       17 437 104              1 157 619                   6.64%
Italian (it)      16 166 909              230 494                     1.43%
Russian (ru)      15 966 329              232 023                     1.45%
Spanish (es)      15 394 821              344 649                     2.24%
Polish (pl)       12 770 997              490 368                     3.84%
German (de)       11 426 462              347 691                     3.04%
Portuguese (pt)   7 249 771               152 380                     2.10%
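The Share column appears to be the number of parameters with a reference divided by the number of parameters with a value. A short check on three rows of the table, with the function name `reference_share` chosen here for illustration:

```python
# (parameters with value, parameters with reference), copied from the table
stats = {
    "en": (59_959_916, 2_084_768),
    "fr": (17_437_104, 1_157_619),
    "nl": (17_889_993, 80_741),
}


def reference_share(with_value, with_reference):
    """Percentage of filled-in parameters that carry a reference."""
    return 100 * with_reference / with_value


shares = {lang: round(reference_share(*counts), 2) for lang, counts in stats.items()}
```

The rounded values agree with the 3.48%, 6.64%, and 0.45% given for English, French, and Dutch.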
Infobox parameters with references
English Wikipedia

Parameter name           Refs
refnum                   61614
population_footnotes     49797
status_ref               42887
area_footnotes           40574
synonyms_ref             39110
blank_info               37907
footnotes                36165
authority                35741
birth_date               35648
genre                    31259

German Wikipedia

Parameter name           Refs
Einwohner-Quelle         11913
NACHWEIS-LÄNGE           9397
Löslichkeit              8928
Quellen Alben            8821
HubbleRef                8063
RekDekRef                7862
Mitarbeiterzahl          7672
Schmelzpunkt             7661
NACHWEIS-EINZUGSGEBIET   7554
Beschreibung             7548

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
Infoboxes with references
English Wikipedia

Infobox name                                     Refs
Infobox settlement                               272152
Speciesbox                                       96764
Video game reviews                               85487
Taxobox                                          83279
Infobox NRHP                                     79235
Infobox football biography                       71031
Infobox film                                     65913
Infobox planet                                   53536
Infobox person                                   46782
Infobox company                                  43595

German Wikipedia

Infobox name                                     Refs
Infobox Chemikalie                               59976
Infobox Galaxie                                  58135
Infobox Fluss                                    34890
Infobox Unternehmen                              19506
Infobox Chartplatzierungen                       15932
Infobox Ortsteil einer Gemeinde in Deutschland   14550
Infobox Mineral                                  13244
Infobox Stern                                    7450
Infobox See                                      6035
Infobox Hochschule                               4407

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
API
• After combining PIRE with the DBpedia Extraction Framework, it is possible to get RDF triples with source metadata through an API:
• URL: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article={Wikipedia_article_URL}&format=json&dbpedia
• User script for Wikipedia: https://meta.wikimedia.org/wiki/User:JohannesFre/global.js
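A request URL for this API could be assembled as in the sketch below. It only builds the string and sends nothing, since the endpoint may no longer be reachable. The helper name `references_url` is hypothetical, and percent-encoding the article URL is an assumption about what the service expects.

```python
from urllib.parse import quote

# Endpoint quoted on the slide; availability is not guaranteed.
API_BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def references_url(article_url):
    """Fill the {Wikipedia_article_URL} placeholder of the API URL."""
    # Assumption: the article URL is percent-encoded so that its own ':'
    # and '/' characters cannot be confused with the query string.
    return f"{API_BASE}?article={quote(article_url, safe='')}&format=json&dbpedia"


url = references_url("https://en.wikipedia.org/wiki/Karlsruhe")
```

The resulting string can be passed to any HTTP client; the shape of the JSON response is not described on the slide, so it is left out here.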
API – User script for Wikipedia
Challenges: infobox names
• Each Wikipedia language version has a large number of differently named templates.
• Only a small share of these names refers to infoboxes.
• Infobox names can be listed in special categories, e.g. https://en.wikipedia.org/wiki/Category:Infobox_templates
• Problems:
  • Depending on the language, infobox titles can be placed in subcategories at different levels.
  • Not all titles in these categories point to infoboxes.
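Both problems can be illustrated with a depth-limited walk over a category tree followed by a title filter. This is a sketch under stated assumptions: the tree below is a hand-made, hypothetical stand-in for data that would come from a dump or the MediaWiki API, and the filter in `looks_like_infobox` is a crude heuristic, not the rule PIRE uses.

```python
from collections import deque

# Hypothetical category tree: category -> (subcategories, template titles)
CATEGORY_TREE = {
    "Category:Infobox templates": (
        ["Category:People infobox templates"],
        ["Template:Infobox settlement", "Template:Infobox templates/doc"],
    ),
    "Category:People infobox templates": (
        ["Category:Sports infobox templates"],
        ["Template:Infobox person"],
    ),
    "Category:Sports infobox templates": (
        [],
        ["Template:Infobox football biography"],
    ),
}


def collect_templates(root, max_depth):
    """Breadth-first walk that gathers template titles down to max_depth."""
    found, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        subcategories, templates = CATEGORY_TREE.get(category, ([], []))
        found.extend(templates)
        if depth < max_depth:
            for sub in subcategories:
                if sub not in seen:  # guard against category cycles
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return found


def looks_like_infobox(title):
    """Crude filter: drop documentation and sandbox subpages."""
    return not title.endswith(("/doc", "/sandbox"))


root = "Category:Infobox templates"
shallow = [t for t in collect_templates(root, 1) if looks_like_infobox(t)]
deep = [t for t in collect_templates(root, 2) if looks_like_infobox(t)]
```

With `max_depth=1` the football biography infobox is missed, which is why the depth has to be chosen per language.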
Challenges: templates in parameters
• Source metadata can contain other templates:
• {{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} | title=...}}
• {{cite web | url= {{NRHP url|id=79000934}} |title=…}}
• {{cite web | url = {{BillboardURLbyName | artist=garth brooks|bio=true}} | last = Erlewine | first = Stephen Thomas | title = …}}
• …
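Nested templates mean that a citation cannot simply be split on every `|`. One way to handle this, sketched below, is to track brace depth and split only at the top level. The functions `split_top_level` and `parse_template` are illustrative and not taken from PIRE, and the title value in the sample is made up.

```python
def split_top_level(text, sep="|"):
    """Split on `sep`, ignoring separators inside {{...}} or [[...]]."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            i += 2
        else:
            if text[i] == sep and depth == 0:
                parts.append(text[start:i])
                start = i + 1
            i += 1
    parts.append(text[start:])
    return parts


def parse_template(wikitext):
    """Return (name, params) for one template; nested templates stay raw."""
    inner = wikitext.strip()[2:-2]  # drop the outer {{ and }}
    name, *fields = split_top_level(inner)
    params, positional = {}, 0
    for field in fields:
        key, eq, value = field.partition("=")
        if eq and "{{" not in key:
            params[key.strip()] = value.strip()
        else:
            # Unnamed argument: number it the way MediaWiki does (1, 2, ...).
            positional += 1
            params[str(positional)] = field.strip()
    return name.strip(), params


cite = "{{cite web | url= {{NRHP url|id=79000934}} |title=Example}}"
name, params = parse_template(cite)
nested = parse_template(params["url"])
```

The nested `{{NRHP url}}` call still has to be expanded to an actual URL, which requires the template's own definition and is outside this sketch.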
Challenges: citation templates
• Apart from general citation templates, there are specific templates that refer to a particular source (website, encyclopedia, book, article, etc.) and are placed within the <ref> tag.
• Examples in English Wikipedia:
• {{London Gazette |issue= |date= |page= }}
• {{NRISref | 2013a | dateform=mdy | accessdate=September 10, 2019 | refnum=66000030 | name=Lincoln Memorial}}
• {{Iran Census 2006 | 07}}
• {{GEOnet3 | -3064853}}
• …
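A first step toward handling such templates is to tell them apart from the general ones. The sketch below reads the first template name inside each `<ref>` body and labels it. The set `GENERAL_CITATION_TEMPLATES` is an assumed, incomplete list, and the sample wikitext is invented for illustration.

```python
import re

REF_BODY_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# First template opened in the reference body; its name ends at '|' or '}}'.
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

# Assumption: a small, incomplete list of general-purpose citation templates.
GENERAL_CITATION_TEMPLATES = {"cite web", "cite book", "cite journal", "cite news", "citation"}


def classify_reference(ref_body):
    """Return (template name, kind) for the first template in a reference."""
    match = TEMPLATE_NAME_RE.search(ref_body)
    if match is None:
        return None, "plain text"
    name = match.group(1)
    kind = "general" if name.lower() in GENERAL_CITATION_TEMPLATES else "source-specific"
    return name, kind


wikitext = (
    "area = 10 <ref>{{GEOnet3 | -3064853}}</ref> "
    "pop = 5 <ref>{{cite web |url=https://example.org |title=Census}}</ref>"
)
kinds = [classify_reference(body) for body in REF_BODY_RE.findall(wikitext)]
```

Each source-specific template would then need its own mapping from positional arguments to metadata fields.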
Templates in infobox parameters
English Wikipedia

Template name        Number
Coord                795314
Convert              669388
Cite web             614827
Birth date and age   550629
Flag                 304386
Death date and age   256338
Birth date           248617
Flagicon             239973
Start date           201211
URL                  177383

German Wikipedia

Template name        Number
Team-Station         384153
AB                   125220
0                    115069
RSIGN                70032
Charts               62709
USA                  27711
DEU                  24978
Medaillenspiegel     22391
Internetquelle       21693
Single               20833

Calculation using PIRE based on Wikipedia dumps from September 2019
More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
Challenges: footnote templates
• Some citation templates are not placed within a <ref> tag; they generate the footnote when the page is rendered.
• Examples:
• {{sfn|Solomon|1989|p=24}}
• {{sfnm | 1a1=Perramon | 1y=1986 | 1p=242 | 2a1=Clendinnen | 2y=2003 | 2pp=3–4 }}
• {{sfnp|Smith|Jones|Brown|2005|p=25}}
• …
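Because these templates never appear inside `<ref>`, an extractor has to look for them directly. A minimal sketch for `{{sfn}}` and `{{sfnp}}` follows; `{{sfnm}}` uses a different argument scheme and is deliberately left out. The function name `parse_sfn` and the sample text are hypothetical.

```python
import re

# Shortened-footnote templates that create a footnote without <ref> tags.
# Only sfn and sfnp are covered here; the list is not exhaustive.
SFN_RE = re.compile(r"\{\{\s*(sfn|sfnp)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def parse_sfn(wikitext):
    """Yield authors, year, and location for each sfn/sfnp call."""
    for match in SFN_RE.finditer(wikitext):
        positional, named = [], {}
        for field in match.group(2).split("|"):
            key, eq, value = field.partition("=")
            if eq:
                named[key.strip()] = value.strip()
            else:
                positional.append(field.strip())
        # The last positional argument is the year; the preceding
        # positional arguments are author surnames.
        yield {
            "template": match.group(1).lower(),
            "authors": positional[:-1],
            "year": positional[-1] if positional else None,
            "location": named,
        }


text = "Born in Bonn.{{sfn|Solomon|1989|p=24}} Later work.{{sfnp|Smith|Jones|Brown|2005|p=25}}"
notes = list(parse_sfn(text))
```

The author and year only identify a short citation; the full source still has to be looked up in the article's bibliography.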
Challenges: references not aligned to concrete infobox parameters
Challenges: infobox parameter not represented in wikitext
• “Infobox German location” automatically transcludes population data from {{Population Germany}}
Future work
• Improving the extraction algorithm
• Unifying the parameters of source metadata
• Assessing the quality of the references
  • Quality/popularity measures of Wikipedia articles (WikiRank.net)
  • Appearance of the source in different databases
  • Assessment of the domain reputation/popularity
• Finding the best source for specific data, such as population for cities or revenue for companies
• Integrating PIRE into the DBpedia Extraction Framework
• Integrating citation metadata and measures into the GFS Data Browser