
Reference Extraction from Wikipedia Infoboxes

Some facts in Wikipedia infoboxes carry references, which can be useful for checking the reliability of the provided data. We present challenges and methods related to extracting metadata about Wikipedia's sources. We used the DBpedia Extraction Framework together with our own Python extensions to provide statistics about citations in 10 language versions. The presented methods can be used to verify and synchronize facts based on a quality assessment of their sources.

Presented at the 14th DBpedia Community Meeting, held during SEMANTiCS 2019 in Karlsruhe


  1. Reference extraction from Wikipedia infoboxes
     Włodzimierz Lewoniewski, Krzysztof Węcel
     14th DBpedia Community Meeting, 12 September 2019, Karlsruhe
  2. Introduction
     • Wikipedia infoboxes may contain references, which can be useful for checking the reliability of the provided data
     • References (sources) in Wikipedia are represented in various formats
     • It is possible to extract source metadata
       • Authors, title, URL, DOI, ISBN, etc.
  3. Considered data
     • We used dumps from September 2019 for 10 Wikipedia languages:

     Language          Articles     Articles with infobox
                                    Number       Share
     English (en)      5,921,047    3,893,781    65.8%
     Swedish (sv)      3,749,566    3,068,186    81.8%
     German (de)       2,338,219      773,217    33.1%
     French (fr)       2,136,118    1,337,170    62.6%
     Dutch (nl)        1,977,370    1,568,009    79.3%
     Russian (ru)      1,565,802    1,055,754    67.4%
     Italian (it)      1,550,407    1,090,419    70.3%
     Spanish (es)      1,542,071    1,060,434    68.8%
     Polish (pl)       1,356,252      987,240    72.8%
     Portuguese (pt)   1,012,953      539,999    53.3%
  4. Infobox extraction
     • We used our own Python Infobox Reference Extractor (PIRE) to provide statistics about citations in 10 language versions.
     • PIRE input:
       • infobox names and Wikipedia XML dumps for each language.
     • PIRE output:
       • Wikipedia article name, infobox name, infobox parameter name, reference code, citation metadata, and others.
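The slide lists PIRE's inputs and outputs but not its code. The following is only a minimal sketch of how records of that shape (article name, infobox name, parameter name, reference code) could be produced from one article's wikitext. The function names, the record keys, and the sample article are hypothetical, and the brace matching ignores edge cases such as triple braces or comments.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing <ref name="x" />.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def find_template_body(wikitext, name_prefix):
    """Return the text between the outer braces of the first template whose
    name starts with name_prefix, or None if there is no balanced match."""
    match = re.search(r"\{\{\s*" + re.escape(name_prefix), wikitext, re.IGNORECASE)
    if match is None:
        return None
    start, depth, i = match.start(), 0, match.start()
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on '|' characters that are not inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def extract_referenced_parameters(article, wikitext, infobox_prefix="Infobox"):
    """Return one record per <ref> found in a named infobox parameter."""
    body = find_template_body(wikitext, infobox_prefix)
    if body is None:
        return []
    parts = split_top_level(body)
    infobox_name = parts[0].strip()
    records = []
    for part in parts[1:]:
        if "=" not in part:
            continue  # positional parameters are skipped in this sketch
        name, value = part.split("=", 1)
        for ref_code in REF_RE.findall(value):
            records.append({
                "article": article,
                "infobox": infobox_name,
                "parameter": name.strip(),
                "reference": ref_code.strip(),
            })
    return records


# Hypothetical article text, not taken from a real dump.
sample = """{{Infobox settlement
| name = Exampletown
| population_total = 12345
| population_footnotes = <ref>{{cite web |url=http://example.org/census |title=Census 2011}}</ref>
| area_km2 = 10.5
}}"""

records = extract_referenced_parameters("Exampletown", sample)
```

Splitting only on top-level pipes keeps the pipes inside the nested citation template from being mistaken for infobox parameter boundaries.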
  5. Infobox extraction: results
     • General statistics about the extraction in September 2019

     Language          Parameters    Parameters with reference
                       with value    Number       Share
     English (en)      59,959,916    2,084,768    3.48%
     Swedish (sv)      54,789,889      816,569    1.49%
     Dutch (nl)        17,889,993       80,741    0.45%
     French (fr)       17,437,104    1,157,619    6.64%
     Italian (it)      16,166,909      230,494    1.43%
     Russian (ru)      15,966,329      232,023    1.45%
     Spanish (es)      15,394,821      344,649    2.24%
     Polish (pl)       12,770,997      490,368    3.84%
     German (de)       11,426,462      347,691    3.04%
     Portuguese (pt)    7,249,771      152,380    2.10%
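The share column on this slide is the number of parameters with a reference divided by the number of parameters with a value. As a small worked check, the sketch below recomputes it from the slide's figures; the dictionary layout and function name are illustrative only.

```python
# (parameters with value, parameters with reference), copied from the slide.
counts = {
    "en": (59_959_916, 2_084_768),
    "sv": (54_789_889, 816_569),
    "nl": (17_889_993, 80_741),
    "fr": (17_437_104, 1_157_619),
    "it": (16_166_909, 230_494),
    "ru": (15_966_329, 232_023),
    "es": (15_394_821, 344_649),
    "pl": (12_770_997, 490_368),
    "de": (11_426_462, 347_691),
    "pt": (7_249_771, 152_380),
}


def reference_share(with_value, with_reference):
    """Percentage of valued parameters that carry at least one reference."""
    return 100.0 * with_reference / with_value


shares = {lang: round(reference_share(*pair), 2) for lang, pair in counts.items()}
highest = max(shares, key=shares.get)
```

Sorting by share rather than by absolute count changes the picture: English has the most referenced parameters, but French has the highest share of them.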
  6. Infobox parameters with references

     English Wikipedia
     Parameter name           Refs
     refnum                   61614
     population_footnotes     49797
     status_ref               42887
     area_footnotes           40574
     synonyms_ref             39110
     blank_info               37907
     footnotes                36165
     authority                35741
     birth_date               35648
     genre                    31259

     German Wikipedia
     Parameter name           Refs
     Einwohner-Quelle         11913
     NACHWEIS-LÄNGE            9397
     Löslichkeit               8928
     Quellen Alben             8821
     HubbleRef                 8063
     RekDekRef                 7862
     Mitarbeiterzahl           7672
     Schmelzpunkt              7661
     NACHWEIS-EINZUGSGEBIET    7554
     Beschreibung              7548

     Calculation using PIRE based on Wikipedia dumps from September 2019.
     More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
  7. Infoboxes with references

     English Wikipedia
     Infobox name                                      Refs
     Infobox settlement                                272152
     Speciesbox                                         96764
     Video game reviews                                 85487
     Taxobox                                            83279
     Infobox NRHP                                       79235
     Infobox football biography                         71031
     Infobox film                                       65913
     Infobox planet                                     53536
     Infobox person                                     46782
     Infobox company                                    43595

     German Wikipedia
     Infobox name                                      Refs
     Infobox Chemikalie                                 59976
     Infobox Galaxie                                    58135
     Infobox Fluss                                      34890
     Infobox Unternehmen                                19506
     Infobox Chartplatzierungen                         15932
     Infobox Ortsteil einer Gemeinde in Deutschland     14550
     Infobox Mineral                                    13244
     Infobox Stern                                       7450
     Infobox See                                         6035
     Infobox Hochschule                                  4407

     Calculation using PIRE based on Wikipedia dumps from September 2019.
     More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
  8. API
     • After combining PIRE with the DBpedia Extraction Framework, it is possible to get RDF triples with source metadata through an API:
       • URL: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article={Wikipedia_article_URL}&format=json&dbpedia
     • User script for Wikipedia: https://meta.wikimedia.org/wiki/User:JohannesFre/global.js
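The URL pattern on this slide can be filled in programmatically. The sketch below only builds the request URL and does not contact the service, whose availability is not verified here. Percent-encoding the article URL is an assumption, since the slide does not say whether the endpoint expects it, and the function name is hypothetical.

```python
from urllib.parse import quote

# Base path and query parameters as given on the slide.
API_BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def build_reference_query(article_url: str) -> str:
    """Fill the {Wikipedia_article_URL} placeholder of the slide's URL pattern.

    Assumption: the article URL is percent-encoded so that its own ':' and '/'
    characters cannot be confused with the query string of the API call.
    """
    return f"{API_BASE}?article={quote(article_url, safe='')}&format=json&dbpedia"


query_url = build_reference_query("https://en.wikipedia.org/wiki/Berlin")
```

The resulting string could then be passed to any HTTP client; the response format beyond `format=json` is not described on the slide.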
  9. API: user script for Wikipedia
  10. Challenges: infobox names
     • Each Wikipedia language has a large number of differently named templates.
     • Only a small share of these names denote infoboxes.
     • Infobox names can be listed in special categories
       • https://en.wikipedia.org/wiki/Category:Infobox_templates
     • Problems:
       • Depending on the language, infobox titles can be placed in subcategories at different levels
       • Not all titles in these categories point to infoboxes
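One way to handle subcategories at different levels is a breadth-first walk over the category tree with a per-language depth limit, followed by a filter on the collected titles. The sketch below works on an in-memory mapping instead of the live category API. The toy category data, the function name, and the "/doc" filter are hypothetical illustrations, not the authors' rules.

```python
from collections import deque


def collect_templates(category_members, root, max_depth):
    """Breadth-first walk from root, descending at most max_depth levels.

    category_members maps a category title to the titles it contains
    (both templates and subcategories).
    """
    seen = {root}
    queue = deque([(root, 0)])
    templates = set()
    while queue:
        category, depth = queue.popleft()
        for title in category_members.get(category, []):
            if title.startswith("Category:"):
                # Descend only while within the depth limit; 'seen' guards
                # against category cycles.
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            elif title.startswith("Template:") and not title.endswith("/doc"):
                # Illustrative filter: documentation subpages are not infoboxes.
                templates.add(title)
    return templates


# Hypothetical category structure for illustration.
toy_tree = {
    "Category:Infobox templates": [
        "Template:Infobox person",
        "Category:Geography infobox templates",
    ],
    "Category:Geography infobox templates": [
        "Template:Infobox settlement",
        "Template:Infobox settlement/doc",
        "Category:Country infobox templates",
    ],
    "Category:Country infobox templates": ["Template:Infobox country"],
}

shallow = collect_templates(toy_tree, "Category:Infobox templates", max_depth=1)
deep = collect_templates(toy_tree, "Category:Infobox templates", max_depth=2)
```

A depth limit that is too small misses infoboxes in deep subcategories, while one that is too large pulls in more titles that are not infoboxes, so the limit would need tuning per language.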
  11. Challenges: templates in parameters
     • Source metadata can contain other templates
       • {{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} | title=...}}
       • {{cite web | url= {{NRHP url|id=79000934}} |title=…}}
       • {{cite web | url = {{BillboardURLbyName | artist=garth brooks|bio=true}} | last = Erlewine | first = Stephen Thomas | title = …}}
       • …
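Because a parameter value can itself be a template call, a flat split on "|" and "=" gives wrong results for the examples on this slide. The sketch below shows one possible recursive parser that respects brace depth. All function names are hypothetical, the title value is a placeholder because the slide elides it, and resolving the inner template to an actual URL would still require expanding it, which is out of scope here.

```python
def split_top_level(text, sep, maxsplit=-1):
    """Split text on sep, ignoring separators nested inside {{...}}."""
    parts, start, depth, i = [], 0, 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
        elif text[i] == sep and depth == 0 and maxsplit != 0:
            parts.append(text[start:i])
            start = i + 1
            maxsplit -= 1
            i += 1
        else:
            i += 1
    parts.append(text[start:])
    return parts


def is_single_template(text):
    """True if text is exactly one balanced {{...}} call."""
    if not (text.startswith("{{") and text.endswith("}}")):
        return False
    depth, i = 0, 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i == len(text)
        else:
            i += 1
    return False


def parse_template(text):
    """Parse a template call into {'name': ..., 'params': {...}}; parameter
    values that are themselves single template calls are parsed recursively."""
    text = text.strip()
    if not is_single_template(text):
        raise ValueError("not a single template call")
    parts = split_top_level(text[2:-2], "|")
    params, position = {}, 0
    for part in parts[1:]:
        pieces = split_top_level(part, "=", maxsplit=1)
        if len(pieces) == 2:
            key, value = pieces[0].strip(), pieces[1].strip()
        else:
            position += 1  # unnamed parameters are numbered 1, 2, ...
            key, value = str(position), part.strip()
        if is_single_template(value):
            value = parse_template(value)
        params[key] = value
    return {"name": parts[0].strip(), "params": params}


# First example from the slide; "Example title" replaces the elided title.
parsed = parse_template(
    "{{cite web |url= {{Allmusic | class=artist | id=p44722 | pure_url=yes}} "
    "| title=Example title}}"
)
```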
  12. Challenges: citation templates
     • Apart from general citation templates, there are specific templates that refer to a concrete source (website, encyclopedia, book, article, etc.) and are placed within the <ref> tag
     • Examples in English Wikipedia:
       • {{London Gazette |issue= |date= |page= }}
       • {{NRISref | 2013a | dateform=mdy | accessdate=September 10, 2019 | refnum=66000030 | name=Lincoln Memorial}}
       • {{Iran Census 2006 | 07}}
       • {{GEOnet3 | -3064853}}
       • …
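A first step in handling source-specific templates is telling them apart from the general ones. The sketch below reads the name of the first template inside a reference body and checks it against a short list of general citation templates. The list is deliberately incomplete and the function name is hypothetical; a real extractor would need a per-language list and a mapping from each specific template to the source it stands for.

```python
import re

# A few general-purpose English Wikipedia citation templates (not exhaustive).
GENERAL_CITATION_TEMPLATES = {
    "cite web", "cite book", "cite journal", "cite news", "citation",
}

# Captures the template name: everything after '{{' up to the first '|' or '}}'.
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def classify_reference(ref_body):
    """Return (kind, template_name) for the first template in a <ref> body.

    kind is 'general', 'specific', or 'no template'. Name comparison is
    simplified to lower-casing and treating underscores as spaces.
    """
    match = TEMPLATE_NAME_RE.search(ref_body)
    if match is None:
        return ("no template", None)
    name = match.group(1).replace("_", " ")
    kind = "general" if name.lower() in GENERAL_CITATION_TEMPLATES else "specific"
    return (kind, name)


examples = [
    "{{London Gazette |issue= |date= |page= }}",
    "{{GEOnet3 | -3064853}}",
    "{{cite web | url= {{NRHP url|id=79000934}} |title=Example title}}",
    "[http://example.org Plain external link]",
]
results = [classify_reference(body) for body in examples]
```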
  13. Templates in infobox parameters

     English Wikipedia
     Template name          Number
     Coord                  795314
     Convert                669388
     Cite web               614827
     Birth date and age     550629
     Flag                   304386
     Death date and age     256338
     Birth date             248617
     Flagicon               239973
     Start date             201211
     URL                    177383

     German Wikipedia
     Template name          Number
     Team-Station           384153
     AB                     125220
     0                      115069
     RSIGN                   70032
     Charts                  62709
     USA                     27711
     DEU                     24978
     Medaillenspiegel        22391
     Internetquelle          21693
     Single                  20833

     Calculation using PIRE based on Wikipedia dumps from September 2019.
     More detailed statistics for 10 Wikipedia languages: http://stats.infoboxes.net
  14. Challenges: footnote templates
     • Some citation templates are not placed within a <ref> tag; they generate the tag themselves when the page is rendered.
     • Examples:
       • {{sfn|Solomon|1989|p=24}}
       • {{sfnm | 1a1=Perramon | 1y=1986 | 1p=242 | 2a1=Clendinnen | 2y=2003 | 2pp=3–4 }}
       • {{sfnp|Smith|Jones|Brown|2005|p=25}}
       • …
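Since these templates never appear inside a <ref> tag, an extractor that only scans for <ref> misses them and has to recognise them by name. The sketch below is a minimal parser for the two simpler forms on this slide, {{sfn}} and {{sfnp}}, assuming the positional parameters are author surnames followed by the year. {{sfnm}} uses a different, numbered parameter scheme and is left unhandled. The function name and the returned keys are hypothetical.

```python
def parse_short_footnote(template):
    """Parse {{sfn|...}} or {{sfnp|...}} into authors, year and pages.

    Returns None for anything else, including {{sfnm}}.
    Assumption: the last positional parameter is the year and the preceding
    ones are author surnames; nested templates in values are not supported.
    """
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    fields = [field.strip() for field in text[2:-2].split("|")]
    name = fields[0].lower()
    if name not in ("sfn", "sfnp"):
        return None
    positional = [f for f in fields[1:] if "=" not in f]
    named = {}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            named[key.strip()] = value.strip()
    if not positional:
        return None
    *authors, year = positional
    return {
        "template": name,
        "authors": authors,
        "year": year,
        "pages": named.get("p") or named.get("pp"),
    }


single = parse_short_footnote("{{sfn|Solomon|1989|p=24}}")
multi = parse_short_footnote("{{sfnp|Smith|Jones|Brown|2005|p=25}}")
unsupported = parse_short_footnote("{{sfnm | 1a1=Perramon | 1y=1986 | 1p=242 }}")
```

The parsed author and year only identify a short citation; the full source metadata still has to be looked up in the article's bibliography section.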
  15. Challenges: references not aligned to concrete infobox parameters
  16. Challenges: infobox parameters not represented in wikitext
     • “Infobox German location” automatically transcludes population data from {{Population Germany}}
  17. Future work
     • Improving the extraction algorithm
     • Unifying the parameters of source metadata
     • Assessing the quality of the references
       • Quality/popularity measures of Wikipedia articles (WikiRank.net)
       • Appearance of the source in different databases
       • Assessment of the domain reputation/popularity
     • Finding the best source for specific data
       • Such as population for cities, revenue for companies, etc.
     • Integrating PIRE into the DBpedia Extraction Framework
     • Integrating citation metadata and measures into the GFS Data Browser
  18. Related publications
     • Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics (2019)
     • Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia (2019)
     • Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources (2018)
     • Completeness and Reliability of Wikipedia Infoboxes in Various Languages (2018)
     • Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles (2017)
     • Analysis of References across Wikipedia Languages (2017)
     • Quality and Importance of Wikipedia Articles in Different Languages (2016)
     • Modelling the Quality of Attributes in Wikipedia Infoboxes (2015)
