DBpedia Citation Challenge
(Not only) Polish Citations in Wikipedia:
analysis, comparison, directions
Krzysztof Węcel, Włodzimierz Lewoniewski, Paweł Sobociński
DBpedia Community Meeting, Leipzig, 15.09.2016
Outline
• Extraction
• Linking
• Exploration
• Ranking
Extraction
References and citation templates
<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|volume=25|pages=425–472
|bibcode=1987ARA&A..25..425T
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>
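A minimal sketch of how such a template could be turned into key/value pairs. It is an illustration only, not the code of CitationExtractor or PyCiExtractor; the function names are hypothetical, and the sketch assumes one template per reference with no nested templates beyond simple [[...]] and {{...}} pairs.

```python
import re


def split_top_level(body: str) -> list[str]:
    """Split a template body on '|' characters that are not inside [[...]] or {{...}}."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_citation(wikitext: str) -> dict[str, str]:
    """Return the template name and its named parameters as a dict."""
    match = re.search(r"\{\{(.*)\}\}", wikitext, flags=re.DOTALL)
    if match is None:
        return {}
    name, *fields = split_top_level(match.group(1))
    result = {"template": name.strip().lower()}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:
            # Collapse line breaks and repeated spaces inside the value.
            result[key.strip()] = " ".join(value.split())
    return result


SAMPLE = """<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the
universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>"""

citation = parse_citation(SAMPLE)
print(citation["template"], citation["doi"])
```

Splitting only at depth zero keeps a piped wiki link such as [[Target|label]] inside a single parameter value.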
Citation rendering – external sites
Citation templates
• {{cite web …
• {{cite journal …
• {{cite book …
• {{cite conference
but also
• {{Google books|ID|title|page=|keywords=|text=|plainurl=}}
Citation templates, cont'd
• Polish
– {{cytuj
– {{cytuj stronę …
– {{cytuj pismo …
– {{cytuj książkę …
• German
– {{Literatur …
– {{Internetquelle …
– but also
• {{DOI …
• {{ISSN …
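Because each language edition uses its own template names, an extractor needs some lookup from template name to source type. The table below is a sketch under assumptions: the type labels ("web", "journal", ...) are hypothetical, and only the templates listed on these slides are covered.

```python
# Hypothetical mapping from (lower-cased) template names to a coarse source type.
TEMPLATE_TYPES = {
    # English
    "cite web": "web",
    "cite journal": "journal",
    "cite book": "book",
    "cite conference": "conference",
    "google books": "book",
    # Polish
    "cytuj": "generic",
    "cytuj stronę": "web",
    "cytuj pismo": "journal",
    "cytuj książkę": "book",
    # German
    "literatur": "literature",
    "internetquelle": "web",
}


def citation_type(template_name: str) -> str:
    """Normalise a template name and look up its type; unknown names map to 'other'."""
    # Simplification: MediaWiki treats underscores as spaces; here the whole
    # name is lower-cased, which is more lenient than MediaWiki itself.
    normalised = " ".join(template_name.replace("_", " ").lower().split())
    return TEMPLATE_TYPES.get(normalised, "other")


print(citation_type("Cytuj_książkę"))
```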
Number of templates
Methods
• DBpedia Extraction Framework
– CitationExtractor
• adaptation to Polish citation templates
• hard-coded rules
– several issues
• incorrect titles for some publications
– <http://doi.org/10.1051/aas:1999404> dc:title "3.15576E8"^^<http://dbpedia.org/datatype/second> .
• processing limits
– JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING"
• PyCiExtractor
– own implementation in Python
Specific issues
• titles can vary significantly
• given name and family name are sometimes distinguished
• specific naming of consecutive authors
– first1, last1, first2, last2, …
– imię1, nazwisko1, imię2, nazwisko2, …
• date field
– various formats
• access date is (and should be) different for individual items
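The numbered author parameters can be folded into one ordered author list. The sketch below is an assumption-laden illustration, not the presenters' code: it handles only the four parameter stems named above and treats an unnumbered first/last as author 1.

```python
import re

GIVEN_KEYS = ("first", "imię")
FAMILY_KEYS = ("last", "nazwisko")
# A parameter stem followed by an optional author number, e.g. "last2" or "imię1".
_NUMBERED = re.compile(r"^(\D+?)(\d*)$")


def collect_authors(params: dict[str, str]) -> list[tuple[str, str]]:
    """Return (given name, family name) pairs ordered by author number."""
    authors: dict[int, dict[str, str]] = {}
    for key, value in params.items():
        match = _NUMBERED.match(key.strip().lower())
        if match is None:
            continue
        stem, number = match.group(1), int(match.group(2) or 1)
        if stem in GIVEN_KEYS:
            role = "given"
        elif stem in FAMILY_KEYS:
            role = "family"
        else:
            continue  # not an author parameter
        authors.setdefault(number, {})[role] = value.strip()
    return [
        (authors[n].get("given", ""), authors[n].get("family", ""))
        for n in sorted(authors)
    ]


print(collect_authors({"last": "Trimble", "first": "V.", "date": "1987"}))
```

A missing half of a name is returned as an empty string, which keeps incomplete entries visible for later completeness checks.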
Sample variants of title
Linking
Reuse of attributes
Completeness of attributes
Ontologies/Vocabularies
• bibo:
– The Bibliographic Ontology, http://bibliographic-ontology.org/, 2016
– http://purl.org/ontology/bibo/
• fabio:
– FaBiO, the FRBR-aligned Bibliographic Ontology, http://www.sparontologies.net/ontologies/fabio/source.html, 2016
– http://purl.org/spar/fabio
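To illustrate what a mapping onto bibo: could look like, the sketch below emits N-Triples lines for one parsed citation. The choice of properties (bibo:volume, bibo:pages, bibo:doi, dcterms:title), the class bibo:AcademicArticle, and the DOI-based subject URI are assumptions made for this example; the mapping actually used in the project is not reproduced here.

```python
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed mapping from template parameters to vocabulary properties.
PROPERTY_MAP = {
    "title": DCTERMS + "title",
    "volume": BIBO + "volume",
    "pages": BIBO + "pages",
    "doi": BIBO + "doi",
}


def escape(literal: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(params: dict[str, str]) -> list[str]:
    """Describe a journal citation that has a DOI as a list of N-Triples lines."""
    subject = "<http://doi.org/{}>".format(params["doi"])
    lines = ["{} <{}> <{}AcademicArticle> .".format(subject, RDF_TYPE, BIBO)]
    for key, prop in PROPERTY_MAP.items():
        if key in params:
            lines.append('{} <{}> "{}" .'.format(subject, prop, escape(params[key])))
    return lines


for line in to_ntriples({
    "doi": "10.1146/annurev.aa.25.090187.002233",
    "title": "Existence and nature of dark matter in the universe",
    "volume": "25",
}):
    print(line)
```

Emitting titles as plain string literals also sidesteps the typed-literal problem noted under Methods, where a title ended up as a value in seconds.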
Mappings to ontologies
External citation databases
• benefits and tasks
– disambiguation of reference details
– fusion of references
– real statistics on a publication's citations
– classification of publications (topic, quality, IF, stats)
• dereferencing identifiers:
– DOI, arXiv, bibcode, LCCN, …
• libraries/repositories
– Google Scholar, Mendeley, ResearchGate, BibSonomy, Microsoft Academic Search, many more
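Dereferencing starts with turning an identifier into a resolvable URL. The sketch below only builds URLs and makes no network calls. The URL patterns are assumptions about the public resolvers (doi.org, arxiv.org, the ADS abstract service, lccn.loc.gov) and should be checked before use.

```python
from urllib.parse import quote

# Assumed resolver URL patterns, keyed by citation template parameter.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "arxiv": "https://arxiv.org/abs/{}",
    "bibcode": "https://ui.adsabs.harvard.edu/abs/{}",
    "lccn": "https://lccn.loc.gov/{}",
}


def resolver_urls(params: dict[str, str]) -> dict[str, str]:
    """Build one resolver URL for every known identifier present in params."""
    urls = {}
    for key, pattern in RESOLVERS.items():
        value = params.get(key, "").strip()
        if value:
            # Percent-encode characters such as '&' that occur in bibcodes.
            urls[key] = pattern.format(quote(value, safe="/:.()"))
    return urls


print(resolver_urls({
    "doi": "10.1146/annurev.aa.25.090187.002233",
    "bibcode": "1987ARA&A..25..425T",
}))
```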
Our scenario: WorldCat
• the world’s largest library catalog
• collections of 72,000 libraries in 170 countries
• WorldCat Search API
Exploration
Characteristics of citations
• focus on Polish citations
• other languages for comparison
• several aspects analysed:
– citing templates
– citing articles
– cited domains
• charts
– frequency vs. frequency rank (Zipf's law)
– frequency vs. number of citations
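Both chart types can be derived from a flat list of cited items. The following sketch uses synthetic data, not the Wikipedia figures, and the function names are hypothetical. Under Zipf's law, frequency is roughly proportional to 1/rank, so a least-squares fit in log-log space should give a slope near -1.

```python
import math
from collections import Counter


def rank_frequency(items: list[str]) -> list[tuple[int, int]]:
    """Return (rank, frequency) pairs, rank 1 being the most cited item."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def count_distribution(items: list[str]) -> list[tuple[int, int]]:
    """Return (number of citations, how many items have that number) pairs."""
    return sorted(Counter(Counter(items).values()).items())


def loglog_slope(points: list[tuple[int, int]]) -> float:
    """Least-squares slope of log(frequency) against log(rank); needs >= 2 points."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic example: frequencies 120, 60, 40, 30, 24 follow 120 / rank exactly.
citations = ["a"] * 120 + ["b"] * 60 + ["c"] * 40 + ["d"] * 30 + ["e"] * 24
points = rank_frequency(citations)
print(points, round(loglog_slope(points), 3))
```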
Frequency vs. number of citations (PL)
Observation: Zipf's law is surprisingly accurate
Frequency vs. frequency rank (PL)
Frequency rank – articles (PL)
Observation: Zipf's law works for articles, too
Number of citations – articles (PL)
Frequency rank for domains (PL)
Comment: unique citations, i.e. each counted only once per Wikipedia article
Frequency rank for articles (PL)
Comment: ID'd, i.e. identified citation, e.g. by URL, ISBN or DOI
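The notions of an identified citation and a unique citation can be sketched as a key function plus per-article de-duplication. This is an illustration under assumptions: the order of preference (DOI, then ISBN, then URL), the SHA-1 hash of the normalised title as a fallback, and all function names are hypothetical.

```python
import hashlib
from collections import Counter
from urllib.parse import urlsplit


def citation_key(params: dict[str, str]) -> str:
    """Prefer a real identifier; fall back to a hash of the normalised title."""
    doi = params.get("doi", "").strip().lower()
    if doi:
        return "doi:" + doi
    isbn = "".join(ch for ch in params.get("isbn", "") if ch.isdigit() or ch in "xX")
    if isbn:
        return "isbn:" + isbn.upper()
    url = params.get("url", "").strip()
    if url:
        return "url:" + url
    title = " ".join(params.get("title", "").lower().split())
    return "hash:" + hashlib.sha1(title.encode("utf-8")).hexdigest()


def cited_domain(url: str) -> str:
    """Host name of a cited URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_unique(article_citations: dict[str, list[dict[str, str]]]) -> Counter:
    """Count each citation key at most once per citing article."""
    totals: Counter = Counter()
    for citations in article_citations.values():
        totals.update({citation_key(c) for c in citations})
    return totals


print(cited_domain("http://www.navin.org.np/"))
```

Hashed keys depend on the exact title string, so the title variants mentioned earlier split one publication across several keys.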
Citations by type (PL)
Observation: books seem to dominate in Polish
Citations by type (EN)
Observation: other/hash sources seem to dominate in English
Identification of articles (EN)
Observation: there is probably an issue with hashed articles in English, i.e. no straight line
Comparison: freq rank for domains
Observation: more domains are cited in English
Comparison: freq rank for all articles
Observation: there are more citations in Polish than in English (cited at least 10 times)
New data, all languages - domains
Comment: data extracted using PyCiExtractor; numbers seem to better reflect reality
New data, all languages - articles
Ranking
Wikirank.net
• we are developing a portal for ranking Wikipedia articles in various languages according to quality criteria
• languages: Belarusian, English, French, German, Polish, Russian, Ukrainian
• current modules:
– WikiRank
– Top Articles
– Citation Index
– Websites Rank
http://wikirank.net
WikiRank – sample article
WikiRank – sample article, cont'd
Citation Index
Websites Rank
CiteRank
• a new module whose goal is to rank citations used within various language editions of Wikipedia
http://cite.wikirank.net/ (DBpedia framework)
http://cite2.wikirank.net/ (PyCiExtractor)
Top titles
• still a problem with title extraction
• geography is a dominating topic
• some titles are very popular
• even for frequent references there are plenty of ambiguities
Completeness
• yes, we agree, it might be misleading…
Most cited in Poland
Most cited – plants taxonomy
Details of citation – author name variants
Surprise – 7th place in Polish Wiki
• www.navin.org.np – National Association of Village Development Committees in Nepal (NAVIN)
NAVIN citation – details
Sample article citing NAVIN
Surprise 2 – 1st place in English Wiki
but: 404, link broken
Lessons learnt
• Extraction methods should be improved.
• Mapping to ontologies can be useful for comparison.
• Identification of publications (better than hash) is needed.
• External repositories are not open enough.
• Distributions point at some problems with extraction.
• There are plenty of use cases for analyses of citations.
Citation statistics can improve quality modelling of Wikipedia articles.

Editor's Notes

  • #7 {{Google books|7ydCAAAAIAAJ|History of the Western Insurrection|page=42}} https://en.wikipedia.org/wiki/Template:Google_books news, press,
  • #26 citing templates: one citation can be used many times within an article; citing articles: only unique citations are identified, i.e. one per article; cited domains: many web citations can point to a single source, thus increasing the "rank" of the source
  • #35 They are not evenly distributed
  • #50 There are just so many authors…