Enrichment
of Multilingual Wikipedia
Based on Quality Analysis
Włodzimierz Lewoniewski,
Poznań University of Economics and Business
2
Agenda
• Introduction
• Quality in Wikipedia
• Automatic Assessment of the Quality of Wikipedia Articles
• Quality Measures and Dimensions of Wikipedia Articles
• Building Quality Models for Automatic Quality Assessment
• Quality of Infoboxes
• Enrichment of Wikipedia
• Future Work
3
Introduction
• Department of Information Systems
(DIS) belongs to the Faculty of
Informatics and Electronic Economy,
which is acknowledged as
outstanding by the Accreditation
Committee by Polish Ministry of
Science and Higher Education.
• Head of the department: prof. Witold Abramowicz
kie.ue.poznan.pl
4
Multilinguality of Wikipedia
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative Quality and Popularity Evaluation of
Multilingual Wikipedia Articles. In Informatics (Vol. 4, No. 4, p. 43). Multidisciplinary Digital Publishing Institute.
Which language version(s) have the highest-quality data?
5
Quality of Articles
Colors mark grades that have similar characteristics.
• Wikipedia articles can get quality grades
from users.
• Grading schemes differ between language
versions.
6
Automatic Assessment
of the Quality of Wikipedia Articles
• It is possible to build models for quality assessment of
Wikipedia articles based on different measures using
data mining algorithms.
• There are different approaches, which use various
measures and algorithms to assess quality of articles.
7
Related Work
• Most studies focus on English Wikipedia.
• One of the first studies showed that longer articles in Wikipedia
often have higher quality grades (Blumenstock 2008).
• The best articles often have more images and sections and use a larger
number of references than articles of lower quality (Warncke-Wang
et al., 2013; Węcel et al., 2015; Lewoniewski et al., 2016).
• Characteristics related to the edit history can also help to
predict article quality in Wikipedia (Dalip et al., 2014; Suzuki et
al., 2016; Dang et al., 2016).
8
Measures Distribution
Source: Lewoniewski, W., Węcel, K. (2017). Relative quality assessment of Wikipedia articles in different languages
using synthetic measure. In International Conference on Business Information Systems (pp. 282-292). Springer, Cham.
• Often we can observe a positive
correlation between article
quality and the value of each
measure.
• The figure shows the distribution of
article measures for each quality
class in English Wikipedia.
• To build this chart we used
1,000 randomly chosen articles
from each quality class.
9
Quality Dimensions
Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia.
21st International Conference on Business Information Systems, Berlin. (in press)
Some of the measures related to dimensions:
• Number of references (Credibility)
• Article length (Completeness)
• Number of unique authors (Objectivity,
Relevance)
• Automated Readability Index (Readability)
• Article age (Timeliness, Relevance)
• Number of sections (Style)
• Citation templates (Credibility,
Completeness)
• Many more ...
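As an illustration of one measure from this list, the Automated Readability Index is commonly defined as 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The sketch below assumes a very simple tokenization (regular expressions for words and sentence ends); the function name is hypothetical and this is not necessarily how the measure was computed in the cited work.

```python
import re

def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Assumption: a word is a run of letters, digits or underscores.
    words = re.findall(r"\w+", text)
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)
```

Longer words and longer sentences both raise the score, so a higher value suggests text that is harder to read.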
10
Measures Extraction
• We used different sources to extract
measures for Wikipedia articles.
• Most of the measures are extracted
from Wikimedia dump files
• We developed various applications
to obtain measures (over 200) for all
or selected articles in different
language editions of Wikipedia
11
Database Dumps
• enwiki-latest-pages-meta-current.xml.bz2: recombines all
pages (including articles), current versions only. This file is used
to obtain the majority of the article measures.
• enwiki-latest-pages-articles.xml.bz2: contains articles,
templates, media/file descriptions, and primary meta-pages.
Can also be used to obtain the majority of the article
measures (excluding statistics from discussion pages).
• enwiki-latest-pagelinks.sql.gz: wiki page-to-page link records.
Used for network measures, for example incoming links from
other articles.
• enwiki-latest-categorylinks.sql.gz: wiki category membership
link records. Can be used for the category count measure.
• enwiki-latest-externallinks.sql.gz: wiki external URL link
records. Can be used for the external link count measure.
• enwiki-latest-imagelinks.sql.gz: wiki media/files usage
records. Can be used for the image count measure.
• enwiki-latest-stub-meta-history.xml.gz: contains only
historical revision metadata. Can be used to extract the number of
editors from different groups (bots, anonymous users,
administrators, etc.) and also the number of edits of various
types (e.g. minor edits, edit comments).
• enwiki-latest-iwlinks.sql.gz: interwiki link tracking records.
Can be used to extract the number of unique internal links
(links to other Wikipedia articles).
• enwiki-latest-templatelinks.sql.gz: wiki template inclusion
link records. Used for the template count measure; it also makes it
possible to check whether an article has an infobox.
• enwiki-latest-page.sql.gz: base per-page data (id, title, old
restrictions, etc.). Can be used to extract the last edit time and the page
length in bytes.
• Others…
12
Building the Models
• Quality of articles can be measured using
features related to:
– Content: text length, number of images, sections, references
and others.
– Editors: reputation, network of the users, comparison of edits
and others.
• Quality can be measured as the probability of
belonging to one of the specific classes (groups).
13
Binary Classification
Some of the approaches
divide articles into two
groups:
• Complete: articles with the
highest quality grades (FA, GA)
• Incomplete: other articles,
which have lower quality grades
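The two-group split can be sketched as follows. The grade mapping follows the slide (FA and GA are Complete); everything else is an assumption made for illustration: the toy articles are invented, the two features (log of length and log of reference count) are an arbitrary choice, and plain logistic regression stands in for whichever data mining algorithm a given study used.

```python
import math

COMPLETE_GRADES = {"FA", "GA"}  # the two highest grades named above

def binary_label(grade):
    """Map a quality grade to 1 (Complete) or 0 (Incomplete)."""
    return 1 if grade.upper() in COMPLETE_GRADES else 0

def standardize(rows):
    """Rescale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def predict_proba(weights, bias, x):
    """Probability that an article belongs to the Complete class."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = [predict_proba(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            gradient = sum(e * x[j] for e, x in zip(errors, X)) / len(X)
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * sum(errors) / len(X)
    return weights, bias

# Toy data, invented for illustration: (grade, length in bytes, references).
ARTICLES = [("FA", 60000, 150), ("GA", 40000, 90), ("B", 20000, 30),
            ("C", 9000, 12), ("Start", 3000, 3), ("Stub", 800, 0)]

features = standardize([[math.log1p(length), math.log1p(refs)]
                        for _, length, refs in ARTICLES])
labels = [binary_label(grade) for grade, _, _ in ARTICLES]
weights, bias = train_logistic(features, labels)
probabilities = [predict_proba(weights, bias, x) for x in features]
```

Each value in `probabilities` can be read as the estimated chance that the article is Complete, which matches the idea of measuring quality as the probability of belonging to a class.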
14
Measures Importance
Source: Węcel, K., Lewoniewski, W. (2015). Modelling the quality of attributes in Wikipedia infoboxes.
In International Conference on Business Information Systems (pp. 308-320). Springer, Cham.
15
Extended Assessment
• To build a model we can use more than two
categories, depending on quality grades
– Number of categories can be different in each
language
• ORES score
– Only for selected language versions
• Synthetic Measure
– For all Wikipedia languages that have the highest
grade (FA equivalent)
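One possible way to build a synthetic measure is sketched below. It is an assumption made for illustration and not necessarily the formula used in the cited papers: each measure is divided by the median value observed among the highest-grade (FA-equivalent) articles of the same language, capped at 100, and the results are averaged. The function names are hypothetical.

```python
from statistics import median

def reference_thresholds(best_articles):
    """Median of each measure over the highest-grade articles of one language."""
    names = best_articles[0].keys()
    return {name: median(article[name] for article in best_articles)
            for name in names}

def synthetic_measure(article, thresholds):
    """Average of the normalized measures, on a 0-100 scale."""
    scores = []
    for name, threshold in thresholds.items():
        if threshold <= 0:
            continue  # a zero threshold carries no information
        ratio = article.get(name, 0) / threshold
        scores.append(min(ratio, 1.0) * 100)  # cap each measure at 100
    return sum(scores) / len(scores) if scores else 0.0
```

Because the thresholds are computed per language, a score built this way only requires that the language version has a highest grade, which is the condition stated above.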
16
WikiRank
Source: https://wikirank.net/en/Lviv
17
Article and Infobox Quality
• Completeness
• Credibility
• Objectivity
• Readability
• Relevance
• Style
• Timeliness
• …
• Completeness
• Credibility
• Relevance
• Timeliness
• …
Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia.
21st International Conference on Business Information Systems, Berlin. (in press)
18
Simple Measures for Infobox
Source: Lewoniewski, W. (2017). Completeness and Reliability of Wikipedia Infoboxes in Various Languages.
In International Conference on Business Information Systems (pp. 295-305). Springer, Cham.
19
Infobox Timeliness
• Timeliness measures can be related to currency
and volatility of the infoboxes.
• Example: history of changes in the "leader name"
parameter of the Poznań infobox
Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia.
21st International Conference on Business Information Systems, Berlin. (in press)
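Currency and volatility can be sketched for a single infobox parameter as follows. The definitions here are assumptions made for illustration (currency as the number of days since the last change, volatility as the average number of changes per year), and the function names are hypothetical; the cited paper may define both differently.

```python
from datetime import date

def currency_days(change_dates, today):
    """Days elapsed since the parameter value was last changed."""
    return (today - max(change_dates)).days

def volatility_per_year(change_dates):
    """Average number of value changes per year over the observed period."""
    first, last = min(change_dates), max(change_dates)
    span_years = (last - first).days / 365.25
    if span_years == 0:
        return 0.0
    # The first date is the initial value, so it is not counted as a change.
    return (len(change_dates) - 1) / span_years
```

The input would be the list of dates on which the parameter value changed in the article's revision history.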
20
Correlation of Measures
related to Articles and Infoboxes
Source: Lewoniewski, W. (2017). Enrichment of information in multilingual Wikipedia based on quality analysis.
In International Conference on Business Information Systems (pp. 216-227). Springer, Cham.
21
Infoboxes.net
Source: http://infoboxes.net
22
WikiBest
Users can vote for the best
infobox in four nominations:
• the best quality
• the best completeness
• the best credibility
• the best timeliness
Source: https://wikibest.net
23
Enrichment of Wikipedia
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative Quality and Popularity Evaluation of Multilingual
Wikipedia Articles. In Informatics (Vol. 4, No. 4, p. 43). Multidisciplinary Digital Publishing Institute.
24
Potential for New Articles
• Although English Wikipedia is the largest, it can also be
enriched using other language versions.
• The table below presents the potential number of articles in each language
and each topic that can be created or enriched using infoboxes from
other language versions of Wikipedia.
Source: Lewoniewski, W. (2017). Completeness and Reliability of Wikipedia Infoboxes in Various Languages.
In International Conference on Business Information Systems (pp. 295-305). Springer, Cham.
25
Future work
• Expanding the number of measures (including linguistic ones) for
predicting the quality of Wikipedia articles.
• Sentiment analysis of Wikipedia articles.
• Fact extraction from the content in different languages.
• Improving projects related to the quality of Wikipedia.
• Analysis of reference measures at various levels of granularity:
– host, domain, path and URL.
• Detection of language-sensitive topics in Wikipedia.
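The granularity levels for references can be sketched with the standard library. The function names are hypothetical, and the domain rule is a naive assumption (the last two labels of the host), which is wrong for suffixes such as "co.uk"; a real analysis would need a public suffix list.

```python
from collections import Counter
from urllib.parse import urlsplit

def reference_levels(url):
    """Split a reference URL into the levels named above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {"url": url, "host": host, "domain": domain, "path": parts.path}

def count_by_level(urls, level):
    """Count references grouped by one level: 'url', 'host', 'domain' or 'path'."""
    return Counter(reference_levels(url)[level] for url in urls)
```

Counting at the domain level groups together references that differ only in subdomain or path, while the URL level treats every distinct address separately.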
26
Thank You
Additional
Information
28
Unification of Infobox Parameters
Source: Lewoniewski, W., Kasprzak, A., Węcel, K., Abramowicz, W., (2018),
Kompletność danych o produktach w różnych wersjach językowych Wikipedii (in press)
29
SEO Measures
Source: Lewoniewski, W., Härting, R. C., Wecel, K., Reichstein, C., Abramowicz, W. (2018). Application of SEO Metrics to Determine the Quality
of Wikipedia Articles and Their Sources. In International Conference on Information and Software Technologies (pp. 139-152). Springer, Cham.
Mean Visibility Index from
the perspectives of different countries:
Mean of each social
indicator:
30
Linguistic Measures
• In Polish Wikipedia we
extracted over 100 linguistic
measures of articles
• The model achieves over 93%
classification precision
The most important features:
• impersonal verbs,
• third person words,
• unique nouns,
• unique verbs.
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2018). Determining Quality of Articles in Polish Wikipedia Based on
Linguistic Features. In International Conference on Information and Software Technologies (pp. 546-558). Springer, Cham.
31
Fact Extraction
Source: Khairova, N., Lewoniewski, W., & Węcel, K. (2017). Estimating the quality of articles in Russian Wikipedia using the
logical-linguistic model of fact extraction. In International Conference on Business Information Systems (pp. 28-40). Springer, Cham.
• Logical-linguistic model of fact extraction in Russian texts
• Density of simple and complex facts can determine the quality of Wikipedia articles
32
Quality and Importance Models
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2016). Quality and importance of Wikipedia articles in different languages.
In International Conference on Information and Software Technologies (pp. 613-624). Springer, Cham.
33
Analysis of References in Wikipedia
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Analysis of references across Wikipedia languages.
In International Conference on Information and Software Technologies (pp. 561-573). Springer, Cham.
Figure panels: overlaps of unique references; overlaps of domains of references

Enrichment of multilingual Wikipedia based on quality analysis

  • 1. Enrichment of Multilingual Wikipedia Based on Quality Analysis Włodzimierz Lewoniewski, Poznań University of Economics and Business
  • 2. 2 Agenda • Introduction • Quality in Wikipedia • Automatic Assessment of the Quality of Wikipedia Articles • Quality Measures and Dimensions of Wikipedia Articles • Building Quality Models for Automatic Quality Assessment • Quality of Infoboxes • Enrichment of Wikipedia • Future Work
  • 3. 3 Introduction • Department of Information Systems (DIS) belongs to the Faculty of Informatics and Electronic Economy, which is acknowledged as outstanding by the Accreditation Committee by Polish Ministry of Science and Higher Education. • Head of the department: prof. Witold Abramowicz kie.ue.poznan.pl
  • 4. 4 Multilinguality of Wikipedia Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. In Informatics (Vol. 4, No. 4, p. 43). Multidisciplinary Digital Publishing Institute. Which language version(s) has the data with the highest quality?
  • 5. 5 Quality of Articles Colors are marked grades that have similar characteristics • Wikipedia articles can get quality grades from users. • There are differences between grading schemes in language versions
  • 6. 6 Automatic Assessment of the Quality of Wikipedia Articles • It is possible to build models for quality assessment of Wikipedia articles based on different measures using data mining algorithms. • There are different approaches, which use various measures and algorithms to assess quality of articles.
  • 7. 7 Related Work • Most of the works focus on English Wikipedia • One of the first studies showed that longer articles in Wikipedia often have higher quality grades (Blumenstock 2008). • Often the best articles have more images, sections, use bigger number of references than articles with lower quality (Warncke- Wang et al., 2013; Węcel et al., 2015; Lewoniewski et al., 2016). • Characteristics related to and edition history can also help to predict articles quality in Wikipedia (Dalip et al., 2014; Suzuki et al., 2016; Dang et al., 2016)
  • 8. 8 Measures Distribution Source: Lewoniewski, W., Węcel, K. (2017). Relative quality assessment of Wikipedia articles in different languages using synthetic measure. In International Conference on Business Information Systems (pp. 282-292). Springer, Cham. • Often we can observe a positive correlation between the article quality and the value of each measures. • Figure show distribution of articles measures of each quality class in English Wikipedia. • To build this chart we use randomly chosen 1000 articles from each quality class.
  • 9. 9 Quality Dimensions Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. 21st International Conference on Business Information Systems, Berlin. (in press) Some of the measures related to dimensions: • Number of references (Credibility) • Articles length (Completeness) • Number of unique authors (Objectivity, Relevance) • Automated Readability Index (Readability) • Articles age (Timeliness, Relevance) • Number of the sections (Style) • Citation templates (Credibility, Completeness) • Many more ...
  • 10. 10 Measures Extraction • We used different sources to extract measures for Wikipedia articles. • Most of the measures are extracted from Wikimedia dump files • We developed various applications to obtain measures (over 200) for all or selected articles in different language editions of Wikipedia
  • 11. 11 Database Dumps • enwiki-latest-pages-meta-current.xml.bz2: recombine all pages (including articles), current versions only. This file is used for obtaining a majority of the articles measures. • enwiki-latest-pages-articles.xml.bz2: consist articles, templates, media/file descriptions, and primary meta-pages. Can be used also for obtaining a majority of the articles measures (excluding statistics from discussion pages). • enwiki-latest-pagelinks.sql.gz: wiki page-to-page link records. Used for network measures - for example incoming links from other articles. • enwiki-latest-categorylinks.sql.gz: wiki category membership link records. Can be used for category count measure. • enwiki-latest-externallinks.sql.gz: wiki external URL link records. can be used for external link count measure. • enwiki-latest-imagelinks.sql.gz: wiki media/files usage records. Can be used to image count measure. • enwiki-latest-stub-meta-history.xml.gz: contain only historical revision metadata. Can be used to extract number of the editors from different groups (bots, anonymous users, administartors etc.) and alsa number of the edits of various types (e.g. minor edits, edits comments). • enwiki-latest-iwlinks.sql.gz: Interwiki link tracking records. Can be used to extract number of the unique internal links (links to other Wikipedia articles). • enwiki-latest-templatelinks.sql.gz: Wiki template inclusion link records. Used for templates count measure, also it is possible to check if article has infobox • enwiki-latest-page.sql.gz: base per-page data (id, title, old restrictions, etc). Can be used to extract last edit time, page length in bytes. • Other….
  • 12. 12 Building the Models • Quality of articles can be measured using features related to: – Content: text length, number of images, sections, references and others. – Editors: reputation, network of the users, comparison of edits and others. • Quality can be measured as the probability of belonging to one of the specific classes (groups).
  • 13. 13 Binary Classification Some of the approaches divide articles into two groups: • Complete: articles with the highest quality grades (FA, GA) • Incomplete: other articles, which have lower quality grades
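The two-group split described above can be expressed as a small labeling step applied before model training. The sketch below is illustrative only: the function name is hypothetical, and the lower grades in the example are the English Wikipedia grade names, which differ in other language versions.

```python
# Grades treated as "Complete" on the slide: Featured Article and Good Article.
COMPLETE_GRADES = {"FA", "GA"}


def binary_label(grade: str) -> str:
    """Map a quality grade to the binary class used as the training target."""
    return "Complete" if grade.strip().upper() in COMPLETE_GRADES else "Incomplete"


# Example with English Wikipedia grade names (other languages use other schemes).
labels = {g: binary_label(g) for g in ["FA", "GA", "B", "C", "Start", "Stub"]}
```

For other language versions, `COMPLETE_GRADES` would need to hold that language's equivalents of the highest grades.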
  • 14. 14 Measures Importance Source: Węcel, K., Lewoniewski, W. (2015). Modelling the quality of attributes in Wikipedia infoboxes. In International Conference on Business Information Systems (pp. 308-320). Springer, Cham.
  • 15. 15 Extended Assessment • To build a model we can use more than two categories, depending on quality grades – Number of categories can be different in each language • ORES score – Only for selected language versions • Synthetic Measure – For all Wikipedia languages that have the highest grade (FA equivalent)
  • 17. 17 Article and Infobox Quality Article: • Completeness • Credibility • Objectivity • Readability • Relevance • Style • Timeliness • … Infobox: • Completeness • Credibility • Relevance • Timeliness • … Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. 21st International Conference on Business Information Systems, Berlin. (in press)
  • 18. 18 Simple Measures for Infobox Source: Lewoniewski, W. (2017). Completeness and Reliability of Wikipedia Infoboxes in Various Languages. In International Conference on Business Information Systems (pp. 295-305). Springer, Cham.
  • 19. 19 Infobox Timeliness • Timeliness measures can be related to currency and volatility of the infoboxes. • Example: history of changes in the ''leader name'' parameter of the Poznań infobox Source: Lewoniewski, W. (2018). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. 21st International Conference on Business Information Systems, Berlin. (in press)
  • 20. 20 Correlation of Measures related to Articles and Infoboxes Source: Lewoniewski, W. (2017). Enrichment of information in multilingual Wikipedia based on quality analysis. In International Conference on Business Information Systems (pp. 216-227). Springer, Cham.
  • 22. 22 WikiBest Users can vote for the best infobox in four nominations: • the best quality • the best completeness • the best credibility • the best timeliness Source: https://wikibest.net
  • 23. 23 Enrichment of Wikipedia Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. In Informatics (Vol. 4, No. 4, p. 43). Multidisciplinary Digital Publishing Institute.
  • 24. 24 Potential for New Articles • Although English Wikipedia is the largest, it can also be enriched from other language versions. • The table on this slide presents the potential number of articles in each language and each topic that can be created or enriched using infoboxes from other language versions of Wikipedia. Source: Lewoniewski, W. (2017). Completeness and Reliability of Wikipedia Infoboxes in Various Languages. In International Conference on Business Information Systems (pp. 295-305). Springer, Cham.
  • 25. 25 Future work • Expanding the number of measures (including linguistic ones) for predicting the quality of Wikipedia articles. • Sentiment analysis of Wikipedia articles. • Fact extraction from the content in different languages. • Improving projects related to the quality of Wikipedia. • Analysis of reference measures at various levels of granularity: host, domain, path and URL. • Detection of language-sensitive topics in Wikipedia.
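To make the reference granularity levels mentioned above concrete, here is a small sketch that splits a reference URL into host, domain and path with the standard library. The function name is hypothetical, and the domain rule (last two host labels) is a naive assumption: it is wrong for suffixes such as `co.uk`, where a public suffix list would be needed.

```python
from urllib.parse import urlsplit


def reference_levels(url: str) -> dict:
    """Split a reference URL into the granularity levels: url, host, domain, path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels of the host.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {"url": url, "host": host, "domain": domain, "path": parts.path}


levels = reference_levels("https://www.example.com/news/2018/item.html?id=1")
```

Counting references per level then reduces to grouping these dictionaries by the chosen key.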
  • 28. 28 Unification of Infobox Parameters Source: Lewoniewski, W., Kasprzak, A., Węcel, K., Abramowicz, W., (2018), Kompletność danych o produktach w różnych wersjach językowych Wikipedii (in press)
  • 29. 29 SEO Measures Source: Lewoniewski, W., Härting, R. C., Wecel, K., Reichstein, C., Abramowicz, W. (2018). Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources. In International Conference on Information and Software Technologies (pp. 139-152). Springer, Cham. Figures: mean Visibility Index from the perspectives of different countries; mean of each social indicator.
  • 30. 30 Linguistic Measures • In Polish Wikipedia we extracted over 100 linguistic measures of articles. • The model achieves over 93% classification precision. The most important features: • impersonal verbs, • third person words, • unique nouns, • unique verbs. Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2018). Determining Quality of Articles in Polish Wikipedia Based on Linguistic Features. In International Conference on Information and Software Technologies (pp. 546-558). Springer, Cham.
  • 31. 31 Fact Extraction Source: Khairova, N., Lewoniewski, W., & Węcel, K. (2017). Estimating the quality of articles in Russian Wikipedia using the logical-linguistic model of fact extraction. In International Conference on Business Information Systems (pp. 28-40). Springer, Cham. • Logical-linguistic model of fact extraction in Russian texts • Density of simple and complex facts can determine the quality of Wikipedia articles
  • 32. 32 Quality and Importance Models Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2016). Quality and importance of Wikipedia articles in different languages. In International Conference on Information and Software Technologies (pp. 613-624). Springer, Cham.
  • 33. 33 Analysis of References in Wikipedia Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Analysis of references across Wikipedia languages. In International Conference on Information and Software Technologies (pp. 561-573). Springer, Cham. Figures: overlaps of unique references; overlaps of domains of references.