What Wikidata teaches us about knowledge engineering
Date: 2023-01-06
WHAT WIKIDATA
TEACHES US
ABOUT
KNOWLEDGE
ENGINEERING
Elena Simperl
King’s College London, UK
@esimperl
ISWS2022
July 2022
OVERVIEW
Knowledge graphs have
become a critical AI resource
We study them as socio-technical constructs
Our research
Explores knowledge-graph engineering
empirically to understand practices and
improve processes and tools
Uncovers links between social and technical
qualities of knowledge graphs to make them
better
Picture from https://medium.com/@sderymail/challenges-of-knowledge-graph-part-1-d9ffe9e35214
KNOWLEDGE ENGINEERING AS HUMAN-MACHINE PROCESS
Human-machine interactions
Knowledge engineering communities
WIKIDATA
Collaborative knowledge graph,
Wikimedia project (2012),
23k active users, 99m items, 1.6b edits
Open license
RDF support, links to LOD cloud
THE KNOWLEDGE GRAPH
STATEMENTS, ITEMS, PROPERTIES
Item identifiers start with a Q, property identifiers
start with a P
Example: Q84 (London), P6 (head of government), Q334155 (Sadiq Khan)
THE KNOWLEDGE GRAPH
ITEMS CAN BE CLASSES, ENTITIES, VALUES
Example items: Q7259 (Ada Lovelace), Q84 (London), Q334155 (Sadiq Khan), Q727 (Amsterdam), Q515 (city), Q6581097 (male), Q59360 (Labour party), Q145 (United Kingdom)
Example property: P6 (head of government)
THE KNOWLEDGE GRAPH
STATEMENTS IN CONTEXT
Statements may include context
Qualifiers (optional)
References (required)
Two types of references
Internal, linking to another item
External, linking to webpage
Example: Q84 (London), P6 (head of government), Q334155 (Sadiq Khan); qualifier: 9 May 2016; reference: london.gov.uk
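The statement model on this slide can be sketched as a small data structure. This is only an illustration, not Wikidata's actual data model or API: the class names `Statement` and `Reference`, the helper `is_internal`, and the reading of the date as a "start time" qualifier are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """A source backing a statement (hypothetical structure)."""
    target: str  # a Q-identifier (internal) or a web address (external)

    def is_internal(self) -> bool:
        # Assumption: internal references point to another item, so they
        # look like "Q" followed by digits; everything else is external.
        return self.target.startswith("Q") and self.target[1:].isdigit()


@dataclass
class Statement:
    """Item, property and value, plus optional context."""
    item: str   # item identifier, starts with Q
    prop: str   # property identifier, starts with P
    value: str  # here another item; could also be a literal value
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)


# The example from the slide: London's head of government is Sadiq Khan.
stmt = Statement(
    item="Q84",
    prop="P6",
    value="Q334155",
    qualifiers={"start time": "2016-05-09"},  # assumed meaning of the date
    references=[Reference("london.gov.uk")],
)

external = [r.target for r in stmt.references if not r.is_internal()]
print(external)  # ['london.gov.uk']
```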
THE KNOWLEDGE GRAPH
CO-EDITED BY BOTS AND HUMANS
Human editors can register or work anonymously
Community creates bots to automate routine tasks
23k active human users, 340+ bots
IN THIS TALK
Assembling your KG team
KG quality as a function of
data provenance
Supporting editors and
improving retention
‘ONTOLOGIES ARE US’
Piscopo, A., Phethean, C., & Simperl, E. (2017). What Makes a
Good Collaborative Knowledge Graph: Group Composition and
Quality in Wikidata. International Conference on Social
Informatics, 305-322, Springer.
Piscopo, A., & Simperl, E. (2018). Who Models the World?:
Collaborative Ontology Creation and User Roles in
Wikidata. Proceedings of the ACM on Human-Computer
Interaction, 2(CSCW), 141.
BACKGROUND
Editors have varied tenure and interests
Their profile and editing behaviour impact
outcomes
Group composition has varied effects
Tenure and interest diversity can increase outcome
quality and group productivity
Different editor groups focus on different types of
activities
Chen, J., Ren, Y., Riedl, J.: The effects of diversity on group productivity and member withdrawal in online volunteer groups. In: Proceedings of the 28th international
conference on human factors in computing systems - CHI ’10. p. 821. ACM Press, New York, USA (2010)
STUDY 1: ITEM QUALITY
Analysed edit history of items
Corpus of 5k items, whose quality has been
manually assessed (5 levels)*
Edit history focused on community make-up
Community is defined as set of editors of item
Considered features from group diversity
literature and Wikidata-specific aspects
*https://www.wikidata.org/wiki/Wikidata:Item_quality
DATA AND METHODS
Ordinal regression analysis, trained four models
Dependent variable: quality label of the 5k labelled Wikidata items
Independent variables
Proportion of bot edits
Bot human edit proportion
Proportion of anonymous edits
Tenure diversity: Coefficient of variation
Interest diversity: User editing matrix
Control variables: group size, item age
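As a rough sketch of how some of these independent variables could be derived from an item's edit history: the tuple layout, the editor-kind labels, tenure measured in days, and the use of the population standard deviation are assumptions for illustration, not details taken from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def group_features(edits):
    """edits: list of (editor_id, kind, tenure_days), one tuple per edit.

    kind is "bot", "anonymous" or "registered" (hypothetical labels).
    """
    n_edits = len(edits)
    kinds = [kind for _, kind, _ in edits]
    # One tenure value per registered editor, not one per edit.
    tenure = {eid: days for eid, kind, days in edits if kind == "registered"}
    return {
        "bot_edit_proportion": kinds.count("bot") / n_edits,
        "anonymous_edit_proportion": kinds.count("anonymous") / n_edits,
        "tenure_diversity": (
            coefficient_of_variation(list(tenure.values())) if tenure else 0.0
        ),
        "group_size": len({eid for eid, _, _ in edits}),
    }


# Toy edit history for one item (invented values).
history = [
    ("bot1", "bot", 0),
    ("bot1", "bot", 0),
    ("alice", "registered", 100),
    ("bob", "registered", 300),
]
features = group_features(history)
print(features)
```

In this toy history half of the edits come from a bot, and the two registered editors have tenures of 100 and 300 days, giving a coefficient of variation of 0.5.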
RESULTS
ALL HYPOTHESES SUPPORTED
[Figure: results for hypotheses H1 to H5]
SUMMARY AND IMPLICATIONS
Summary
01 The more is not always the merrier
02 Bot edits are key for quality, but bots and humans together are better
03 Registered editors have a positive impact
04 Diversity matters
Implications
01 Encourage registration
02 Identify further areas for bot editing
03 Design effective human-bot workflows
04 Suggest items to edit based on tenure and interests
ONGOING WORK
Analysing how
the community
awards quality
labels to items
Developing
robust machine
learning models
Spotting biases
STUDY 2: ONTOLOGY QUALITY
Analysed Wikidata ontology and its edit
context
Defined as the graph of all items linked through
P31 (instance of) & P279 (subclass of)
Calculated evolution of quality indicators and
editing activity in time and the links between them
Based on features from literature on ontology
evaluation and collaborative ontology engineering
DATA AND METHODS
Wikidata dumps from March 2013 (creation of P279)
to September 2017
Analysed data in 55 monthly time frames
Literature survey to define Wikidata ontology quality
framework
Clustering to identify ontology editor roles
Lagged multiple regression to link roles and ontology
features
Dependent variable: Changes in ontology quality across time
Independent variables: Number of edits by different roles
Control variables: Bot and anonymous edits
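The idea of a lagged regression can be illustrated with a deliberately simplified, single-predictor version: the change in a quality indicator during one month is regressed on the edits made by one role in the previous month. The study itself used multiple regression with several roles and control variables; the function name `lagged_ols` and the toy series below are assumptions for illustration only.

```python
def lagged_ols(x, y, lag=1):
    """Regress the change in y ending at month t on x at month t - lag.

    Returns (slope, intercept) from ordinary least squares.
    """
    if len(x) != len(y):
        raise ValueError("x and y must cover the same months")
    start = max(lag, 1)
    xs = [x[t - lag] for t in range(start, len(y))]
    dys = [y[t] - y[t - 1] for t in range(start, len(y))]
    mean_x = sum(xs) / len(xs)
    mean_dy = sum(dys) / len(dys)
    sxx = sum((v - mean_x) ** 2 for v in xs)
    sxy = sum((v - mean_x) * (d - mean_dy) for v, d in zip(xs, dys))
    slope = sxy / sxx
    return slope, mean_dy - slope * mean_x


# Toy monthly series (invented numbers, not Wikidata measurements).
leader_edits = [10, 20, 15, 30, 25, 40]
indicator = [0, 21, 62, 93, 154, 205]
slope, intercept = lagged_ols(leader_edits, indicator, lag=1)
print(slope, intercept)
```

The toy indicator was constructed so that each monthly change equals twice the previous month's edit count plus one, which is what the fit recovers.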
ONTOLOGY QUALITY: INDICATORS
Based on 7 ontology evaluation frameworks
Compiled structural indicators that can be computed from the data
Indicator: Description
noi: Number of instances
noc: Number of classes
norc: Number of root classes
nolc: Number of leaf classes
nop: Number of properties
ap; mp: Average and median population
rr: Relationship richness
ir, mr: Inheritance and median richness
cr: Class richness
ad, md, maxd: Average, median, max explicit depth
Sicilia, M. A., Rodríguez, D., García-Barriocanal, E., & Sánchez-Alonso, S. (2012). Empirical findings on ontology metrics. Expert Systems with
Applications, 39(8), 6706-6711.
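A few of these structural indicators can be computed directly from P31 and P279 edges. The sketch below is one possible reading of the indicator names, not the study's implementation: it assumes ir is the average number of direct subclasses per class, ap the average number of instances per class, and cr the share of classes with at least one instance. The class names "settlement" and "village" are placeholders, not real identifiers.

```python
from collections import defaultdict


def ontology_indicators(instance_of, subclass_of):
    """instance_of: (item, class) pairs from P31 statements.
    subclass_of: (child, parent) pairs from P279 statements."""
    classes = {cls for _, cls in instance_of}
    children = defaultdict(set)
    parents = defaultdict(set)
    for child, parent in subclass_of:
        classes.update((child, parent))
        children[parent].add(child)
        parents[child].add(parent)
    instances = defaultdict(set)
    for item, cls in instance_of:
        instances[cls].add(item)

    noc = len(classes)
    denom = max(noc, 1)  # avoid division by zero on empty input
    return {
        "noi": len({item for item, _ in instance_of}),
        "noc": noc,
        "norc": sum(1 for c in classes if not parents[c]),
        "nolc": sum(1 for c in classes if not children[c]),
        "ir": sum(len(children[c]) for c in classes) / denom,
        "ap": sum(len(instances[c]) for c in classes) / denom,
        "cr": sum(1 for c in classes if instances[c]) / denom,
    }


# Toy data: two items typed as Q515 (city), plus placeholder classes.
p31 = [("Q84", "Q515"), ("Q727", "Q515")]
p279 = [("Q515", "settlement"), ("village", "settlement")]
indicators = ontology_indicators(p31, p279)
print(indicators)
```

In the toy graph only one of three classes has instances, so cr is 1/3; empty classes like these pull class richness and average population down.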
ONTOLOGY QUALITY: RESULTS
LARGE ONTOLOGY, UNEVEN QUALITY
>1.5M classes, ~4000 properties
Number of classes increases at the same rate as the
overall number of items, likely due to users incorrectly
using P31 & P279
ap and cr decrease over time, several classes
are either without instances, sub-classes or both
ir & maxd increase over time, part of the
Wikidata ontology is distributed vertically
EDITOR ROLES: METHODS
K-means, features based on previous studies in ontology engineering
Analysis by yearly cohort
Feature: Description
# edits: Total number of edits per month.
# property edits: Total number of edits on Properties in a month.
# ontology edits: Number of edits on classes.
# taxonomy edits: Number of edits on P31 and P279 statements.
# discussion edits: Number of edits on talk pages.
p batch edits: Number of edits done through automated tools.
# modifying edits: Number of revisions on previously existing statements.
item diversity: Proportion between number of edits and number of items edited.
admin: True if user in an admin user group, false otherwise.
lower admin: True if user in a user group with enhanced user rights, false otherwise.
EDITOR ROLES: RESULTS
190,765 unique editors over 55 months (783k
total)
18k editors active for 10+ months
2 clusters, obtained using gap statistic (tested
2 ≤ k ≤ 8)
Leaders: more active minority (~1%), higher
number of contributions to ontology, engaged
within the community
Contributors: less active, lower number of
contributions to ontology and lower proportion of
batch edits
EDITOR ROLES: RESULTS
People who joined the project early tend to be
more active & are more likely to become leaders
Levels of activity of leaders decrease over time
(alternatively, people move on to different tasks)
RESEARCH HYPOTHESES
H1 Higher levels of leader activity are negatively correlated to
number of classes (noc), number of root classes (norc), and
number of leaf classes (nolc)
H2 Higher levels of leader activity are positively correlated to
inheritance richness (ir), average population (ap), and average
depth (ad)
ROLES & ONTOLOGY: RESULTS
H1 not supported
H2 partially supported
Only inheritance richness (ir) and average depth (ad)
are significantly related to leader edits (p<0.01)
Bot edits significantly and positively affect the number of
subclasses and instances per class (ir & ap) (p<0.05)
CONCLUSIONS
Creating ontologies still a challenging task
Size of ontology renders existing automatic quality
assessment methods unfeasible
Broader curation efforts needed: large number of empty
classes
Editor roles less well articulated than in other ontology
engineering projects
Possible decline in motivation after several months
ONGOING WORK
The role of discussions in
Wikidata
Studying participation
pathways to observe
learning effects
Analysis of ShEx
schemas
THE CONTENT IS AS
GOOD AS ITS
REFERENCES
Amaral, G., Piscopo, A., Kaffee, L. A., Rodrigues, O., &
Simperl, E. (2021). Assessing the quality of sources in
Wikidata across languages: a hybrid approach. ACM
JDIQ 13(4), 1-35.
PROVENANCE IN WIKIDATA
Statements may include context
Qualifiers (optional)
References (required)
Two types of references
Internal, linking to another item
External, linking to webpage
Example: Q84 (London), P6 (head of government), Q334155 (Sadiq Khan); qualifier: 9 May 2016; reference: london.gov.uk
THE ROLE OF PROVENANCE
Wikidata aims to become a hub of references
Provenance increases trust in Wikidata
Lack of provenance hinders downstream reuse
Quality of references underexplored
Hartig, O. (2009). Provenance Information in the Web of Data. LDOW, 538.
STUDY 3
Approach to evaluate quality of external and
internal references in Wikidata
Quality defined by the Wikidata verifiability policy
Relevant: support the statement they are attached to
Authoritative: trustworthy, up-to-date, and free of bias for supporting a
particular statement
Easy to access: users should be able to access the information in references with
low perceived effort
Large-scale (the whole of Wikidata)
Multiple languages (EN, ES, PT, SV, NL, JA)
RESEARCH QUESTIONS
RQ1 How easy is it to access sources?
RQ2 How authoritative are sources?
According to existing guidance, do they match the
author and publisher types from the Wikidata policy?
RQ3 Are sources relevant to claims?
RESEARCH QUESTIONS
RQ4 Are external and internal references of
varying quality?
RQ5 Are references in different languages of
varying quality?
RQ6 Can we predict relevance, authoritativeness
and ease of use of a reference without relying on
the content of the references themselves? Can we
do this for references in different languages?
METHODS
THREE STAGE MIXED APPROACH
1. Descriptive statistics
Understand differences in external/internal references
Determine data distributions
Language
Predicates
Domains
Prepare data to crowdsource
METHODS
THREE STAGE MIXED APPROACH
2. Microtask crowdsourcing (RQs 1-5)
Evaluate relevance, authoritativeness & ease
of access of a reference sample
Create training set for ML
3. Machine learning (RQ6)
Large-scale reference quality prediction
MICROTASK CROWDSOURCING
2 tasks on Amazon’s Mechanical Turk
5 workers/task, majority voting
Quality assurance through golden data & attention checks
Feature: Relevance / Ease of access
T1.1 Does the reference support the statement?
T1.2 How easy was the navigation to verify the reference?
T1.3 What was the biggest barrier to access, if any?
Feature: Authoritativeness
T2.1 Choose author type from list
T2.2 Choose publisher type from list
T2.3 Choose publisher sub-type from list
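Aggregating five worker answers by majority voting can be sketched as follows. How ties or missing majorities were handled in the study is not stated on the slide; returning `None` in that case is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the answer chosen by more than half of the workers, else None."""
    label, count = Counter(answers).most_common(1)[0]
    return label if count > len(answers) / 2 else None


# Five workers judging T1.1 for one reference (invented answers).
t11 = majority_vote(["yes", "yes", "no", "yes", "no"])
# Multiple-choice questions can end without a majority.
t21 = majority_vote(["individual", "individual", "organization",
                     "organization", "collective"])
print(t11, t21)  # yes None
```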
MACHINE LEARNING
Compared 5 algorithms
Naïve Bayes, Random Forest, SVM, XGBoost, Neural Network
Features taken from URLs and Wikidata ontology
3 tasks: relevance, ease of navigation (data from T1.2), and
authoritativeness
Features
Features of the representative URL extracted
Features of the reference node’s coding and structure
Features of the website available through the representative URL
Features of the claim node associated to this reference node
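To give a feel for what "features of the representative URL" might look like, here is a small sketch using only the standard library. The specific features (HTTPS flag, domain, top-level domain, path depth, query flag) are hypothetical choices for illustration; the slide does not list the features actually used.

```python
from urllib.parse import urlparse


def url_features(url):
    """Extract simple, content-free features from a reference URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path_parts = [part for part in parsed.path.split("/") if part]
    return {
        "scheme_https": parsed.scheme == "https",
        "domain": host,
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_depth": len(path_parts),
        "has_query": bool(parsed.query),
    }


# Placeholder URL, not a real Wikidata reference.
example_features = url_features("https://www.example.org/a/b?id=1")
print(example_features)
```

Features of this kind can be computed without fetching the page, which fits RQ6's aim of predicting quality without relying on the content of the references.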
DATA
Snapshot up to April 16th, 2020
15m individual items used in Stage 1 (20% of
total)
385 random references from each language
(95% confidence, 5% margin of error)
2,310 references in total
Six languages, both internal and external
Around 940 could be automatically checked (40% of total)
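The figure of 385 references per language is consistent with the standard sample-size formula for a proportion at 95% confidence and 5% margin of error. Whether the authors used exactly this formula is an assumption; the sketch uses the large-population version with the most conservative proportion p = 0.5.

```python
import math


def sample_size(z=1.96, margin=0.05, p=0.5):
    """n = z^2 * p * (1 - p) / margin^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


per_language = sample_size()
total = per_language * 6  # six languages
print(per_language, total)  # 385 2310
```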
RESULTS: CROWDSOURCING
CROWDSOURCING WORKS
Agreement calculated per question and language
Agreement overall moderate (over 0.4) and often substantial (over
0.6)
Multiple trusted metrics used for robustness
T1 (460 microtasks, 87 workers)
Task   Fleiss' k   Randolph's k   Kripp's α
T1.1   0.507702    0.745626       0.537647
T1.2   N/A*        N/A*           0.614661
T1.3   0.405330    0.592810       0.415045
T2 (370 microtasks, 74 workers)
Task   Fleiss' k   Randolph's k   Kripp's α
T2.1   0.496904    0.781481       0.645795
T2.2   0.603499    0.695833       0.657171
T2.3   0.515551    0.604419       0.572782
* Kappa metrics can’t be used for interval data
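The two kappa statistics differ only in how chance agreement is estimated, which is one reason several metrics are reported. The sketch below implements the textbook definitions of Fleiss' kappa and Randolph's free-marginal kappa for a fixed number of raters per item; the rating counts are invented and are not the study's data.

```python
def agreement(ratings, n_categories):
    """ratings: one row per item; row[j] = number of raters picking category j.

    Every item must have the same number of raters (at least 2).
    Returns (fleiss_kappa, randolph_kappa).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Fleiss: chance agreement from the observed category shares.
    shares = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance_fleiss = sum(s * s for s in shares)
    # Randolph: chance agreement assumes all categories are equally likely.
    p_chance_free = 1 / n_categories
    # Note: Fleiss' kappa is undefined if all ratings fall in one category.
    fleiss = (p_obs - p_chance_fleiss) / (1 - p_chance_fleiss)
    randolph = (p_obs - p_chance_free) / (1 - p_chance_free)
    return fleiss, randolph


# Four items, five raters, two categories, heavily skewed towards category 0.
toy = [[5, 0], [4, 1], [5, 0], [4, 1]]
fleiss_k, randolph_k = agreement(toy, n_categories=2)
print(round(fleiss_k, 3), round(randolph_k, 3))
```

With skewed answers like these, Fleiss' kappa drops sharply while Randolph's stays high. This may help explain why Randolph's values in the table are consistently above Fleiss' when most references are judged relevant.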
RESULTS: CROWDSOURCING
MAJORITY OF REFERENCES ARE HIGH QUALITY
2310 references evaluated
Found 91.73% are relevant
Found 66.97% are authoritative
Most (95%) unauthoritative references link to Wikipedia
On ease of access:
5-level Likert Scale was used
78.43% fall on levels 3 (easy) or 4 (very easy)
6.66% fall on levels 0 (very difficult) and 1 (difficult)
76.44% of irrelevant references were due to working links with
wrong information (different from Wikidata)
19.37% were due to bad links
RQ1
RQ2
RQ3
RESULTS: CROWDSOURCING
VARYING REFERENCE QUALITY
RQ4
External references had better agreement than internal
10% better on average (Krippendorff’s Alpha)
No noticeable quality difference between external and
internal references
Quality differs a lot based on language
Examples:
English references are much harder to access (19.37% voted as
very difficult)
Non-authoritative references: 4.94% for Dutch, but 43.12% for
Japanese
Likely due to specific languages used for specific topics
RQ5
RESULTS: CROWDSOURCING
DATA FROM GOVERNMENT AND ACADEMIC SOURCES
Most common author type (T2.1)
Organization (67%)
Most common publisher types (T2.2)
Companies/Organizations (37.75%)
Self-published (28.79%)
Academic/Scientific (22.16%)
This also varies based on language
Academic/Scientific for English is 83.12%
Self-published for Swedish is 57.14%
RQ2
RQ5
RESULTS: MACHINE LEARNING
RANDOM FORESTS PERFORM BEST
Relevance           F1      Acc
Random Forest       0.78    0.93
XGBoost             0.76    0.91
Naïve Bayes         0.66    0.86
SVM                 0.68    0.85
Neural Network      0.79    0.94
Authoritativeness   F1      Acc
Random Forest       0.981   0.983
XGBoost             0.977   0.980
Naïve Bayes         0.974   0.977
SVM                 0.975   0.979
Neural Network      0.982   0.984
RQ6
RESULTS: MACHINE LEARNING
RANDOM FORESTS PERFORM BEST
Ease of Navigation  F1      Acc
Random Forest       0.815   0.834
XGBoost             0.183   0.06
Naïve Bayes         0.811   0.806
SVM                 0.780   0.824
Neural network      0.831   0.936
RQ6
KG embeddings trained on Wikidata did not improve the task overall
when compared to simple one-hot encoding
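F1 and accuracy can diverge when classes are imbalanced, as with relevance, where most references are relevant. The sketch below uses invented confusion-matrix counts and treats the minority class as the positive class; how the study averaged F1 across classes is not stated on the slides, so this is only an illustration of the two metrics.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """Binary F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy


# Invented counts: 9 of 100 references belong to the minority class.
f1, acc = f1_and_accuracy(tp=5, fp=3, fn=4, tn=88)
print(round(f1, 3), round(acc, 3))
```

Here accuracy is 0.93 because the majority class dominates, while F1 for the minority class is noticeably lower.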
CONCLUSIONS
Crowdsourcing + ML works
Many external sources are high-quality
Bad references are mainly working links with bad
information
External references were much clearer to workers
There are topical biases in different languages,
leading quality to vary by language
ONGOING WORK
Editor study to
understand
how references
are chosen
Automatic
claim
verification
ENCOURAGING EDITOR
DIVERSITY, IMPROVING
RETENTION
Alghamdi, K., Shi, M. & Simperl, E. (2021). Learning to
recommend items to Wikidata editors. International
Semantic Web Conference, 163-181.
OVERVIEW
Content growth is
outpacing community
growth
Tenure and interest
diversity help with item
quality
Recommenders could
help retain new editors
Source: Lydia Pintscher @Wikidata
Pages per active editor
DATASETS
LONG TAIL DISTRIBUTION OF PARTICIPATION
[Figure panels: all editors; editors with 200+ items]
WIKIDATAREC
MODEL
Recommendation
modelled as classification
Sparse data: an editor is assumed to be
interested in an item if they edit it;
there is no explicit feedback on lack of
interest
Features of items, item
relations and editors
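When only edits are observed, a classifier still needs negative examples. One common workaround, shown here purely as an assumption and not as WikidataRec's confirmed training procedure, is to sample items an editor has never edited and label them 0. The function name and parameters are hypothetical.

```python
import random


def build_training_pairs(edits, all_items, negatives_per_positive=1, seed=0):
    """edits: dict mapping editor -> set of edited items.

    Returns (editor, item, label) triples with label 1 for edited items
    and label 0 for sampled items the editor has not edited.
    """
    rng = random.Random(seed)
    pairs = []
    for editor, edited in edits.items():
        unseen = sorted(set(all_items) - edited)
        for item in sorted(edited):
            pairs.append((editor, item, 1))
            k = min(negatives_per_positive, len(unseen))
            # Unedited does not prove lack of interest; it is only a proxy.
            for negative in rng.sample(unseen, k):
                pairs.append((editor, negative, 0))
    return pairs


toy_edits = {"alice": {"Q84", "Q727"}}
toy_items = ["Q84", "Q727", "Q515", "Q145"]
pairs = build_training_pairs(toy_edits, toy_items)
print(pairs)
```

The sampled negatives are noisy by construction, which mirrors the slide's point that there is no explicit signal for lack of interest.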
RESULTS: WIKIDATAREC VS BASELINES
OUTPERFORMS COLLABORATIVE FILTERING AND CONTENT-
BASED BASELINES
Baselines: collaborative filtering (CF) and content-based
GMF = neural network implementation of collaborative filtering (CF)
BPR-MF = matrix factorization (MF), for recommenders with implicit feedback
eALS = MF, for implicit feedback scenarios, optimises a different function
YouTube DNN = learns item and relational embeddings in a neural network with
random initialisation
RESULTS: ABLATION STUDY
GRAPH FEATURES MAKE A DIFFERENCE
Performance increases when items and relations are considered
Item content contributed more than relations
WikidataRec w NMoR (neural mixture of representations) performs
significantly better than WikidataRec w/o NMoR
RESULTS: ACTIVE EDITORS
FURTHER PERFORMANCE INCREASE ON LESS SPARSE DATA
[Figure panels: Wikidata-14M; Wikidata-Active-Editor]
CONCLUSIONS
WikidataRec shows recommendations are
technically feasible
Alternative means to collect editor
preferences and learn about their
motivations needed because of data
sparsity
ONGOING WORK
Study with editors suggested editors work
within the same topics for some time, then
change to new topics
Extension to WikidataRec for topic
recommendations yields better results than
item recommendations
WikidataRecSeq, which takes edit sequences into
account, is under development
HUMAN-MACHINE
INTERACTIONS IN
KNOWLEDGE ENGINEERING
Use of AI tools e.g. language
models in KG engineering:
explainability, trust
Use of KGs in downstream AI
applications e.g. through
embeddings
WHAT WIKIDATA
TEACHES US ABOUT
KNOWLEDGE
ENGINEERING
Wikidata logs are a
huge source of
observational data in
knowledge engineering
Sociotechnical methods
improve on purely
technical ones to address
community growth and
diversity, content
reliability
Many cross-disciplinary
methods e.g. psychology,
sociology, neurosciences
underexplored
