
Quality and collaboration in Wikidata



Studies in socio-technical aspects of collaborative knowledge graphs



  1. 1. QUALITY AND COLLABORATION IN WIKIDATA Elena Simperl and Alessandro Piscopo, University of Southampton, UK (@esimperl)
  2. 2. OVERVIEW Wikidata is a critical AI asset in many applications. A recent Wikimedia project (2012), it is edited collaboratively. Our research assesses the quality of Wikidata and the link between community processes and quality.
  3. 3. WHAT IS WIKIDATA
  4. 4. BASIC FACTS Collaborative knowledge graph. 100k registered users, 35M items. Open licence. RDF exports, connected to the Linked Open Data Cloud.
  5. 5. THE KNOWLEDGE GRAPH STATEMENTS, ITEMS, PROPERTIES Item identifiers start with a Q, property identifiers start with a P. [Example statement: London (Q84) → head of government (P6) → Sadiq Khan (Q334155)]
  6. 6. THE KNOWLEDGE GRAPH ITEMS CAN BE CLASSES, ENTITIES, VALUES [Diagram: example items and properties include Ada Lovelace (Q7259), London (Q84), Sadiq Khan (Q334155), head of government (P6), Amsterdam (Q727), city (Q515), male (Q6581097), Labour Party (Q59360), United Kingdom (Q145)]
  7. 7. THE KNOWLEDGE GRAPH ADDING CONTEXT TO STATEMENTS Statements may include context: qualifiers (optional) and references (required). Two types of references: internal, linking to another item; external, linking to a webpage. [Diagram: London (Q84) → head of government (P6) → Sadiq Khan (Q334155), qualifier: 9 May 2016, reference: https://www.london.gov.uk/...]
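A minimal sketch (not part of the original deck) of how the example statement above can be retrieved from the public Wikidata SPARQL endpoint, https://query.wikidata.org/sparql, using Python's requests library; the truthy wdt:P6 path returns the current head of government of London (Q84):

```python
# Sketch: query Wikidata for the statement "London (Q84), head of government (P6), ...".
# Prefixes wd:, wdt:, wikibase: and bd: are predefined by the Wikidata Query Service.
import requests

query = """
SELECT ?headOfGov ?headOfGovLabel WHERE {
  wd:Q84 wdt:P6 ?headOfGov .                                   # London -> head of government
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-quality-slides-demo/0.1 (example)"},  # illustrative UA
).json()["results"]["bindings"]

for row in rows:
    print(row["headOfGov"]["value"], row["headOfGovLabel"]["value"])
```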
  8. 8. THE KNOWLEDGE GRAPH CO-EDITED BY BOTS AND HUMANS Human editors can register or work anonymously. Bots are created by the community for routine tasks.
  9. 9. OUR WORK Influence of community make-up on outcomes. Effects of editing practice on outcomes. Data quality as a function of its provenance.
  10. 10. THE RIGHT MIX OF USERS Piscopo, A., Phethean, C., & Simperl, E. (2017). What Makes a Good Collaborative Knowledge Graph: Group Composition and Quality in Wikidata. International Conference on Social Informatics, pp. 305-322, Springer.
  11. 11. BACKGROUND Wikidata editors have varied tenure and interests. Group composition impacts outcomes: diversity can have multiple effects; moderate tenure diversity increases outcome quality; interest diversity leads to increased group productivity. Chen, J., Ren, Y., Riedl, J.: The effects of diversity on group productivity and member withdrawal in online volunteer groups. In: Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI '10, p. 821. ACM Press, New York, USA (2010)
  12. 12. OUR STUDY Analysed the edit history of items. Used a corpus of 5000 items whose quality has been manually assessed (5 levels)*. The edit history analysis focused on community make-up; the community is defined as the set of editors of an item. Considered features from the group diversity literature and Wikidata-specific aspects. *https://www.wikidata.org/wiki/Wikidata:Item_quality
  13. 13. RESEARCH HYPOTHESES (activity → outcome) H1: Bot edits → Item quality. H2: Bot-human interaction → Item quality. H3: Anonymous edits → Item quality. H4: Tenure diversity → Item quality. H5: Interest diversity → Item quality.
  14. 14. DATA AND METHODS Ordinal regression analysis; four models were trained. Dependent variable: quality level of the 5000 labelled Wikidata items. Independent variables: proportion of bot edits; bot-human edit proportion; proportion of anonymous edits; tenure diversity (coefficient of variation); interest diversity (user editing matrix). Control variables: group size, item age.
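The deck does not include the analysis code; the following is a hedged sketch of the kind of ordinal regression described on this slide, assuming a hypothetical CSV with one row per labelled item and illustrative column names (none of these names come from the original study):

```python
# Sketch: ordinal regression of item quality (5 ordered levels) on group-composition features.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

items = pd.read_csv("labelled_items.csv")  # hypothetical file: one row per labelled item

# Quality labels A (best) to E (worst), encoded as an ordered categorical.
items["quality"] = pd.Categorical(
    items["quality"], categories=["E", "D", "C", "B", "A"], ordered=True
)

features = [
    "prop_bot_edits",         # proportion of bot edits (H1)
    "bot_human_interaction",  # bot-human edit proportion (H2)
    "prop_anonymous_edits",   # proportion of anonymous edits (H3)
    "tenure_diversity",       # coefficient of variation of editor tenure (H4)
    "interest_diversity",     # derived from the user editing matrix (H5)
    "group_size",             # control
    "item_age",               # control
]

model = OrderedModel(items["quality"], items[features], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```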
  15. 15. RESULTS ALL HYPOTHESES SUPPORTED (H1–H5)
  16. 16. LESSONS LEARNED The more is not always the merrier. Bot edits are key for quality, but bots and humans together are better. Diversity matters.
  17. 17. IMPLICATIONS Encourage registration. Identify further areas for bot editing. Design effective human-bot workflows. Suggest items to edit based on tenure and interests.
  18. 18. LIMITATIONS AND FUTURE WORK ▪ Measures of quality over time required ▪ Sample differs from Wikidata as a whole (where most items are C or lower) ▪ Other group features (e.g., coordination) not considered ▪ No distinction between editing activities (e.g., schema vs. instances, topics, etc.) ▪ Different metrics of interest (topics, type of activity)
  19. 19. THE DATA IS AS GOOD AS ITS REFERENCES Piscopo, A., Kaffee, L. A., Phethean, C., & Simperl, E. (2017). Provenance Information in a Collaborative Knowledge Graph: an Evaluation of Wikidata External References. International Semantic Web Conference, pp. 542-558, Springer.
  20. 20. PROVENANCE IN WIKIDATA Statements may include context: qualifiers (optional) and references (required). Two types of references: internal, linking to another item; external, linking to a webpage. [Diagram: London (Q84) → head of government (P6) → Sadiq Khan (Q334155), qualifier: 9 May 2016, reference: https://www.london.gov.uk/...]
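Because the next study evaluates external references, it may help to see how such references can be pulled out of the graph. A sketch (again not from the paper) that lists the reference URLs (property P854) attached to the example statement, using the full statement/reference model of the Wikidata Query Service:

```python
# Sketch: fetch external reference URLs attached to London's "head of government" statements.
# p:, ps:, prov: and pr: are standard, predefined prefixes of the Wikidata RDF model.
import requests

query = """
SELECT ?headOfGov ?refURL WHERE {
  wd:Q84 p:P6 ?stmt .              # statement nodes for London (Q84), head of government (P6)
  ?stmt ps:P6 ?headOfGov .         # the statement value
  ?stmt prov:wasDerivedFrom ?ref . # reference node(s)
  ?ref pr:P854 ?refURL .           # P854 = "reference URL" (external reference)
}
"""

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-provenance-demo/0.1 (example)"},  # illustrative UA
).json()["results"]["bindings"]

for row in rows:
    print(row["headOfGov"]["value"], "->", row["refURL"]["value"])
```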
  21. 21. THE ROLE OF PROVENANCE Wikidata aims to become a hub of references. Data provenance increases trust in Wikidata. Lack of provenance hinders data reuse. The quality of references is as yet unknown. Hartig, O. (2009). Provenance Information in the Web of Data. LDOW, 538.
  22. 22. OUR STUDY An approach to evaluate the quality of external references in Wikidata. Quality is defined by the Wikidata verifiability policy: relevant (references support the statement they are attached to); authoritative (trustworthy, up-to-date, and free of bias for supporting a particular statement). Large-scale (the whole of Wikidata). Bot vs. human-contributed references.
  23. 23. RESEARCH QUESTIONS RQ1: Are Wikidata external references relevant? RQ2: Are Wikidata external references authoritative, i.e., do they match the author and publisher types from the Wikidata policy? RQ3: Can we automatically detect non-relevant and non-authoritative references?
  24. 24. METHODS TWO-STAGE MIXED APPROACH 1. Microtask crowdsourcing: evaluate relevance & authoritativeness of a reference sample; create a training set for the machine learning model (RQ1, RQ2). 2. Machine learning: large-scale reference quality prediction (RQ3).
  25. 25. STAGE 1: MICROTASK CROWDSOURCING ▪ 3 tasks on Crowdflower ▪ 5 workers per task, majority voting ▪ Test questions to select workers. Tasks: Relevance (T1): Does the reference support the statement? Authoritativeness (T2): choose author type from a list; (T3.A): choose publisher type from a list; (T3.B): verify publisher type, then choose sub-type from a list. [RQ1, RQ2]
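The aggregation step ("5 workers per task, majority voting") amounts to picking the most frequent answer per unit; a toy sketch, not the authors' pipeline:

```python
# Sketch: majority voting over worker judgements for a single reference.
from collections import Counter

def majority_vote(judgements):
    """Return the most frequent answer; ties are broken by first occurrence."""
    return Counter(judgements).most_common(1)[0][0]

# Hypothetical judgements from five workers for one T1 microtask.
print(majority_vote(["relevant", "relevant", "not relevant", "relevant", "relevant"]))
```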
  26. 26. STAGE 2: MACHINE LEARNING Compared three algorithms: Naïve Bayes, Random Forest, SVM. Features based on [Lehmann et al., 2012 & Potthast et al., 2008]. Baseline: item label matching (relevance); deprecated domains list (authoritativeness). Features: URL the reference uses; source HTTP code; statement item vector; statement object vector; subject parent class; property parent class; object parent class; author type; author activity; author activity on references. [RQ3]
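A hedged sketch of the classification stage, using scikit-learn's Random Forest (the best-performing model on the later results slide); the feature table and label column are assumed to have been built from the crowdsourced judgements, and all names are illustrative:

```python
# Sketch: train and evaluate a Random Forest on reference features with crowdsourced labels.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

refs = pd.read_csv("reference_features.csv")   # hypothetical numeric feature table
X = refs.drop(columns=["is_authoritative"])    # features listed on this slide
y = refs["is_authoritative"]                   # 0/1 label from stage 1 (crowdsourcing)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(f"Mean F1 over 10 folds: {scores.mean():.2f}")
```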
  27. 27. DATA 1.6M external references (6% of total), 1.4M of which come from two sources (protein KBs). 83,215 English-language references. Sample of 2586 (99% confidence, 2.5% margin of error). 885 assessed automatically, e.g., links not working or CSV files.
  28. 28. RESULTS: CROWDSOURCING CROWDSOURCING WORKS ▪ Trusted workers: >80% accuracy ▪ 95% of responses from T3.A confirmed in T3.B. Per task (no. of microtasks, total workers, trusted workers, workers' accuracy, Fleiss' k): T1: 1701 references, 457, 218, 75%, 0.335; T2: 1178 links, 749, 322, 75%, 0.534; T3.A: 335 web domains, 322, 60, 66%, 0.435; T3.B: 335 web domains, 239, 116, 68%, 0.391.
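The agreement column (Fleiss' k) can be reproduced from raw worker responses with, for example, statsmodels; the counts below are toy data, not the study's actual judgements:

```python
# Sketch: Fleiss' kappa for one task, from a units x categories matrix of worker counts.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Each row is one reference; columns are answer categories ("relevant", "not relevant");
# entries count how many of the five workers chose that category.
counts = np.array([
    [5, 0],
    [4, 1],
    [3, 2],
    [5, 0],
])
print(fleiss_kappa(counts))
```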
  29. 29. RESULTS: CROWDSOURCING MAJORITY OF REFERENCES ARE HIGH QUALITY 2586 references evaluated. Found 1674 valid references from 345 domains. Broken URLs were deemed not relevant and not authoritative. [RQ1, RQ2]
  30. 30. RESULTS: CROWDSOURCING HUMANS ARE BETTER AT EDITING REFERENCES [RQ1, RQ2]
  31. 31. RESULTS: CROWDSOURCING DATA FROM GOVT. AND ACADEMIA Most common author type (T2): organisation (78%). Most common publisher types (T3): governmental agencies (37%), academic organisations (24%). [RQ2]
  32. 32. RESULTS: MACHINE LEARNING RANDOM FORESTS PERFORM BEST (F1 / MCC) Relevance: Baseline 0.84 / 0.68; Naïve Bayes 0.90 / 0.86; Random Forest 0.92 / 0.89; SVM 0.91 / 0.87. Authoritativeness: Baseline 0.53 / 0.16; Naïve Bayes 0.86 / 0.78; Random Forest 0.89 / 0.83; SVM 0.89 / 0.79. [RQ3]
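For readers unfamiliar with the second metric: MCC is the Matthews correlation coefficient, which for a binary confusion matrix (TP, TN, FP, FN) is

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

It ranges from −1 to +1 and is less sensitive to class imbalance than F1.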
  33. 33. LESSONS LEARNED Crowdsourcing + ML works! Many external sources are high quality. Bad references are mainly non-working links, so continuous quality control is required. Lack of diversity in bot-added sources. Humans and bots are good at different things.
  34. 34. LIMITATIONS AND FUTURE WORK Studies with non-English sources. A new approach for internal references. Deployment in Wikidata, including changes in editing behaviour.
  35. 35. THE COST OF FREEDOM: ON THE ROLE OF PROPERTY CONSTRAINTS IN WIKIDATA
  36. 36. BACKGROUND Wikidata is built by the community, from scratch. Editors are free to carry out any kind of edit. There is tension between editing freedom and quality of the modelling. Property constraints were introduced at a later stage; there are currently 18 constraints, but they are not enforced. Hall, A., McRoberts, S., Thebault-Spieker, J., Lin, Y., Sen, S., Hecht, B., & Terveen, L. (2017, May). Freedom versus standardization: structured data generation in a peer production community. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 6352-6362). ACM.
  37. 37. OUR STUDY Effects of property constraints on: content quality, i.e., increasing user awareness of property use; diversity of expression; editor behaviour, by increasing conflict level.
  38. The cost of freedom: Claims ▪ Several claims can be expressed for a statement, thanks to qualifiers and references. [Diagram: London (Q84) → head of government (P6): Sadiq Khan (Q334155), 9 May 2016, https://www.london.gov.uk/…; Boris Johnson (Q180589), 4 May 2008, https://www.london.gov.uk/…]
  39. 39. RESEARCH HYPOTHESES (activity → outcome) H1: Property constraints → Property perspicuity. H2: Property constraints → Knowledge diversity. H3: Property constraints → Level of conflict.
  40. 40. METRICS ▪ Property perspicuity: V = N_violations / N_claims ▪ Knowledge diversity: KD_score = N_claims / N_statements ▪ Controversy metric, based on conflicting edits: C_score = N_conflicting_edits / N_edits (0 ≤ C_score ≤ 1)
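Spelled out as code, the three metrics are simple ratios over counts extracted from an item's or property's edit history; a minimal sketch with illustrative names:

```python
# Sketch: the three metrics from this slide as plain ratio functions.

def perspicuity_violations(n_violations: int, n_claims: int) -> float:
    """V = N_violations / N_claims; lower values suggest clearer property use."""
    return n_violations / n_claims

def knowledge_diversity(n_claims: int, n_statements: int) -> float:
    """KD_score = N_claims / N_statements; higher values mean more claims per statement."""
    return n_claims / n_statements

def controversy(n_conflicting_edits: int, n_edits: int) -> float:
    """C_score = N_conflicting_edits / N_edits, a proportion between 0 and 1."""
    return n_conflicting_edits / n_edits
```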
  41. 41. METHODS H1: Linear trend analysis of C_violations. H2 and H3: Lagged, multiple regression models to predict changes between Tn and Tn−1 in KD_score and C_score.
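A sketch of what such a lagged regression might look like in practice; the panel structure, variable names, and the single control are assumptions for illustration, not the paper's exact specification:

```python
# Sketch: predict the change in KD_score between consecutive time windows from the
# number of property constraints at the start of the window (lagged predictor).
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("property_windows.csv")  # hypothetical table: one row per property per window
panel = panel.sort_values(["property", "window"])

panel["kd_change"] = panel.groupby("property")["kd_score"].diff()               # KD_score(Tn) - KD_score(Tn-1)
panel["constraints_lag"] = panel.groupby("property")["n_constraints"].shift(1)  # constraints at Tn-1

model = smf.ols("kd_change ~ constraints_lag + n_edits", data=panel.dropna()).fit()
print(model.summary())
```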
  42. 42. RESULTS H1 was supported, but limited to some constraints: 12 constraints out of 18 showed significant variation over the time frame observed. The constraint with the largest variation was type (i.e., property domain).
  43. 43. RESULTS H2 was rejected, but more property constraints at the beginning of a time frame lead to decreased knowledge diversity
  44. 44. RESULTS H3 was rejected: constraints lead to fewer conflicts.
  45. 45. LIMITATIONS Wikidata is still at an early stage of development. Metrics need further refinement. Changes were made to constraints after our analysis, which could produce new effects.
  46. 46. LESSONS LEARNED Editors seem to understand the meaning of property constraints. Low level of knowledge diversity and conflict overall. Non-enforcement of constraints seems to have only a limited effect on community dynamics. Effects of when and how constraints are introduced not explored yet.
  47. 47. CONCLUSIONS
  48. 48. SUMMARY OF FINDINGS Collaboration between humans and bots is important. Tools are needed to identify tasks for bots and to continuously study their effects on outcomes and the community. References are high quality, though biases exist in the choice of sources. Wikidata's approach to knowledge engineering questions existing theoretical and empirical literature.
