Wikidata as a hub for the linked data cloud
TUTORIAL AT DCMI CONFERENCE, SEOUL, 2019-09-25
Tom Baker, Joachim Neubert, Andra Waagmeester
Slides (partially) at https://jneubert.github.io/wd-dcmi2019/#/
Overview
Part 1: Using and querying Wikidata
Part 2: Wikidata as a linking hub
Part 3: Applications based on Wikidata
Part 4: Wikidata usage scenarios
Scenarios
Intro and details
Hands-on: Mix-n-match
Quality control tools and procedures
Wikidata community
Wikidata as linking hub
The idea of linking hubs
Connect concepts via identifiers/URLs
Existing hubs: VIAF, sameAs.org, ...
Image by Jakob Voss
Different linking properties
1. exact match (datatype URL): generic link to a URL in the meaning of skos:exactMatch
2. Pxxxx: more than 4000 specialized properties (datatype external identifier)
Examples for external identifiers
GND / VIAF identifiers
geographical entities
proteins
Swedish cultural heritage objects
African plants
baseball players
TED conference speakers
Property definitions
subject item for the property
examples
constraints on values, cardinality, etc.
formatter URL: creates a clickable link for the ID
Start at the property page, e.g., for the ISSN: https://www.wikidata.org/wiki/Property:P236
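The formatter URL itself is stored as a statement on the property page, so it can be queried like any other value. A minimal sketch for the Wikidata Query Service (P1630 is the "formatter URL" property):

```sparql
# Fetch the formatter URL (P1630) stored on the ISSN property (P236)
SELECT ?formatterURL WHERE {
  wd:P236 wdt:P1630 ?formatterURL .
}
```

Substituting an ID value for the `$1` placeholder in the result yields the clickable link shown on item pages.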
Property Documentation
Beyond sameness - mapping relations
Wikidata external IDs imply "sameness" of linked concepts
Even with geographic names, other mapping relations are required in some cases.
Examples:
close matches, e.g., "Yugoslavia" (1918-1992) (Wikidata) ≅ "Yugoslavia (until 1990)" (STW)
related matches, e.g., a company and its founder
Mapping relation type (P4390)
introduced after a community discussion in October 2017
to be used as a qualifier for external ID entries
fixed value set: the SKOS mapping relations (exact, close, broad, narrow, related match)
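Because P4390 is attached as a qualifier, queries have to go through the full statement node rather than the truthy wdt: shortcut. A sketch, using the STW Thesaurus for Economics ID (P3911) as the example property:

```sparql
# List STW mappings together with their SKOS mapping relation type
SELECT ?item ?stwId ?relation WHERE {
  ?item p:P3911 ?stmt .        # full statement node
  ?stmt ps:P3911 ?stwId ;      # the external ID value
        pq:P4390 ?relation .   # qualifier: mapping relation type
}
```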
EXAMPLE AT ITEM ASSESSMENT CENTER
How does that relate to the Linked Data model?
Internal data model and storage (Wikibase) is transformed to RDF for:
RDF dumps
Query Service
RDF linking from Wikidata
formatter URI for RDF resource: linked data URI
e.g., http://sws.geonames.org/$1/ (vs. formatter URL https://www.geonames.org/$1)
26+ million linked external RDF resources (list of 130+ relationships to external RDF datasets)
plus ~950,000 exact match relations to individual URIs
Links in the RDF dumps
Output has full URLs to external resources, however with Wikidata-specific properties:
wd:Q123 wdt:P234 "External-ID" ;
        wdtn:P234 <http://example.com/reference/External-ID> .
This creates a hurdle for generic Linked Data browsers and tools - not even exact match is translated to skos:exactMatch
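One workaround is to generate the standard vocabulary on the fly: a CONSTRUCT query can translate the Wikidata-specific wdtn: triples into skos:exactMatch. A sketch using the GeoNames ID (P1566), a property with a formatter URI for RDF resource:

```sparql
# Translate normalized GeoNames links into skos:exactMatch triples
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT { ?item skos:exactMatch ?uri }
WHERE { ?item wdtn:P1566 ?uri }
```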
Federated SPARQL queries
Example use case: the GND authority file has information about people's professions/occupations that is not present in Wikidata.
So fetch that information dynamically from a GND SPARQL endpoint.
Here, we are interested in economists in particular.
From Wikidata to a remote endpoint (query to WDQS)
From a remote endpoint to Wikidata (query to GND endpoint) <== not working currently
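A sketch of the working direction, submitted to WDQS, which reaches out to a remote GND endpoint. The endpoint URL and the gndo: property name are illustrative assumptions and should be checked against the actual service:

```sparql
# Economists with a GND ID; occupations fetched live from a GND endpoint
PREFIX gndo: <https://d-nb.info/standards/elementset/gnd#>
SELECT ?item ?itemLabel ?occupation WHERE {
  ?item wdt:P106 wd:Q188094 ;   # occupation: economist
        wdt:P227 ?gndId .       # GND ID
  # build the GND linked data URI from the plain ID
  BIND(IRI(CONCAT("https://d-nb.info/gnd/", ?gndId)) AS ?gnd)
  SERVICE <http://zbw.eu/beta/sparql/gnd/query> {
    ?gnd gndo:professionOrOccupation ?occupation .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```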
Several points for attention
Direction and sequence of statements often matter for performance
To reach out from Wikidata, endpoints have to be approved (full list)
In the other direction, access is normally not restricted
Some federated queries get extremely slow when large sets of bindings exist before the remote service is invoked
Be sure to exclude variables bound to blank nodes ('unknown value' in Wikidata)
Further reading on Wikidata/RDF
RDF dump format (documentation)
Waagmeester: Integrating Wikidata and other linked data sources - more examples for federated SPARQL queries
Malyshev et al.: Semantic Technology Usage in Wikipedia's Knowledge Graph
Critical comments/suggestions:
Freire/Isaac: Technical usability of Wikidata's linked data
Application process for a new property
Double-check that the property does not already exist
Prepare a property proposal in the appropriate section, e.g., Wikidata:Property proposal/Authority control
Hints for getting it approved smoothly
Clearly lay out the motivation and planned use for the property
Provide working examples (with the formatter URI you are suggesting)
Be responsive to comments
Wikidata as a universal linking hub
easy extensibility with new properties for external identifiers
immense fund of existing items, with the full set of SKOS mapping relations for
more or less exact mappings to these
immediate extensibility with new items
Linking content via Mix-n-Match
Mix-n-match is a widely used tool (by Magnus Manske) to link external databases,
catalogs, etc. to existing Wikidata items (or to create new ones).
Example list:
Newspapers and journals from the 20th Century Press Archive
Please navigate to our example catalog
Mix-n-match manual
TasksTasks
Login through Widar
In "Automatically matched" list:
Connect matching items
Remove non-matching entries
In "Unmatched" list:
Search for existing items
Create missing items (suggested properties)
Supplementary material
Item creation "on the go"
With Mix-n-match "New item": rudimentary, no references
Custom list of prepared QuickStatements insert blocks (example from STW Thesaurus for Economics - please don't mess with it, this is work in progress)
Workflow-wise, use the same sequence for M-n-m input and prepared insert blocks
Recommendations for item creation
Pay attention to Wikidata's notability criteria (much more relaxed than Wikipedia's)
Explain your plan and ask for feedback in the Wikidata project chat
Apply for a bot account to make mass edits (example)
Source every statement (hints)
Create input in QuickStatements text format
Check with a few statements, verify the result
Run as batch, document input and batch URL
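In the QuickStatements text format, each tab-separated line adds one label, description, or statement; `LAST` refers to the item just created. A hypothetical insert block for a newspaper item (all values are illustrative, not from the actual STW batch):

```
CREATE
LAST	Len	"Example Newspaper"
LAST	Den	"daily newspaper, Exampletown"
LAST	P31	Q11032
LAST	P236	"1234-5678"
```

Here P31 (instance of) points to Q11032 (newspaper), and P236 attaches an ISSN.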
Matching from WD to the external database entries
Normally requires an endpoint for the external source, where you can search for the labels, aliases, or other data of Wikidata items
Insert statements for the external ID into Wikidata can be prepared for cut & paste or even semi-automatic execution in QuickStatements
Some hints and linked code here
Import catalog data to Mix-n-Match
Prepare data
... as a tab-separated table (one line per record) with three columns:
1. identifier
2. name
3. description
Input file for the example used earlier
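A minimal input file following the three-column layout might look like this (tab-separated; the identifiers are invented for illustration):

```
np001	Neue Zürcher Zeitung	Swiss daily newspaper, published in Zurich since 1780
np002	The Economist	British weekly newspaper on economics and politics, founded 1843
```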
Pay attention to
description column: include everything useful for intellectual identification
order: the sequence may help structure your workflow (e.g., most used entries first)
Load data via web interface
... at https://tools.wmflabs.org/mix-n-match/import.php
Sync existing IDs from Wikidata
Quality control tools and procedures
Perception: Anybody can edit anything - so Wikidata is not a reliable source of knowledge
Seen as a threat for information systems based on Wikidata
particularly by some large Wikipedias (e.g., the English one)
Basic policy to address this: Statements should be referenced
QA support for editors
Constraint definitions for properties
raise warnings during data input when, e.g.,
a format definition (ISBN, DOI, etc.) is violated
a supposedly unique identifier is added to more than one item
generated lists of constraint violations (e.g., ZDB ID format)
Constraints can be very helpful, but do not cover complex cases
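A violation list like the one above can also be generated directly with a SPARQL query, e.g., for the uniqueness case, using the ISSN (P236) as an illustrative property:

```sparql
# ISSN values that are attached to more than one item
SELECT ?issn (COUNT(?item) AS ?items) WHERE {
  ?item wdt:P236 ?issn .
}
GROUP BY ?issn
HAVING (COUNT(?item) > 1)
ORDER BY DESC(?items)
```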
More QA support for editorsMore QA support for editors
Additional reports can be created via SPARQL queries
Shape Expressions (ShEx) allow to define complex constraints and conformance
checks
ShEx Primer
How to get started with Shape Expressions in Wikidata?
Revision control and patrolling
Versioned edits and version control
Manual and tool-supported vandalism prevention
Watchlists
Automated flagging of suspect edits (e.g., "new editor deleting statements")
Technically very easy to revert edits
Semi-protection or protection of often-vandalized items
Patrolling
Automated tools for vandalism detection
Fighting to keep up with the rate of human edits in Wikidata (multiple per second)
... requires reducing the manual workload, e.g., via the
Objective Revision Evaluation Service (ORES)
Wikidata Abuse Filter, and other rule-based and machine-learning tools
Ongoing research
Heindorf et al.: Vandalism Detection in Wikidata
Sarabadani et al.: Building automated vandalism detection tools for
Wikidata
The Wikidata community
Everybody can participate
No central "committee" or decision structure
Decisions are made via discussion and community consensus
Project chat
Main entry point for all kinds of discussions
Resolved discussions archived after 7 days - searchable
Beginner's questions welcome (but please try to find the answer online first, particularly in the FAQ, which has a search link to the help pages)
Compared to Wikipedia, the overall atmosphere is constructive (though exceptions sometimes exist in some sub-communities)
English is the lingua franca, but a few questions show up in other languages, and receive comments, too
User page and user talk page
Introduce yourself - especially if you work in a professional context with Wikidata (example)
Activate notifications to your email address
Be responsive to comments on your talk page
You can address other users on their talk page, too
Talk pages of properties
Questions on the use of a certain property
Suggestions for changes or enhancements of the property definition (example)
Consider adding properties you are interested in to your watchlist
WikiProjects
Often a great source for documentation about the community consensus in certain fields
Many WikiProject pages contain data structuring recommendations - see, e.g., the one for periodicals
Current WikiProjects on Wikidata
Thank you - questions welcome!
Joachim Neubert
j.neubert@zbw.eu
Jneubert on WD
