Best query ever -> bad metadata = no match
Mediocre query -> bad metadata = no match
Horrible query -> bad metadata = no match
Best query ever -> good metadata = match ✓+
Mediocre query -> good metadata = match (probably) ✓
Horrible query -> good metadata = match (maybe) ✓-
Metadata Quality Audit: Overview
Accurate and complete metadata is vital to querying and citation linking. If the metadata for a DOI is incorrect, incomplete, or messy, a match can't be made, regardless of the quality of the query.
Current efforts include:
Resolution report (emailed monthly)
Depositor report (on website)
Crawler (on website)
Field report (on website)
Conflict report (on website, emailed monthly)
Schematron reports (emailed weekly)
Failed query report (on website)
DOI error reports (emailed daily)
Contact members individually (as issues arise)
Documentation and communication
Metadata Quality Audit
A Metadata Quality Audit will:
provide publishers with detailed feedback on the quality of their metadata by identifying problem areas
identify members who need attention
provide motivation and support to members with metadata issues
The intent of the audit is to provide information, but there may be consequences for extreme abusers.
Audit Scope
DOI resolution
Conflicts
Overall metadata quality
Metadata maintenance
"Hello, I’d like to audit you." "Hooray! Great, let’s get started!"
I. DOI Resolution
Level I: DOIs that have been distributed but not deposited and resolve to the Handle error page *
Level II: DOIs resolving to an error page *
Level III: DOIs with response page blocked by access control
Level IV: DOIs that resolve to an inadequate response page
* actionable transgressions
II. Conflicts
Conflicts occur when two (or more) DOIs are deposited with identical metadata.
Level I: conflicts created between members *
Level II: conflicts within a publisher prefix(es) *
Level III: conflicts created due to insufficient metadata +
Level IV: conflicts created due to item/content type +
* actionable transgressions
+ this may change, more later
Quality of deposited metadata
I. Missing metadata: is all available metadata deposited?
II. Accuracy: is metadata correct?
III. Unusual metadata: does metadata fit into the correct content type?
IV. Overall quality: is metadata messy?
Maintenance
I. Gaps in coverage: this usually indicates undeposited DOIs (very, very bad)
II. Currency of deposits: are deposits made ahead of DOIs being distributed?
III. Title maintenance: less of a problem with recent title restrictions, but we still have problems, such as title abbreviations
IV. Reference linking compliance
Actionable Areas
DOI Resolution:
Level I (undeposited DOIs)
Level II (DOIs resolving to error page)
If action is not taken within a reasonable time period (TBD), DOIs will be registered on behalf of the member (eventually for a fee)
Continual distribution of unregistered DOIs may affect membership
Conflicts:
Level I: conflicts created between members
Level II: conflicts within a publisher prefix
A $2 per DOI conflict penalty fee may be imposed for conflicts of this type if they are not resolved within a reasonable time period (TBD).
Metadata Maintenance: Outbound linking compliance
members found to not be linking during the audit will be subject to non-linking penalties
II. DOI Registration Pilot
DOIs should without exception be registered before they are released to the public. Most DOIs resolve, but the ones that don’t are a big problem.
Solution: we’re going to register them*
*(ideal solution: publisher registers them)
DOI selection: At the moment, we will register DOIs reported by end users, using the DOI error report as a source.
DOI error report:
~4,000 DOI errors reported monthly
> 1,400 fixed monthly through publisher deposits
Some of the unfixed DOIs are not ‘real’ DOIs, but many are.
We will register DOIs that meet the following criteria:
Have been distributed publicly by the publisher/prefix owner
Have an identifiable response page
Have been reported to the publisher’s technical and business contacts
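The three criteria above can be read as a simple filter over reported DOIs. The sketch below is only an illustration: the `ReportedDoi` record and its field names are hypothetical, and in practice each flag would be set by a staff member reviewing the report, not computed automatically.

```python
from dataclasses import dataclass


@dataclass
class ReportedDoi:
    """Hypothetical record for one DOI taken from the DOI error report."""
    doi: str
    publicly_distributed: bool   # distributed by the publisher/prefix owner
    has_response_page: bool      # an identifiable response page exists
    contacts_notified: bool      # technical and business contacts were told


def eligible_for_registration(report: ReportedDoi) -> bool:
    """Return True only when all three registration criteria are met."""
    return (
        report.publicly_distributed
        and report.has_response_page
        and report.contacts_notified
    )


# 10.5555 is used here purely as a placeholder prefix.
candidate = ReportedDoi("10.5555/example", True, True, False)
print(eligible_for_registration(candidate))
```

A DOI failing any one criterion stays out of the registration queue, which matches the "expire" step in the review process.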
DOI Registration Process
DOI reported: a user reports an unresolving DOI using the DOI error form
Technical contact notified (DOI error report email)
CrossRef review: CR staff reviews reported DOIs and expires DOIs that do not meet our registration criteria
Business contact notified: 2 weeks from the initial report, the business contact is notified of remaining valid unregistered DOIs
CR deposit: after 2 weeks have passed from business contact notification, CrossRef will register any undeposited DOIs
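To make the timing concrete, here is a minimal sketch of the two waiting periods in the process above. The function name and dictionary keys are assumptions made for illustration, and the result is the earliest possible date for each step, not a guaranteed schedule.

```python
from datetime import date, timedelta

# Both waiting periods in the process are two weeks long.
NOTICE_PERIOD = timedelta(weeks=2)


def registration_timeline(reported: date) -> dict:
    """Earliest dates for the later steps, given the initial report date."""
    business_contact_notified = reported + NOTICE_PERIOD
    earliest_crossref_deposit = business_contact_notified + NOTICE_PERIOD
    return {
        "reported": reported,
        "business_contact_notified": business_contact_notified,
        "earliest_crossref_deposit": earliest_crossref_deposit,
    }


timeline = registration_timeline(date(2011, 1, 3))
for step, day in timeline.items():
    print(step, day.isoformat())
```

Under these assumptions a publisher has at least four weeks from the initial report to deposit the DOI itself.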
Conflicts overhaul
Conflicts occur when two (or more) DOIs share the same metadata, suggesting two DOIs are assigned to a single item.
Why are conflicts bad?
Only one DOI should be assigned per item
Queries will return multiple DOIs, causing confusion
Some queries (OpenURL) may not return a DOI if multiple results are present
Conflicts between two DOIs often result in one of the DOIs being neglected
We currently have ~200,000+ conflicts in our system. Not all of them are a problem:
For some items, our schema only allows minimal metadata
Some content types require matching metadata (standards and book chapters with minimal metadata (dictionaries) for example)
Legitimate conflicts
Conflict between 2 prefixes:
http://dx.doi.org/10.1639/0044-7447(2001)030[0037:IOPOFU]2.0.CO;2
http://dx.doi.org/10.1579/0044-7447-30.1.37
Conflict within 1 prefix:
http://dx.doi.org/10.3724/SP.J.1006.2008.00070
http://dx.doi.org/10.3724/SP.J.1006.2008.00770
‘Bad’ conflicts
Conflicts with minimal metadata:
10.1002/ijc.11095
10.1002/ijc.11093
Conflict due to content type:
10.1520/C0506-10
10.1520/C0506-10A
10.1520/C0506-10B
Elements considered during conflict generation:
Journal, book and/or series title
Article title /content_item title (book chapters)
If there is a match between all deposited elements, a conflict is generated. For example, two items with matching journal title, volume, issue, and article title will cause a conflict.
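The matching rule can be sketched as grouping records by the elements they actually deposited. This is an illustration only, not the production matching logic: the field names, the case and whitespace normalization, and the placeholder DOIs are all assumptions.

```python
from collections import defaultdict

# Assumed element names, taken from the example in the text.
FIELDS = ("journal_title", "volume", "issue", "article_title")


def conflict_key(record: dict) -> tuple:
    """Build a key from the elements that were deposited (non-empty)."""
    return tuple(
        (field, str(record[field]).strip().casefold())
        for field in FIELDS
        if record.get(field)
    )


def find_conflicts(records: list) -> list:
    """Return lists of DOIs whose deposited elements all match."""
    groups = defaultdict(list)
    for record in records:
        key = conflict_key(record)
        if key:  # skip records with none of the elements deposited
            groups[key].append(record["doi"])
    return [dois for dois in groups.values() if len(dois) > 1]


records = [
    {"doi": "10.5555/a", "journal_title": "Journal X", "volume": "30",
     "issue": "1", "article_title": "An Example Title"},
    {"doi": "10.5555/b", "journal_title": "journal x", "volume": "30",
     "issue": "1", "article_title": "An example title"},
    {"doi": "10.5555/c", "journal_title": "Journal X", "volume": "31",
     "issue": "1", "article_title": "An Example Title"},
]
print(find_conflicts(records))
```

The sketch also shows why minimal metadata produces ‘bad’ conflicts: the fewer elements deposited, the shorter the key and the more likely two distinct items share it.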
Ideas?
What should our minimum set of metadata be?
How should conflicts be monitored/reported?
Managing your metadata quality
Sample #1: incorrect metadata
Q: My link resolver is retrieving the wrong metadata for DOI 10.1002/rra.1288, causing our links to break - here is my query*:
http://firstname.lastname@example.org&aulast=Null&title=River Research and Applications&volume=26&issue=6&page=663&year=2010
*query metadata matches the response page metadata
A: Two problems with deposited metadata (DOI query):
#1 <year media_type="print">2009</year>
#2 <pages> <first_page>n/a</first_page> <last_page>n/a</last_page> </pages>
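A field-by-field comparison makes this kind of problem easy to spot. The sketch below uses only the values shown in Sample #1; the function name and dictionary keys are hypothetical, and a plain string comparison stands in for whatever matching the query engine really performs.

```python
def mismatched_fields(query: dict, deposited: dict) -> dict:
    """Map each shared field to (query value, deposited value) when they differ."""
    return {
        field: (query[field], deposited[field])
        for field in query
        if field in deposited and str(query[field]) != str(deposited[field])
    }


# Values from the query and from the deposited metadata in Sample #1.
query = {"year": "2010", "first_page": "663"}
deposited = {"year": "2009", "first_page": "n/a"}

print(mismatched_fields(query, deposited))
```

Both fields differ, so a query built from the (correct) response page metadata cannot match the deposited record.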
Sample #2: messy metadata
Q: I know DOI 10.1068/p6742 exists, why doesn’t my query work?
A: Let’s check the guest query form
Metadata for article: Newport R, Preston C, 2010, "Pulling the finger off disrupts agency, embodiment and peripersonal space" Perception 39(9) 1296 – 1298
Problem is: author surname is deposited as:
<person_name sequence="first" contributor_role="author">
<given_name>Roger</given_name></given_name>
<surname><surname>Newport</surname></surname>
</person_name>
Sample #3: duplicate authors
Q: Why does DOI 10.2307/1382491 have multiple versions of the same author?
A: attempt to improve query matching
<contributors>
  <person_name sequence="first" contributor_role="author">
    <given_name>Erling Johan</given_name>
    <surname>Solberg</surname>
  </person_name>
  <person_name sequence="additional" contributor_role="author">
    <given_name>Bernt-Erik</given_name>
    <surname>Sæther</surname>
  </person_name>
  <person_name sequence="additional" contributor_role="author">
    <given_name>Bernt-Erik</given_name>
    <surname>Saether</surname>
  </person_name>
</contributors>
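The second "Saether" entry is an ASCII spelling of "Sæther". As a rough sketch of how such a variant could be derived, the code below strips combining accents and applies a small hand-written fallback table for letters that Unicode does not decompose. The table and function names are assumptions for illustration, not a description of how the depositor produced the variant.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping (assumed table).
FALLBACK = {"æ": "ae", "Æ": "Ae", "ø": "o", "Ø": "O", "ß": "ss"}


def ascii_variant(name: str) -> str:
    """Return the name with ligatures replaced and accents removed."""
    replaced = "".join(FALLBACK.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def surname_variants(surname: str) -> list:
    """Return the original surname plus its ASCII variant, if different."""
    variant = ascii_variant(surname)
    return [surname] if variant == surname else [surname, variant]


print(surname_variants("Sæther"))
print(surname_variants("Solberg"))
```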
New(ish) tools for managing metadata and deposit problems
Schema documentation: http://www.crossref.org/schema/documentation/ or linked from help doc
Reporting problems / asking for help:
Help documentation (http://www.crossref.org/help/)
Support portal and forums (http://support.crossref.org)
Schematron update
Schematron reports notify depositors of non-fatal deposit issues
35-40 emails sent out weekly
Alerts are generated for < 1% of deposits
Tend to identify ‘messy’ deposits
Rules updated periodically
Schematron Warnings
Jr. in surname: AraújoJr Prata Jr. Szezech Jr.
Punctuation in surname: (Earven) Tribble Frederick (Frikkie) J. Arch Marin email@example.com Plauchu********
Other rules:
‘ed’ ‘iss’ ‘vol’ in edition, issue, volume elements
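Real Schematron rules are written as XPath assertions over the deposit XML. The Python below is only a rough analogue of the three warnings listed above, so depositors can pre-check values before submitting. The regular expressions, function names, and warning strings are assumptions, not the actual rule set.

```python
import re

JR_IN_SURNAME = re.compile(r"Jr\.?$")          # suffix left in the surname
PUNCT_IN_SURNAME = re.compile(r"[()@*]")        # characters seen in the examples
LABEL_IN_NUMBER = re.compile(r"\b(ed|iss|vol)\b", re.IGNORECASE)


def surname_warnings(surname: str) -> list:
    """Return warnings for a single <surname> value."""
    warnings = []
    if JR_IN_SURNAME.search(surname.strip()):
        warnings.append("Jr. in surname")
    if PUNCT_IN_SURNAME.search(surname):
        warnings.append("punctuation in surname")
    return warnings


def number_warnings(element: str, value: str) -> list:
    """Warn when an edition, issue, or volume value contains a label."""
    if element in {"edition", "issue", "volume"} and LABEL_IN_NUMBER.search(value):
        return [f"label text in {element}"]
    return []


for name in ["Prata Jr.", "(Earven) Tribble", "Newport"]:
    print(name, surname_warnings(name))
print(number_warnings("volume", "vol. 26"))
```

Like the reports themselves, these checks flag likely ‘messy’ values for review and do not reject a deposit.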