Metadata Quality Evaluation: Experience from the Open Language Archives Community
Paper at ICADL2004 (December 2004, Shanghai)

Metadata Quality Evaluation: Experience from the Open Language Archives Community Presentation Transcript

  • 1. Metadata Quality Evaluation: Experience from the Open Language Archives Community. Baden Hughes, Department of Computer Science and Software Engineering, University of Melbourne, [email_address]
  • 2. Presentation Overview
    • Introduction
    • OLAC Community Background
    • Motivation
    • Algorithm Design
    • Implementation
    • Demo
    • Evaluation
    • Future Directions
    • Conclusion
  • 3. Introduction
    • Distributed metadata creation practices unfortunately result in highly variable metadata quality
    • A lack of extant metadata quality evaluation tools and methodologies means addressing this variation is difficult
    • Our contribution is a suite of metadata quality evaluation tools within a specific OAI sub-domain, the Open Language Archives Community
    • A significant feature of these tools is the ability to assess metadata quality against both actual community practice and external best practice standards
  • 4. Open Language Archives Community (OLAC)
    • An open consortium of 29 linguistic data archives cataloguing 27K language-related objects
    • OLAC metadata is a Dublin Core application profile for domain-specific areas: language, linguistic type, subject language, linguistic subject, linguistic role
    • Based on the OAI architecture: a two-tiered data provider and service provider model linked by the OAI Protocol for Metadata Harvesting (OAI-PMH)
    • A number of OLAC innovations have motivated the development of OAI services: static repositories, virtual service providers, personal metadata creation and management tools
  • 5. Motivation
    • To establish infrastructural support for ongoing metadata quality evaluation
    • Validation tools for higher layer interoperability such as OAI work well for conformance checking
    • At a community level, we are generally lacking tools which provide qualitative analyses in both semantic and syntactic modes
    • Differentiating factors of our work: establishing a common baseline; assisting individual data providers directly; assessing use of OLAC controlled vocabularies (CVs)
  • 6. Algorithm Design #1
    • Basic objective is to generate a score for each metadata record based on Dublin Core and OLAC best practice recommendations
    • Code Existence Score: number of elements containing code attributes divided by the number of elements of a type associated with a controlled vocabulary in the record
    • Element Absence Penalty: number of core elements absent divided by total number of core elements in the record
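The two base metrics on this slide can be sketched in Python. The record representation below is a hypothetical simplification for illustration: OLAC records are actually XML, so the `(element_name, has_code, uses_cv)` tuples and the sample element names stand in for real metadata elements.

```python
def code_existence_score(record):
    """Fraction of CV-associated elements that carry a code attribute.

    `record` is a hypothetical simplification: a list of
    (element_name, has_code, uses_cv) tuples standing in for the
    XML elements of an OLAC metadata record.
    """
    cv_elements = [e for e in record if e[2]]    # elements tied to a controlled vocabulary
    if not cv_elements:
        return 1.0                               # no CV-associated elements to score
    coded = [e for e in cv_elements if e[1]]     # those actually carrying a code attribute
    return len(coded) / len(cv_elements)


def element_absence_penalty(record, core_elements):
    """Fraction of the core element set missing from the record."""
    present = {name for name, _, _ in record}
    absent = [e for e in core_elements if e not in present]
    return len(absent) / len(core_elements)


record = [("subject", True, True),
          ("language", False, True),
          ("title", False, False)]
core = ["title", "subject", "description", "date"]

print(code_existence_score(record))              # 1 of 2 CV elements coded -> 0.5
print(element_absence_penalty(record, core))     # 2 of 4 core elements absent -> 0.5
```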
  • 7. Algorithm Design #2
    • Per Metadata Record Weighted Aggregate: an arbitrary maximum multiplied by the weighted product of Code Existence Score and Element Absence Penalty
    • Derivative metrics: archive diversity, metadata quality score, core elements per record, core element usage, code usage, code and element usage, “star rating”
    • Using these metrics, we compute a score for each metadata record in an archive; each archive in total; and for the whole community
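Under the same simplification, the per-record aggregate and the archive-level roll-up might look as follows. The maximum value of 10, the unit weights, and the assumption that the penalty enters as (1 - penalty) are all illustrative choices, not values taken from the paper:

```python
MAX_SCORE = 10.0   # the "arbitrary maximum"; 10 is an assumed value


def record_score(code_existence, absence_penalty,
                 code_weight=1.0, absence_weight=1.0):
    """Per Metadata Record Weighted Aggregate (sketch).

    Missing core elements are assumed to lower the score, so the
    penalty is assumed to enter as (1 - penalty); the weights here
    are illustrative defaults.
    """
    return (MAX_SCORE
            * code_weight * code_existence
            * absence_weight * (1.0 - absence_penalty))


def archive_score(record_scores):
    """Archive-level quality: mean of per-record scores (assumed roll-up)."""
    return sum(record_scores) / len(record_scores)


print(record_score(1.0, 0.0))    # fully coded, no absences -> 10.0
print(record_score(0.5, 0.5))    # half coded, half absent  -> 2.5
print(archive_score([10.0, 2.5]))
```

A community-wide score would then be the same roll-up applied across all archives.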
  • 8. Implementation
    • Live service at
    • Metadata quality evaluation suite installed in the service layer, on top of the OLAC Harvester and Aggregator
    • Based on Apache, MySQL, and PHP – runs on Windows/Mac/Linux
    • All codebase components are open source, licensed under GPL, and available from SourceForge
  • 9. Demo
    • Metadata quality report on all OLAC Data Providers
      • [ Live ]
      • [ Local ]
    • Metadata quality report for a single OLAC data provider (PARADISEC)
      • [ Live ]
      • [ Local ]
  • 10. Evaluation #1
    • Creating a data provider ranking system was not a primary goal of the work reported here
    • Per data provider
      • Apparently no systematic correlation between size of archive and overall metadata quality
      • A positive correlation between size of archive and the average number of elements per metadata record
    • Community-wide
      • Additional evidence supporting earlier work as to most common metadata elements
      • 4 distinct classes: subject; title, description, date, identifier, creator; format, type, contributor, publisher, isPartOf; all others (including OLAC CVs)
  • 11. Evaluation #2
    • Qualitatively-based archive clustering
      • 3 distinct groups of archives based on Per Metadata Record Weighted Aggregate
      • Characterised by metadata creation technique, size, number of elements used, application of OLAC controlled vocabularies
    • Use of OLAC CVs
      • Subject: OLAC CV used 56% of the time, for language identification where the DC recommendation of ISO 639-2 is too coarse
      • Contributor: OLAC CV used 78% of the time, for distinct roles in the linguistic data creation/curation process
      • Type: OLAC CV used 33% of the time, surprising given the domain requirement for differentiating linguistic data types
  • 12. Future Directions
    • Algorithm improvements – particularly weighting in proportion to size of data provider
    • A longitudinal study of metadata evolution, including qualitative aspects (commenced, and retrofitted to Jan 2002)
    • New services based on quality attributes – the OLAC Search Engine uses metadata quality as a ranking scheme for result sets
    • New metrics which reflect other values of the OLAC community, e.g. online data availability, use of CVs
  • 13. Conclusions
    • Reported the design and deployment of scalable, dynamic metadata quality evaluation infrastructure
    • A distinct contribution in the absence of comparable services and models; our code is open for the community to experiment with
    • Allowing more accurate identification of leverage points for metadata enrichment effort
    • Promoting better practice in metadata development and management
    • Ultimately enabling better search and retrieval experiences for end users
  • 14. Acknowledgements
    • National Science Foundation Grants #9910603 (International Standards in Language Engineering) and #0094934 (Querying Linguistic Databases)
    • Amol Kamat, Steven Bird and Gary Simons
    • ICADL Program Committee and Reviewers