Metadata Quality Evaluation: Experience from the Open Language Archives Community
Published on: Paper at ICADL 2004 (December 2004, Shanghai)
  1. Metadata Quality Evaluation: Experience from the Open Language Archives Community Baden Hughes Department of Computer Science and Software Engineering University of Melbourne [email_address]
  2. Presentation Overview <ul><li>Introduction </li></ul><ul><li>OLAC Community Background </li></ul><ul><li>Motivation </li></ul><ul><li>Algorithm Design </li></ul><ul><li>Implementation </li></ul><ul><li>Demo </li></ul><ul><li>Evaluation </li></ul><ul><li>Future Directions </li></ul><ul><li>Conclusion </li></ul>
  3. Introduction <ul><li>It is unfortunate that distributed metadata creation practices result in highly variable metadata quality </li></ul><ul><li>A lack of extant metadata quality evaluation tools and methodologies makes this variation difficult to address </li></ul><ul><li>Our contribution is a suite of metadata quality evaluation tools within a specific OAI sub-domain, the Open Language Archives Community </li></ul><ul><li>A significant feature of these tools is the ability to assess metadata quality against both actual community practice and external best-practice standards </li></ul>
  4. Open Language Archives Community (OLAC) <ul><li>An open consortium of 29 linguistic data archives cataloguing 27K language-related objects </li></ul><ul><li>OLAC metadata is a Dublin Core application profile for domain-specific areas: language, linguistic type, subject language, linguistic subject, linguistic role </li></ul><ul><li>Based on OAI architecture, a two-tiered data provider and service provider model linked by the OAI Protocol for Metadata Harvesting (OAI-PMH) </li></ul><ul><li>A number of OLAC innovations have motivated the development of OAI services: static repositories, virtual service providers, personal metadata creation and management tools </li></ul>
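In the two-tiered model above, a service provider harvests records from each data provider via OAI-PMH. A minimal sketch of how such a request URL is built, assuming an illustrative base URL (the verb and parameter names are standard OAI-PMH; the `olac` metadata prefix is the one OLAC defines):

```python
# Sketch: building an OAI-PMH ListRecords request URL, as a harvester
# in the OLAC service layer would. The base URL below is illustrative.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="olac", resumption_token=None):
    """Return a ListRecords request URL; use the resumption token on
    follow-up requests when the provider paginates its response."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

print(list_records_url("http://example.org/oai"))
# -> http://example.org/oai?verb=ListRecords&metadataPrefix=olac
```

The same function covers both the initial request and resumption-token follow-ups, which is the usual shape of an OAI-PMH harvesting loop.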
  5. Motivation <ul><li>To establish infrastructural support for ongoing metadata quality evaluation </li></ul><ul><li>Validation tools for higher-layer interoperability such as OAI work well for conformance checking </li></ul><ul><li>At a community level, we generally lack tools that provide qualitative analyses in both semantic and syntactic modes </li></ul><ul><li>Differentiating factors of our work: establishing a common baseline; assisting individual data providers directly; assessing use of OLAC controlled vocabularies (CVs) </li></ul>
  6. Algorithm Design #1 <ul><li>Basic objective is to generate a score for each metadata record based on Dublin Core and OLAC best practice recommendations </li></ul><ul><li>Code Existence Score: number of elements containing code attributes divided by the number of elements of a type associated with a controlled vocabulary in the record </li></ul><ul><li>Element Absence Penalty: number of core elements absent divided by total number of core elements in the record </li></ul>
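The two per-record metrics above can be sketched as follows. This is an illustrative reading of the definitions on the slide, assuming a record is a list of (element name, has-code-attribute) pairs; the CV-associated and core element sets shown are examples, not the official OLAC lists:

```python
# Sketch of the two per-record metrics from Algorithm Design #1.
# CV_ELEMENTS and CORE_ELEMENTS are illustrative placeholders, not the
# exact sets used by the OLAC report card.

CV_ELEMENTS = {"subject", "type", "contributor"}          # elements tied to a CV (assumed)
CORE_ELEMENTS = {"title", "description", "date", "identifier",
                 "creator", "subject", "type", "format"}  # assumed core set

def code_existence_score(record):
    """Fraction of CV-associated elements in the record that carry a code attribute."""
    cv_slots = [has_code for name, has_code in record if name in CV_ELEMENTS]
    return sum(cv_slots) / len(cv_slots) if cv_slots else 1.0

def element_absence_penalty(record):
    """Fraction of core elements absent from the record."""
    present = {name for name, _ in record}
    return len(CORE_ELEMENTS - present) / len(CORE_ELEMENTS)

record = [("title", False), ("subject", True), ("type", False), ("date", False)]
print(code_existence_score(record))     # 1 of 2 CV slots coded -> 0.5
print(element_absence_penalty(record))  # 4 of 8 core elements absent -> 0.5
```

A record that codes all its CV elements scores 1.0 on the first metric; a record containing every core element incurs a penalty of 0.0 on the second.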
  7. Algorithm Design #2 <ul><li>Per Metadata Record Weighted Aggregate: an arbitrary maximum multiplied by the weighted product of Code Existence Score and Element Absence Penalty </li></ul><ul><li>Derivative metrics: archive diversity, metadata quality score, core elements per record, core element usage, code usage, code and element usage, “star rating” </li></ul><ul><li>Using these metrics, we compute a score for each metadata record in an archive; each archive in total; and for the whole community </li></ul>
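One plausible reading of the aggregate is sketched below: an arbitrary maximum scaled by the weighted Code Existence Score and discounted by the weighted Element Absence Penalty, then averaged per archive. The maximum, the weights, and the exact combination formula are assumptions for illustration; the paper's actual weighting is not reproduced here:

```python
# Sketch of a Per Metadata Record Weighted Aggregate and its roll-up to
# an archive-level score. MAX_SCORE and the weights are illustrative.

MAX_SCORE = 10.0  # "arbitrary maximum" from the slide (assumed value)

def weighted_aggregate(ces, eap, w_code=1.0, w_absence=1.0):
    """Combine Code Existence Score (reward) and Element Absence Penalty
    (discount) into a single per-record score."""
    return MAX_SCORE * (w_code * ces) * (1.0 - w_absence * eap)

def archive_quality(record_scores):
    """Roll per-record scores up to an archive-level average, as the
    slide's per-archive and community-wide scores suggest."""
    return sum(record_scores) / len(record_scores)

scores = [weighted_aggregate(0.5, 0.5), weighted_aggregate(1.0, 0.0)]
print(archive_quality(scores))  # (2.5 + 10.0) / 2 -> 6.25
```

The same averaging applied across all archives yields the community-wide figure the slide mentions.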
  8. Implementation <ul><li>Live service at http://www.language-archives.org/tools/reports/archiveReportCard.php </li></ul><ul><li>Metadata quality evaluation suite installed in the service layer on top of the OLAC Harvester and Aggregator </li></ul><ul><li>Based on Apache, MySQL, and PHP – runs on Windows/Mac/Linux </li></ul><ul><li>All codebase components are open source, licensed under the GPL, and available from SourceForge http://sf.net/projects/olac </li></ul>
  9. Demo <ul><li>Metadata quality report on all OLAC Data Providers </li></ul><ul><ul><li>[ Live ] </li></ul></ul><ul><ul><li>[ Local ] </li></ul></ul><ul><li>Metadata quality report for a single OLAC data provider (PARADISEC) </li></ul><ul><ul><li>[ Live ] </li></ul></ul><ul><ul><li>[ Local ] </li></ul></ul>
  10. Evaluation #1 <ul><li>Creating a data provider ranking system was not a primary goal of the work reported here </li></ul><ul><li>Per data provider </li></ul><ul><ul><li>Apparently no systematic correlation between size of archive and overall metadata quality </li></ul></ul><ul><ul><li>A positive correlation between size of archive and the average number of elements per metadata record </li></ul></ul><ul><li>Community-wide </li></ul><ul><ul><li>Additional evidence supporting earlier work as to the most common metadata elements </li></ul></ul><ul><ul><li>4 distinct classes: subject; title, description, date, identifier, creator; format, type, contributor, publisher, isPartOf; all others (including OLAC CVs) </li></ul></ul>
  11. Evaluation #2 <ul><li>Qualitatively-based archive clustering </li></ul><ul><ul><li>3 distinct groups of archives based on Per Metadata Record Weighted Aggregate </li></ul></ul><ul><ul><li>Characterised by metadata creation technique, size, number of elements used, application of OLAC controlled vocabularies </li></ul></ul><ul><li>Use of OLAC CVs </li></ul><ul><ul><li>Subject: OLAC CV used 56% of the time, for language identification where the DC recommendation of ISO 639-2 is too coarse </li></ul></ul><ul><ul><li>Contributor: OLAC CV used 78% of the time, for distinct roles in the linguistic data creation/curation process </li></ul></ul><ul><ul><li>Type: OLAC CV used 33% of the time, surprising given the domain requirement for differentiating linguistic data types </li></ul></ul>
  12. Future Directions <ul><li>Algorithm improvements – particularly weighting in proportion to size of data provider </li></ul><ul><li>A longitudinal study of metadata evolution, including qualitative aspects (commenced, and retrofitted to Jan 2002) </li></ul><ul><li>New services based on quality attributes – the OLAC Search Engine uses metadata quality as a ranking scheme for result sets </li></ul><ul><li>New metrics which reflect other values of the OLAC community, e.g. online data, use of CVs </li></ul>
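The quality-as-ranking idea above can be sketched as follows. How the OLAC Search Engine actually combines relevance and quality is not stated on the slide, so the product used here is purely an illustrative choice, as are the record identifiers:

```python
# Sketch: folding a metadata quality score into search result ranking,
# in the spirit of the OLAC Search Engine direction described above.
# The relevance * quality combination is an assumption for illustration.

def rank_results(results):
    """Sort (identifier, relevance, quality) hits by relevance * quality,
    best first, so well-described records outrank poorly described ones
    of similar relevance."""
    return sorted(results, key=lambda r: r[1] * r[2], reverse=True)

hits = [("rec-a", 0.9, 0.4), ("rec-b", 0.7, 0.9), ("rec-c", 0.8, 0.5)]
print([h[0] for h in rank_results(hits)])  # -> ['rec-b', 'rec-c', 'rec-a']
```

Note how `rec-a`, the most relevant hit, drops to last place because its quality score is low; that trade-off is the point of quality-aware ranking.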
  13. Conclusions <ul><li>Reported the design and deployment of scalable, dynamic metadata quality evaluation infrastructure </li></ul><ul><li>A distinct contribution in the absence of comparable services and models; our code is open for the community to experiment with </li></ul><ul><li>Allowing more accurate identification of leverage points for metadata enrichment effort </li></ul><ul><li>Promoting better practice in metadata development and management </li></ul><ul><li>Ultimately enabling better search and retrieval experiences for end users </li></ul>
  14. Acknowledgements <ul><li>National Science Foundation Grants #9910603 (International Standards in Language Engineering) and #0094934 (Querying Linguistic Databases) </li></ul><ul><li>Amol Kamat, Steven Bird and Gary Simons </li></ul><ul><li>ICADL Program Committee and Reviewers </li></ul>