Metadata Quality Evaluation: Experience from the Open Language Archives Community
Baden Hughes
Department of Computer Science and Software Engineering, University of Melbourne
[email_address]
Presentation Overview
- Introduction
- OLAC Community Background
- Motivation
- Algorithm Design
- Implementation
- Demo
- Evaluation
- Future Directions
- Conclusion
Introduction
- It is unfortunate that distributed metadata creation practices result in highly variable metadata quality.
- A lack of extant metadata quality evaluation tools and methodologies makes this variation difficult to address.
- Our contribution is a suite of metadata quality evaluation tools within a specific OAI sub-domain, the Open Language Archives Community.
- A significant feature of these tools is the ability to assess metadata quality against both actual community practice and external best practice standards.
Open Language Archives Community (OLAC)
- An open consortium of 29 linguistic data archives cataloguing 27K language-related objects.
- OLAC metadata is a Dublin Core application profile with domain-specific extensions: language, linguistic type, subject language, linguistic subject, and linguistic role (an example record is sketched after this list).
- Based on the OAI architecture: a two-tiered data provider / service provider model linked by the OAI Protocol for Metadata Harvesting (OAI-PMH).
- A number of OLAC innovations have motivated the development of OAI services: static repositories, virtual service providers, and personal metadata creation and management tools.
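For concreteness, a minimal sketch of what an OLAC record looks like, assuming the OLAC 1.1 namespace; the record content here is invented, but it shows the application-profile pattern of Dublin Core elements refined with an xsi:type from an OLAC vocabulary and an olac:code attribute carrying the controlled-vocabulary code.

```python
# Sketch of an OLAC metadata record (values hypothetical): Dublin Core elements
# refined with an xsi:type from an OLAC vocabulary and an olac:code attribute.
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"

RECORD = """
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>A Hypothetical Lexicon</dc:title>
  <dc:subject xsi:type="olac:language" olac:code="mri"/>
  <dc:type xsi:type="olac:linguistic-type" olac:code="lexicon"/>
  <dc:contributor xsi:type="olac:role" olac:code="compiler">A. Compiler</dc:contributor>
</olac:olac>
"""

root = ET.fromstring(RECORD)
for el in root:
    tag = el.tag.split("}")[1]           # strip the namespace URI
    code = el.get(f"{{{OLAC_NS}}}code")  # controlled-vocabulary code, if any
    print(tag, code)                     # e.g. "subject mri"
```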
Motivation
- To establish infrastructural support for ongoing metadata quality evaluation.
- Validation tools for higher-layer interoperability such as OAI work well for conformance checking.
- At the community level, we generally lack tools which provide qualitative analyses in both semantic and syntactic modes.
- Differentiating factors of our work: establishing a common baseline; assisting individual data providers directly; assessing use of OLAC controlled vocabularies (CVs).
Algorithm Design #1
- The basic objective is to generate a score for each metadata record based on Dublin Core and OLAC best practice recommendations (the two component scores are sketched after this list).
- Code Existence Score: the number of elements containing code attributes, divided by the number of elements in the record whose type is associated with a controlled vocabulary.
- Element Absence Penalty: the number of core elements absent from the record, divided by the total number of core elements.
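A minimal sketch of the two component scores as defined above; the record representation and the particular controlled-vocabulary and core element sets are illustrative stand-ins, not the deployed configuration.

```python
# Component scores: a record is represented as (element_name, has_code) pairs.
# The element sets below are illustrative, not OLAC's exact configuration.
CV_ELEMENTS = {"subject", "type", "contributor"}   # elements paired with an OLAC CV (assumed subset)
CORE_ELEMENTS = {"title", "creator", "subject", "description", "publisher",
                 "contributor", "date", "type", "format", "identifier"}  # assumed core set

def code_existence_score(record):
    """Fraction of CV-associated elements that actually carry a code attribute."""
    cv_els = [has_code for name, has_code in record if name in CV_ELEMENTS]
    if not cv_els:
        return 0.0
    return sum(cv_els) / len(cv_els)

def element_absence_penalty(record):
    """Fraction of core elements missing from the record."""
    present = {name for name, _ in record}
    return len(CORE_ELEMENTS - present) / len(CORE_ELEMENTS)

record = [("title", False), ("subject", True), ("type", False)]
print(code_existence_score(record))     # 0.5: one of two CV elements is coded
print(element_absence_penalty(record))  # 0.7: seven of ten core elements absent
```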
Algorithm Design #2
- Per Metadata Record Weighted Aggregate: an arbitrary maximum multiplied by the weighted product of the Code Existence Score and the Element Absence Penalty (sketched below).
- Derivative metrics: archive diversity, metadata quality score, core elements per record, core element usage, code usage, code and element usage, “star rating”.
- Using these metrics, we compute a score for each metadata record in an archive, for each archive in total, and for the whole community.
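A sketch of the weighted aggregate and a derived star rating. The slide fixes only the shape of the formula (an arbitrary maximum times a weighted product of the two components), so the maximum, the weights, the use of (1 - EAP) so that absences lower the score, and the star mapping below are all assumptions.

```python
# Per Metadata Record Weighted Aggregate (shape from the slide; constants assumed).
MAX_SCORE = 10.0   # arbitrary maximum (assumed)
W_CODE = 0.5       # weight on the Code Existence Score (assumed)
W_ABSENCE = 0.5    # weight on the Element Absence Penalty (assumed)

def weighted_aggregate(ces, eap):
    """Scale the weighted product of the components to [0, MAX_SCORE].
    The penalty enters as (1 - eap) so missing core elements lower the score."""
    return MAX_SCORE * (W_CODE * ces) * (W_ABSENCE * (1.0 - eap))

def star_rating(score, stars=5):
    """Map an aggregate score onto a coarse 1..stars scale (assumed mapping)."""
    return max(1, round(stars * score / MAX_SCORE))

score = weighted_aggregate(ces=0.5, eap=0.7)
print(score, star_rating(score))  # 0.375 1
```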
Implementation
- Live service at http://www.language-archives.org/tools/reports/archiveReportCard.php
- The metadata quality evaluation suite is installed in the service layer, on top of the OLAC Harvester and Aggregator (a harvesting sketch follows this list).
- Based on Apache, MySQL, and PHP; runs on Windows/Mac/Linux.
- All codebase components are open source, licensed under the GPL, and available from SourceForge: http://sf.net/projects/olac
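Beneath the service layer, the harvester pulls records from data providers over OAI-PMH. A minimal sketch of such a request; the base URL is a hypothetical data provider endpoint, and "olac" is the metadata prefix OLAC providers expose alongside plain oai_dc.

```python
# Minimal OAI-PMH ListRecords request, as issued by a harvester.
import urllib.parse
import urllib.request

BASE_URL = "http://example.org/oai"  # hypothetical data provider endpoint

params = urllib.parse.urlencode({
    "verb": "ListRecords",     # standard OAI-PMH verb
    "metadataPrefix": "olac",  # request records in OLAC format
})
with urllib.request.urlopen(f"{BASE_URL}?{params}") as response:
    xml_payload = response.read()  # OAI-PMH envelope containing <record> elements
print(len(xml_payload))
```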
Demo
- Metadata quality report on all OLAC data providers [ Live ] [ Local ]
- Metadata quality report for a single OLAC data provider (PARADISEC) [ Live ] [ Local ]
Evaluation #1
- Creating a data provider ranking system was not a primary goal of the work reported here.
- Per data provider:
  - Apparently no systematic correlation between the size of an archive and its overall metadata quality.
  - A positive correlation between the size of an archive and the average number of elements per metadata record.
- Community-wide:
  - Additional evidence supporting earlier work on the most common metadata elements.
  - Four distinct classes: subject; title, description, date, identifier, creator; format, type, contributor, publisher, isPartOf; all others (including OLAC CVs).
Evaluation #2
- Qualitatively-based archive clustering:
  - Three distinct groups of archives based on the Per Metadata Record Weighted Aggregate.
  - Characterised by metadata creation technique, size, number of elements used, and application of OLAC controlled vocabularies.
- Use of OLAC CVs:
  - Subject: OLAC CV used 56% of the time, for language identification where the DC recommendation of ISO 639-2 is too coarse.
  - Contributor: OLAC CV used 78% of the time, for distinct roles in the linguistic data creation/curation process.
  - Type: OLAC CV used only 33% of the time, which is surprising given the domain requirement for differentiating linguistic data types.
Future Directions
- Algorithm improvements, particularly weighting in proportion to the size of the data provider.
- A longitudinal study of metadata evolution, including qualitative aspects (commenced, and retrofitted back to January 2002).
- New services based on quality attributes: the OLAC Search Engine uses metadata quality as a ranking scheme for result sets (a re-ranking sketch follows this list).
- New metrics which reflect other values of the OLAC community, e.g. online data, use of CVs.
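As a hypothetical illustration of quality-aware ranking, a sketch that blends a query relevance score with a normalised quality score; the blending formula and weight are assumptions, not the OLAC Search Engine's actual scheme.

```python
# Re-ranking search hits by a blend of relevance and metadata quality (assumed formula).
from dataclasses import dataclass

@dataclass
class Hit:
    identifier: str
    relevance: float  # query match score in [0, 1]
    quality: float    # Per Metadata Record Weighted Aggregate, normalised to [0, 1]

def rank(hits, quality_weight=0.3):
    """Order hits by a convex blend of relevance and metadata quality."""
    key = lambda h: (1 - quality_weight) * h.relevance + quality_weight * h.quality
    return sorted(hits, key=key, reverse=True)

hits = [Hit("oai:example:1", 0.80, 0.20),
        Hit("oai:example:2", 0.75, 0.90)]
print([h.identifier for h in rank(hits)])  # the higher-quality record 2 ranks first
```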
Conclusions
- Reported the design and deployment of scalable, dynamic metadata quality evaluation infrastructure.
- A distinct contribution in the absence of comparable services and models; our code is open for the community to experiment with.
- Allows more accurate identification of leverage points for metadata enrichment effort.
- Promotes better practice in metadata development and management.
- Ultimately enables better search and retrieval experiences for end users.
Acknowledgements
- National Science Foundation Grants #9910603 (International Standards in Language Engineering) and #0094934 (Querying Linguistic Databases)
- Amol Kamat, Steven Bird and Gary Simons
- ICADL Program Committee and Reviewers