Easter JISC metadata May25 DT

EASTER

Evaluating Automated Subject Tools
for Enhancing Retrieval

Douglas Tudhope
Hypermedia Research Unit
University of Glamorgan

JISC Automatic Metadata Generation Meeting, London, May 25, 2010

Background

• EASTER is an 18-month JISC project funded under the Information
Environment Programme 2009-11.

• Started April 2009 and involves eight institutional partners

• Aim is to test and evaluate a range of current tools for automated
subject metadata generation

• Anticipated outcomes:
– better understanding of limitations and what is possible
– recommendations for services employing subject metadata in the JISC community

Rationale – problems, issues, relevance

• EASTER investigates the creation and enrichment of subject
metadata using existing automated tools.

• Subject metadata are the most important metadata for resource
discovery, yet the most expensive to produce manually. They are
also harder to generate automatically than formal metadata such
as file type or title. Widely used in retrieval and NLP tools.

• Due to the high cost of evaluation, automated subject metadata
tools are rarely tested in live environments of use.

• The challenge facing digital collections, institutional repositories
and aggregators is how to provide high-quality subject metadata at
reasonable cost.

Intute testbed

• Test-bed is Intute (http://www.intute.ac.uk),
a collection of (mostly) websites.
However, results are intended to be generally applicable.

• Tools for automated subject metadata generation
will be tested in two contexts:
– Intute cataloguers in the cataloguing workflow
– end-users of Intute who search for information

• A task-based end-user retrieval study will examine the contribution
of automatically assigned terms and manually assigned terms

Methodology

• A methodology for evaluating such tools is intended as a significant
project outcome/contribution

• Low consistency between cataloguers, and between indexing at
different times, is a recognised problem

• EASTER methodology includes creating an enhanced ‘gold
standard’ test collection through careful manual cataloguing and
expert review by cataloguers and users. The methodology also
provides for considering automatic indexing output when building
the enhanced gold standard.
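
The two measurement ideas above (consistency between cataloguers, and comparison of tool output with a gold standard) can be sketched in a few lines. This is a minimal illustration only, not the EASTER methodology itself: it assumes index terms are compared by exact match, uses Hooper's consistency measure (shared terms divided by all distinct terms) as one common choice, and all term lists are invented for the example.

```python
def hooper_consistency(terms_a, terms_b):
    """Hooper's measure: shared terms / all distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets are trivially consistent
    return len(a & b) / len(a | b)


def precision_recall_f1(assigned, gold):
    """Score automatically assigned terms against gold-standard terms (exact match)."""
    assigned, gold = set(assigned), set(gold)
    true_positives = len(assigned & gold)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example data (hypothetical terms, not taken from Intute).
cataloguer_a = ["cattle", "mastitis", "dairy farming"]
cataloguer_b = ["cattle", "mastitis", "milk production"]
gold_standard = ["cattle", "mastitis", "dairy farming", "milk production"]
tool_output = ["cattle", "mastitis", "udders", "animal health"]

print("Inter-cataloguer consistency:", hooper_consistency(cataloguer_a, cataloguer_b))
print("Tool vs gold standard (P, R, F1):", precision_recall_f1(tool_output, gold_standard))
```

Exact matching is the strictest option; an evaluation could also give partial credit for broader or narrower thesaurus terms, which this sketch does not attempt.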

Candidate Tools

Initial candidate tools (a subset will be selected after review)

1) Temis Categorizer (French SME, in-house): machine learning, classification
2) KEA, new version Maui (Waikato): indexing
(http://www.nzdl.org/Kea/)
3) TextGarden
(http://kt.ijs.si/Dunja/textgarden/)
4) TerMine (NACTEM): noun phrase
(http://www.nactem.ac.uk/software/termine/)
5) KnowLib’s automated classifier (Lund): classification
(http://www.it.lth.se/knowlib/auto.htm)
6) Scorpion (OCLC)
(http://www.oclc.org/research/software/scorpion/default.htm)
7) iVia project’s libiViaClassification (UC Riverside)
(http://ivia.ucr.edu/manuals/stable/libiViaClassification/5.4.0/)

Progress

• Three subject domains distinguished, each associated with different thesauri
– VETERINARY - CAB Thesaurus
– VISUAL ARTS - AAT
– POLITICS - HASSET (IBSS?)

• KEA/Maui: thesauri and training set
• AutoClass: thesauri; need to consider main classes to classify
• TerMine: none
• Temis: thesauri and training set, depending on mode
(IPR of thesauri for commercial use is an issue)

• Conversion of thesauri to SKOS format underway
• Web crawler for EASTER purposes implemented
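
As an illustration of what a thesaurus-to-SKOS conversion involves, the sketch below maps the usual thesaurus relationship codes (UF, BT, NT, RT) to SKOS properties and writes Turtle. It is a simplified assumption of how such a conversion could look, not the project's converter: the record layout, the `to_skos_turtle` helper and the `http://example.org/thesaurus/` namespace are all hypothetical, and only the SKOS property names come from the SKOS vocabulary.

```python
import re

# Standard thesaurus relationship codes mapped to SKOS properties.
RELATIONS = (("BT", "skos:broader"), ("NT", "skos:narrower"), ("RT", "skos:related"))


def slug(term):
    """Turn a term into a simple local name, e.g. 'Dairy cattle' -> 'dairy-cattle'."""
    return re.sub(r"[^a-z0-9]+", "-", term.lower()).strip("-")


def esc(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_skos_turtle(records, base="http://example.org/thesaurus/"):
    """Serialise thesaurus records (dicts with 'term', UF, BT, NT, RT) as SKOS Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",  # hypothetical namespace
        "",
    ]
    for rec in records:
        props = ["a skos:Concept", f'skos:prefLabel "{esc(rec["term"])}"@en']
        for non_preferred in rec.get("UF", []):
            props.append(f'skos:altLabel "{esc(non_preferred)}"@en')
        for code, prop in RELATIONS:
            for target in rec.get(code, []):
                props.append(f"{prop} ex:{slug(target)}")
        lines.append(f"ex:{slug(rec['term'])} " + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


# Invented sample record for illustration.
sample = [{"term": "Dairy cattle", "UF": ["Milk cows"], "BT": ["Cattle"]}]
print(to_skos_turtle(sample))
```

A real conversion would also need to handle scope notes, language tags other than English, and terms whose labels do not yield a usable local name.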

Lessons learned
Preliminary stages – provisional general observations

• Subject metadata generation tools are typically complex, layered
software. They require maintenance to stay current, and installation
may not be trivial. Resource implications.

• General subject metadata generation tools often require tuning and
adaptation for different contexts and subject domains.
Resource implications.

• Subject metadata generation for what purpose? Classification,
indexing and annotation are associated with different use cases.
E.g. browsing and search require different metadata for best results.
An individual tool may not support all use cases.

• Possibilities for pipelining different approaches (tools) in sequence
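
The pipelining idea can be pictured as a chain of stages in which each tool consumes the previous tool's output. The sketch below is purely illustrative and does not use any of the candidate tools: stage 1 is a naive stand-in for a term extractor (all word n-grams), stage 2 is a stand-in for thesaurus-based indexing (exact lookup of labels), and the vocabulary and function names are hypothetical.

```python
import re
from functools import partial


def extract_candidates(text, max_words=3):
    """Stage 1 (stand-in for a term extractor): every word n-gram up to max_words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def map_to_vocabulary(candidates, vocabulary):
    """Stage 2 (stand-in for thesaurus indexing): keep candidates matching a known label."""
    return sorted({vocabulary[c] for c in candidates if c in vocabulary})


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


# Hypothetical vocabulary: lower-cased label -> preferred term.
vocabulary = {
    "oil painting": "Oil painting",
    "oil paintings": "Oil painting",
    "fresco": "Fresco",
}

stages = [extract_candidates, partial(map_to_vocabulary, vocabulary=vocabulary)]
print(run_pipeline("A guide to oil paintings and fresco techniques.", stages))
```

Keeping each stage a plain function with one input makes it easy to swap a stand-in for a real tool, or to insert a further stage such as a classifier.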

STAR/STELLAR Projects also relevant
Information Extraction from archaeology grey literature (AHRC)

• ‘Rich’ semantic indexing of archaeology fieldwork reports (ADS
OASIS Grey Literature) with respect to the English Heritage
extension (CRM-EH) of the CRM (Conceptual Reference Model)
ontology, making use of EH thesauri/glossaries and the GATE NLP tool.

• Transforms GATE XML annotations to RDF triples conformant to
the conceptual model, allowing cross search with datasets.

• In progress:
web service interface planned for NLP semantic indexing

• STAR terminology services (based on SKOS vocabularies)
with browser-neutral JavaScript widgets
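
To make the annotation-to-RDF step concrete, the sketch below turns a small XML annotation file into N-Triples. It is an assumption-laden illustration rather than the STAR implementation: the XML layout is a simplified stand-in and not the actual GATE document format, and the `http://example.org/model#` class and property URIs are hypothetical placeholders, not real CRM-EH identifiers. Only the `rdf:type` and `rdfs:label` URIs are standard.

```python
import xml.etree.ElementTree as ET

# Simplified, invented annotation file (NOT the real GATE XML format).
SAMPLE = """<annotations doc="http://example.org/report/1">
  <annotation id="1" type="Find" start="10" end="17">pottery</annotation>
  <annotation id="2" type="Period" start="30" end="35">Roman</annotation>
</annotations>"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
MODEL = "http://example.org/model#"  # hypothetical stand-in for the conceptual model


def annotations_to_ntriples(xml_text):
    """Emit three N-Triples per annotation: its class, its text, its source document."""
    root = ET.fromstring(xml_text)
    doc = root.get("doc")
    triples = []
    for ann in root.findall("annotation"):
        subject = f"<{doc}#ann{ann.get('id')}>"
        # Escape backslashes and quotes so the literal is valid N-Triples.
        label = (ann.text or "").replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f"{subject} <{RDF_TYPE}> <{MODEL}{ann.get('type')}> .")
        triples.append(f'{subject} <{RDFS_LABEL}> "{label}" .')
        triples.append(f"{subject} <{MODEL}foundIn> <{doc}> .")
    return triples


for triple in annotations_to_ntriples(SAMPLE):
    print(triple)
```

Because every annotation becomes typed RDF linked to its source report, the same query can then range over extracted text and structured datasets that use the same model.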

STAR/STELLAR Projects also relevant
Information Extraction from archaeology grey literature (AHRC)

• Archaeology domain specific, but investigating generalisation to
cultural heritage more generally,
e.g. classical art history domain (with OUCS)

• STELLAR (AHRC) investigates generalising the data mapping tool
and producing linked data (with ADS)

http://hypermedia.research.glam.ac.uk/kos/star/
http://hypermedia.research.glam.ac.uk/kos/stellar

Grey Literature Information Extraction
(Andreas Vlachidis)

• Looking to extract CRM-EH period, context, find and sample entities
• Aim to cross search with archaeology datasets

CRM-EH Entities and Events (Example)

Contact

EASTER project website

http://www.ukoln.ac.uk/projects/easter/

Project publications
http://www.ukoln.ac.uk/projects/easter/dissemination/

dstudhope@glam.ac.uk
