Bibliographic metadata (including citation)
Tuesday 7th July 2009
AMG 2nd workshop, University of Leicester, Leicester
Alexey Strelnikov, Research Officer, UKOLN
Contributions from Emma Tonkin
www.bath.ac.uk
UKOLN is supported by:
Agenda
Introduction
What and why
Use cases
Key points
Issues
Recommendations
Introduction
Metadata extraction is the process of describing extrinsic and intrinsic qualities of a resource
Bibliographic metadata
Bibliographic metadata extraction is a particular case of metadata extraction.
For example:
Title
Authors
Emails
Citations
What and why
General metadata extraction – tends to involve machine learning
Citation and reference analysis – usually involves regular expressions
Might involve visual structure analysis and text mining
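As a minimal sketch of the regular-expression approach, the Python below pulls e-mail addresses out of text and splits one reference line into parts. The function names, the simplified e-mail pattern and the assumed "[n] Authors (year). Title." layout are illustrative assumptions, not taken from any of the tools discussed in this talk; real reference lists vary far more than this.

```python
import re

# Simplified e-mail pattern (an assumption); real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed reference layout: "[n] Authors (year). Title."
REFERENCE_RE = re.compile(
    r"^\[(?P<number>\d+)\]\s+(?P<authors>.+?)\s+"
    r"\((?P<year>\d{4})\)\.\s+(?P<title>.+?)\.?$"
)


def extract_emails(text):
    """Return e-mail addresses in order of first appearance, without duplicates."""
    seen = []
    for address in EMAIL_RE.findall(text):
        if address not in seen:
            seen.append(address)
    return seen


def parse_reference(line):
    """Return a dict of reference parts, or None if the line does not match."""
    match = REFERENCE_RE.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample input, for illustration only.
print(extract_emails("Contact a.b@example.org or c@example.com"))
print(parse_reference("[12] Smith, J. and Jones, K. (2008). Automatic metadata generation."))
```

A line that does not fit the assumed layout returns None, which is the point where a real system would fall back to another pattern or to a trained model.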
What and why (2)
To speed up long, tedious manual operations on metadata:
Generating metadata on document deposit
Revision of metadata
Comparison and aggregation
<Put your own operation here>
What and why (3)
Automatic extraction can make a system more robust (in addition to existing approaches)
It is not a drop-in replacement for manual creation, but semi-automated feature extraction can make for better metadata quality overall
Use case (1)
Dominik is a researcher publishing his new paper
Instead of fully manual deposit (typing in all values) he makes use of system suggestions, which make the process faster and simpler
Use case (2)
Fiona is a researcher assessing the impact of her paper
How many citations of my work are there?
Network of citations (existing systems: Google Scholar, citeseer.net...)
Use case (3)
Bob is a repository manager checking for inconsistencies in the repository's metadata
Makes use of system recommendations and the confidence level of each generated value
Easier to find invalid or obsolete metadata values
Use case (4)
Edward is an application profile/standard curator checking inter-repository metadata
Has an application profile, but no feedback on how well it is followed
Consistent errors:
Not filled
Systematically wrong value (might be related to research field, environment)
Comparison & aggregation report
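One way to picture the "not filled" part of such a report is the Python sketch below, which counts how many records leave each required field empty. The function name, the record layout (plain dicts) and the field names are assumptions made for illustration; an application profile would define the real field list.

```python
from collections import Counter


def unfilled_field_report(records, required_fields):
    """Count, per required field, how many records leave it empty or missing.

    Returns {field: (number_unfilled, total_records)}.
    """
    missing = Counter()
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing[field] += 1
    total = len(records)
    return {field: (missing[field], total) for field in required_fields}


# Made-up records harvested from hypothetical repositories.
records = [
    {"title": "Paper A", "creator": "Author One", "date": ""},
    {"title": "Paper B", "creator": "", "date": "2009"},
    {"title": "Paper C", "date": "   "},
]
for field, (unfilled, total) in unfilled_field_report(records, ["title", "creator", "date"]).items():
    print("%s: %d of %d records unfilled" % (field, unfilled, total))
```

Running the same count per repository, rather than over the pooled records, would show whether an error is systematic to one environment.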
Summary for use cases
All approaches have a manual analogue
Automated metadata extraction would be an improvement, but not a replacement
The service is invisible; it just makes suggestions: for example – 'the metadata field “title” should be “Some name”'
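A suggestion of this kind could be produced by comparing stored values with extracted ones, as in the Python sketch below. The function name and the case-insensitive comparison are illustrative assumptions; nothing is overwritten, the human depositor or manager still decides.

```python
def suggest(stored, extracted):
    """Return suggestion strings for fields whose extracted value differs.

    Both arguments are dicts mapping field name to a string value.
    The comparison ignores case and surrounding whitespace (an assumption).
    """
    suggestions = []
    for field, new_value in extracted.items():
        old_value = stored.get(field, "")
        if new_value and new_value.strip().lower() != old_value.strip().lower():
            suggestions.append(
                'the metadata field "%s" should be "%s"' % (field, new_value)
            )
    return suggestions


# Made-up example: the stored title contains a typing error.
print(suggest({"title": "Some nmae"}, {"title": "Some name"}))
```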
Key points
Standards – those involved in the workflow have a big impact
“The nice thing about standards is that there are so many of them to choose from” Andrew S. Tanenbaum
Tools – existing applications to extract metadata
Standards
Should consider a number of standards for representation, format, as well as languages and locales
Document encoding
Metadata encoding
Locale specifics
Citation formats
Document encoding
Important because this may impact correct reading of a resource
Document formats:
PDF, Doc, PPT, etc.
Font encoding:
UTF, locale-specific
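To illustrate why encoding affects correct reading, the Python sketch below tries UTF-8 first and then falls back to locale-specific encodings. The function name and the choice of fallback encodings are assumptions for illustration. A successful decode does not prove the guess is right, so the encoding used is returned alongside the text.

```python
def decode_text(raw, fallback_encodings=("cp1251", "latin-1")):
    """Decode bytes, trying UTF-8 first and then locale-specific encodings.

    Returns (text, encoding_used).
    """
    for encoding in ("utf-8",) + tuple(fallback_encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# The same word stored in two encodings is read back correctly in both cases.
print(decode_text("Заголовок".encode("utf-8")))
print(decode_text("Заголовок".encode("cp1251")))
```

The order of the fallbacks matters: latin-1 accepts any byte sequence, so it belongs last, and the list would need to change for each locale the repository serves.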
Metadata encoding
This has a direct impact on the result's usability in a given context
Examples of metadata standards:
OAI-DC
SWAP
LOM
OAI-ORE
MARC
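As a small example of one of these encodings, the Python sketch below serialises extracted values as an OAI-DC record using only the standard library. The function name and the input layout (element name mapped to a list of values) are assumptions; the sketch also omits the xsi:schemaLocation attribute that a full OAI-PMH response would carry.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_oai_dc(record):
    """Serialise {dc_element_name: [values]} as an oai_dc:dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, element))
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up record, for illustration only.
print(to_oai_dc({"title": ["Some name"], "creator": ["Author One", "Author Two"]}))
```

Keeping extraction results in a neutral structure and serialising at the end makes it easier to target SWAP, LOM or MARC with a different output step.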
Locale specifics
Country and culture specific formats of text elements
For example:
Right-to-left languages
Date format:
dd/mm/yyyy
mm/dd/yyyy
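The date ambiguity can be made concrete with the Python sketch below, which returns every date a string could mean under the two formats. The function name and the 'dmy'/'mdy' hint values are illustrative assumptions.

```python
from datetime import datetime


def parse_date(text, locale_hint=None):
    """Interpret an nn/nn/yyyy date string.

    locale_hint is 'dmy', 'mdy' or None. Returns a list of possible dates;
    more than one entry means the string is ambiguous without a locale.
    """
    formats = {"dmy": "%d/%m/%Y", "mdy": "%m/%d/%Y"}
    candidates = [locale_hint] if locale_hint else ["dmy", "mdy"]
    results = []
    for key in candidates:
        try:
            parsed = datetime.strptime(text, formats[key]).date()
        except ValueError:
            continue
        if parsed not in results:
            results.append(parsed)
    return results


print(parse_date("13/07/2009"))         # only valid as dd/mm/yyyy
print(parse_date("03/04/2009"))         # two readings: 3 April or 4 March
print(parse_date("03/04/2009", "mdy"))  # a locale hint resolves it
```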
Citation and reference formats
Many citation/reference formats exist; most research fields have their own standards
For example:
APA – social sciences
MLA – literature and the arts
AMA – medicine and biological sciences
Turabian – multi-field
Chicago standard – publications
Harvard, Numerical, MHRA - multi-field
Tools
Automated metadata extraction is a workflow that involves several interconnected software systems
Tools help to overcome the heterogeneity of standards
Examples of Tools
Examples of existing tools:
DC-dot (variety of doc/web formats -> DC metadata)
DepositPlait (var. format metadata -> metadata repository)
DataFountains (var. format->metadata)
paperBase (prototype concentrating on eprint documents)
Issues
Full-text resource availability
Readability of the text
Legal issues
Engineering constraints - machine suggestions might be imperfect
Language & localization - the system must be retrained for each new locale
Recommendations
A robust system that is easy to retrain, with customizable input and output plug-ins
A potential gain:
Simplify (re)extraction of metadata, faster repository operations, validation
Make use of the confidence level assigned to each metadata field
A potential gain:
Identifying possibly incorrect metadata records
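A minimal sketch of how a confidence level might be used to find possibly incorrect records is given below in Python. The function name, the data layout (field mapped to a value and a confidence between 0 and 1) and the 0.8 threshold are all assumptions for illustration; a real system would tune the threshold against its own error rates.

```python
def flag_suspect_fields(record, extracted, threshold=0.8):
    """Flag stored values that a confident extraction disagrees with.

    record maps field -> stored value.
    extracted maps field -> (extracted value, confidence in [0, 1]).
    Returns (field, stored, extracted, confidence) tuples, most confident first.
    """
    flagged = []
    for field, (value, confidence) in extracted.items():
        if confidence >= threshold and record.get(field) != value:
            flagged.append((field, record.get(field), value, confidence))
    return sorted(flagged, key=lambda item: item[3], reverse=True)


# Made-up record and extraction results.
stored = {"title": "Some nmae", "creator": "Author One", "date": "2008"}
extracted = {
    "title": ("Some name", 0.95),     # confident disagreement: flagged
    "creator": ("Author One", 0.90),  # agrees with the stored value
    "date": ("2009", 0.40),           # disagrees, but confidence is too low
}
for item in flag_suspect_fields(stored, extracted):
    print(item)
```

Low-confidence disagreements are deliberately left out, since machine suggestions can be imperfect and a long list of weak flags would cost a repository manager more time than it saves.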
Recommendations (2)
Make the full-text document available to the system
A potential gain:
Periodic re-examination of the resource and updating of the metadata
A talk given at the automatic metadata extraction workshop run by Intrallect and Jisc. This particular talk is about bibliographic metadata extraction in the context of automated extraction.