Practical use as part of a repository deposit workflow
A user study comparing manual and hybrid input
Towards metadata triage
Whatever can go wrong... (II)
Function following form – interface
Model adapted to suit unique user needs
Data model incompletely supported
Input validation issues
Systematic error; typos; localisation; encoding; etc.
Lots of past work in characterising manual input errors
Challenges in metadata extraction
Manual metadata generation
Metadata extraction in brief
Practical use as part of a repository deposit workflow
A user study comparing manual and hybrid input
Image segmentation, templating & OCR
Working from text
There are a number of possible states (e.g. title, author, email, affiliation, abstract)
Directed graph with probabilities
Markov chain over states, for example: Title, Author, Email, Affil.
Hidden Markov Model
We cannot directly see these states – only the words
But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
This may be expressed in terms of an HMM
Bayesian statistics used across term appearance
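The idea above can be illustrated with a minimal Viterbi sketch over two hidden states. The states, probabilities and the `emission` heuristic are hypothetical values chosen for illustration, not the parameters of the system described in these slides; a real extractor would estimate them from a labelled collection.

```python
import math

STATES = ["title", "author"]

# Hypothetical, hand-picked probabilities (assumptions, not from the slides).
START = {"title": 0.9, "author": 0.1}
TRANS = {
    "title": {"title": 0.8, "author": 0.2},
    "author": {"title": 0.05, "author": 0.95},
}

def emission(state, token):
    """Toy emission model: upper-case tokens are assumed to look like names."""
    looks_like_name = token.isupper()
    if state == "author":
        return 0.9 if looks_like_name else 0.1
    return 0.1 if looks_like_name else 0.9

def viterbi(tokens):
    """Return the most probable hidden state sequence, working in log space."""
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []
    for token in tokens[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this token.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, token)))
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

tokens = ("Confirmation-Guided Discovery of First-Order Rules "
          "PETER A. FLACH NICOLAS LACHICHE").split()
labels = viterbi(tokens)
```

With these toy numbers the first five tokens are labelled `title` and the remaining five `author`; only the words are observed, and the labels are inferred.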
Example parse
Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection
Aims
Adaptation of existing interfaces
Enhancing rather than rewriting
Cross-platform, accessible interface
Simple reusable REST API, metadata as DC/XML
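As a rough sketch of what consuming "metadata as DC/XML" might look like on the client side, the snippet below parses a Dublin Core fragment into a dictionary. The `SAMPLE_RESPONSE` body and its `metadata` wrapper element are assumptions made for illustration; the slides do not specify the actual response format of the service. Only the Dublin Core element namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical response body (assumption: the wrapper element is invented).
SAMPLE_RESPONSE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Confirmation-Guided Discovery of First-Order Rules</dc:title>
  <dc:creator>Peter A. Flach</dc:creator>
  <dc:creator>Nicolas Lachiche</dc:creator>
</metadata>"""

def parse_dc(xml_text):
    """Collect Dublin Core elements into a dict of lists keyed by local name."""
    record = {}
    for element in ET.fromstring(xml_text):
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            record.setdefault(name, []).append((element.text or "").strip())
    return record

record = parse_dc(SAMPLE_RESPONSE)
```

Keeping values as lists allows repeated elements such as multiple `dc:creator` entries, which a deposit form can then present for correction.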
Sample interfaces
Architecture
Using what we know...
Question:
“Do people accept ‘hybrid’ interfaces?”
Here’s one we did earlier…
Hypotheses
Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
Measured: speed and accuracy of manual entry versus hybrid entry and, qualitatively, user satisfaction
Results: Timing
Hybrid faster under both conditions
(Summary of median times)
Results: Accuracy
Tested against ground-truth
Keyword accuracy: First keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords.
Manual metadata accuracy:
Few users use cut and paste
Capitalisation, punctuation frequently differs
Synonyms are accidentally substituted
Hybrid closer to ground-truth, and more complete, but results not clear-cut.
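Because capitalisation and punctuation frequently differ, a strict string comparison against ground truth can overstate the error rate. The sketch below shows one possible way to score entries after normalisation; the `normalise` and `similarity` helpers are hypothetical and are not the scoring method used in the study.

```python
import string
from difflib import SequenceMatcher

def normalise(value):
    """Lower-case, strip ASCII punctuation and collapse whitespace."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def similarity(entered, truth):
    """Similarity in [0, 1] between the normalised forms of two strings."""
    return SequenceMatcher(None, normalise(entered), normalise(truth)).ratio()

truth = "Confirmation-Guided Discovery of First-Order Rules"
entered = "confirmation-guided discovery of first-order rules."

exact = entered == truth            # strict comparison fails
score = similarity(entered, truth)  # normalised comparison treats them as equal
```

A character-level measure like this tolerates case and punctuation differences but does not recognise substituted synonyms, which would need a separate check.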
Qualitative results
Most users preferred the hybrid mode
Most perceived it to be faster than manual data entry
Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between hybrid and manual approach
Both were of good quality
Discussion
Results support hypotheses
People prefer the hybrid interface, and found it more satisfying to use
Accessibility issues exist, but can be overcome
The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!
MetRe prototype (2008)
Characteristic classes of individual/systematic error highlighted
NB: local and general best practice. Uses: ranking, browsing, correcting systematic error
Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
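One simple heuristic in this spirit is to rank field values by frequency across harvested records and flag rare values that closely resemble a much more common one. The sketch below is an illustrative assumption, not the MetRe implementation; the `suspect_variants` function, the 0.85 threshold and the sample data are all made up.

```python
from collections import Counter
from difflib import SequenceMatcher

def suspect_variants(values, min_ratio=0.85):
    """Pair each rare value with a more frequent, near-identical value."""
    counts = Counter(values)
    ranked = [value for value, _ in counts.most_common()]
    suspects = []
    for i, rare in enumerate(ranked):
        # Only compare against values ranked above this one.
        for common in ranked[:i]:
            ratio = SequenceMatcher(None, rare.lower(), common.lower()).ratio()
            if ratio >= min_ratio and counts[common] > counts[rare]:
                suspects.append((rare, common))
                break
    return suspects

# Hypothetical harvested publisher values, including one typo.
publishers = (["University of Bath"] * 5
              + ["Univeristy of Bath", "Springer", "Springer"])
suspects = suspect_variants(publishers)
```

Here the misspelt value is paired with its frequent counterpart as a candidate correction, while the unrelated value is left alone. A person would still need to confirm each suggestion, since a rare value may be legitimately distinct.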
Issues
Discipline/domain-specific issues
Lots of information required to do this right (see metadata schema/terminology registry)
Some APs present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)
Approach
Generally dependent on heuristics over available data
Powered by very specific functions (classifiers, validation, etc…)
Potentially expensive, not always domain-independent
Future work
More!
Data
Filters (input/output formats)
Methods
Evaluation
Service availability (mail me for announcements!)
Conclusion
Metadata creation can be supported through software
Presentation given at the Text Mining for Scholarly Communications and Repositories Joint Workshop, 28-29 Oct 2009 (http://www.nactem.ac.uk/tm-ukoln.php)