Approaches to automated metadata extraction : FixRep Project

    Approaches to automated metadata extraction : FixRep Project - Presentation Transcript

    1. Approaches to automated metadata extraction : FixRep Project. Emma Tonkin, UKOLN, [email_address], www.bath.ac.uk
    2. Wouldn't it be nice if...
      • ...computers could author our metadata for us, thus saving a lot of hassle?
      • Mechanical metadata extraction vs manual metadata input
    3. But...
      • Automated tools are fallible
      • There's never quite enough information available
      • Templates change, different domains have different standards
      • In short, computers are often wrong
        • and so are people
      • Hybrid approach:
        • Get what metadata you can
        • Ask the user to check and clean it if necessary
      • Philosophy:
        • If the computer gets it wrong, we can fix it later
      The 'half a loaf' hypothesis
    4. Wouldn’t it be nice if…
      • … computers could fix our metadata for us?
      • Or, more realistically, help us do this work for ourselves.
      • All about ‘fixing it later’, doing what we can with what we have
      • Automated metadata extraction + metadata consistency assessment
      • Metadata generation, evaluation, characterisation: enabling metadata triage
      • Challenges in automated metadata extraction
      • Manual metadata generation
      • Metadata extraction in brief
      • Practical use as part of a repository deposit workflow
      • A user study comparing manual and hybrid input
      • Towards metadata triage
    5. Whatever can go wrong...
      • PDFs can be:
        • Encrypted
        • Corrupted
        • Oddly encoded
        • An image file without embedded text
      • Occurrence: ~3-6%
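
A minimal sketch of how such problem files might be flagged before extraction is attempted. The function name `triage_pdf_bytes` and the returned labels are hypothetical, and the checks are crude byte-level heuristics rather than the project's actual method; image-only PDFs in particular cannot be detected this way.

```python
def triage_pdf_bytes(data: bytes) -> str:
    """Crude pre-extraction check on raw PDF bytes (heuristic only)."""
    # A PDF normally carries its "%PDF-" header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return "not_pdf"
    # A complete file normally ends with an "%%EOF" marker.
    if b"%%EOF" not in data[-1024:]:
        return "possibly_corrupted"
    # Encrypted files reference an /Encrypt dictionary in the trailer.
    if b"/Encrypt" in data:
        return "encrypted"
    return "ok"
```

Files that do not come back as "ok" could be routed straight to manual entry instead of being passed to the extractor.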
    6. Character sets
      • Ligatures, accents and symbols may not always be extractable from PDFs
      • Image © Daniel Ullrich
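
Where ligatures and accents do survive extraction, they can be folded into plain characters before matching. The sketch below uses Python's standard `unicodedata` module; the helper names are hypothetical, and whether accents should be stripped at all depends on the collection.

```python
import unicodedata

def normalise_extracted_text(text: str) -> str:
    # NFKC folds compatibility characters, such as the "fi" ligature
    # (U+FB01), into their plain-letter equivalents.
    return unicodedata.normalize("NFKC", text)

def strip_accents(text: str) -> str:
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Stripping accents is lossy, so it is probably best applied only to a comparison key, not to the stored metadata.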
    7. Document formats/layouts
      • Many possible formats
      • Some formats not widely supported
      • Document layouts vary widely, esp. by discipline
    8. Whatever can go wrong... (II)
      • Function following form – interface
      • Model adapted to suit unique user needs
      • Data model incompletely supported
      • Input validation issues
      • Systematic error; typos; localisation; encoding; etc.
      • Lots of past work in characterising manual input errors
    9. Image segmentation, templating & OCR
    10. Working from text
      • There are a number of possible states (i.e. title, author, email, affiliation, abstract)
      • Directed graph with probabilities
        • Markov chain: for example, over the states Title, Author, Email, Affil. (diagram not captured)
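
A directed graph with probabilities can be sketched as a nested dictionary. The transition values below are invented for illustration; real values would be estimated from labelled documents, and the `start` state is an assumption added so that the first field also has a probability.

```python
# Hypothetical transition probabilities between document-header states.
TRANSITIONS = {
    "start": {"title": 0.9, "author": 0.1},
    "title": {"title": 0.6, "author": 0.4},
    "author": {"author": 0.5, "email": 0.2, "affiliation": 0.3},
    "email": {"affiliation": 0.6, "abstract": 0.4},
    "affiliation": {"affiliation": 0.5, "abstract": 0.5},
    "abstract": {"abstract": 1.0},
}

def sequence_probability(states):
    """Probability of a state sequence under the Markov chain above."""
    prob = 1.0
    previous = "start"
    for state in states:
        # A missing edge means the transition is treated as impossible.
        prob *= TRANSITIONS.get(previous, {}).get(state, 0.0)
        previous = state
    return prob
```

For instance, `sequence_probability(["title", "author", "email"])` multiplies 0.9, 0.4 and 0.2, while a sequence that starts with an abstract scores zero under these made-up values.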
    11. Hidden Markov Model
      • We cannot directly see these states – only the words
      • But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
      • This may be expressed in terms of an HMM
      • Bayesian statistics used across term appearance
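
The usual way to recover the most likely hidden states from the observed words is Viterbi decoding. The sketch below is a textbook version and not necessarily what the project implemented. The token features (`word`, `caps`, `at`) and all probabilities are invented for illustration.

```python
def featurise(token):
    # Reduce each token to a coarse observation symbol (an assumption).
    if "@" in token:
        return "at"
    return "caps" if token.isupper() else "word"

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0)
                 * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

STATES = ["title", "author", "email"]
START_P = {"title": 0.8, "author": 0.15, "email": 0.05}
TRANS_P = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"author": 0.6, "email": 0.4},
    "email": {"email": 1.0},
}
EMIT_P = {
    "title": {"word": 0.8, "caps": 0.2},
    "author": {"word": 0.3, "caps": 0.7},
    "email": {"at": 0.9, "word": 0.1},
}

tokens = ["Discovery", "of", "Rules", "PETER", "FLACH"]
labels = viterbi([featurise(t) for t in tokens],
                 STATES, START_P, TRANS_P, EMIT_P)
```

With these made-up numbers the first three tokens are labelled as title and the two capitalised names as author. A production version would work in log space to avoid underflow on longer inputs.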
    12. Example parse
      • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
      • Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection
    13. Aims
      • Adaptation of existing interfaces
      • Enhancing rather than rewriting
      • Cross-platform, accessible interface
      • Simple reusable REST API, metadata as DC/XML
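
As an illustration of "metadata as DC/XML", the sketch below serialises a record using the Dublin Core element namespace. The `metadata` wrapper element and the `to_dc_xml` helper are assumptions; the slides do not show the actual response format of the API.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def to_dc_xml(record: dict) -> str:
    """Serialise {field: [values]} as simple Dublin Core XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, values in record.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            element.text = value
    return ET.tostring(root, encoding="unicode")

record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["Peter A. Flach", "Nicolas Lachiche"],
}
xml_text = to_dc_xml(record)
```

Repeating an element per value, as done here for `creator`, keeps each author separately addressable for the user to check and correct.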
    14. Sample interfaces
    15. Sample interfaces
    16. Architecture
    17. Using what we know...
    18. Question:
      • “Do people accept ‘hybrid’ interfaces?”
      • Here’s one we did earlier…
    19. Hypotheses
      • Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
      • The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
      • User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
      • Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction
    20. Results: Timing
      • Hybrid faster under both conditions
      • (Summary of median times)
    21. Results: Accuracy
      • Tested against ground-truth
      • Keyword accuracy: First keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords.
      • Manual metadata accuracy:
        • Few users use cut and paste
        • Capitalisation and punctuation frequently differ
        • Synonyms are accidentally substituted
      • Hybrid closer to ground-truth, and more complete, but results not clear-cut.
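
Because capitalisation and punctuation differ so often, a comparison against ground truth probably needs to normalise both sides first. The sketch below shows one possible way to do that, plus a simple top-k keyword coverage measure. The function names and the exact definitions are assumptions, not the measures used in the study.

```python
import string

def normalise(value: str) -> str:
    # Drop ASCII punctuation, lower-case, and collapse whitespace.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(value.translate(table).lower().split())

def matches_ground_truth(entered: str, truth: str) -> bool:
    return normalise(entered) == normalise(truth)

def top_k_coverage(suggested, desired, k):
    """Fraction of desired keywords found among the first k suggestions."""
    if not desired:
        return 0.0
    top = {normalise(s) for s in suggested[:k]}
    wanted = {normalise(d) for d in desired}
    return len(top & wanted) / len(wanted)
```

Exact matching after normalisation will still count accidental synonyms as errors, so those would need separate handling.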
    22. Qualitative results
      • Most users preferred the hybrid mode
      • Most perceived it to be faster than manual data entry
      • Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches
      • Both were of good quality
    23. Discussion
      • Results support hypotheses
      • People preferred the hybrid interface, and found it more satisfying to use
      • Accessibility issues exist, but can be overcome
      • The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!
    24. MetRe prototype (2008)
      • Characteristic classes of individual/systematic error highlighted
      • Nb. local and general best practice. Uses: ranking, browsing, correcting systematic error
      • Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
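
One way to surface systematic error from harvested metadata is to group values that differ only in case or punctuation and rank the variants by frequency. The sketch below illustrates that general idea only; it is not the MetRe implementation, and `rank_variants` is a hypothetical name.

```python
from collections import Counter, defaultdict

def rank_variants(values):
    """Group near-identical field values and rank each group's variants."""
    groups = defaultdict(Counter)
    for value in values:
        # Comparison key: lower-cased, with full stops and commas removed.
        cleaned = value.lower().replace(".", " ").replace(",", " ")
        groups[" ".join(cleaned.split())][value] += 1
    # Only groups with more than one spelling point to inconsistency.
    return {key: counts.most_common()
            for key, counts in groups.items() if len(counts) > 1}

harvested = ["University of Bath", "university of bath",
             "University of Bath", "Univ. of Bath", "UKOLN"]
report = rank_variants(harvested)
```

The most frequent spelling in each group is a plausible candidate for the local best-practice form, with the rarer variants queued for correction.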
    27. Issues
      • Discipline/domain-specific issues
      • Lots of information required to do this right (see metadata schema/terminology registry)
      • Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)
    28. Approach
      • Generally dependent on heuristics over available data
      • Powered by very specific functions (classifiers, validation, etc…)
      • Potentially expensive, not always domain-independent
    29. Future work
      • More!
        • Data
        • Filters (input/output formats)
        • Methods
        • Evaluation
        • Service availability (mail me for announcements!)
    30. Conclusion
      • Metadata creation can be supported through software
      • Specific problem sets in metadata triage
      • Work continues in the FixRep project
    31. Conclusion (II)
      • Formal Metadata Extraction/evaluation
      • Metadata review process
      • Accessibility metadata
      • Entity extraction (named entities, geographical, temporal [k-int!])
      • Repository integration
      • Thanks!
      • Comments/Questions?
      • www.ukoln.ac.uk/projects/fixrep

    UKOLN (dev), University of Bath


    Presentation given at the Text Mining for Scholarly…


    © All Rights Reserved
