Approaches to automated metadata extraction : FixRep Project

Presentation given at the Text Mining for Scholarly Communications and Repositories
Joint Workshop, 28-29 Oct 2009 (http://www.nactem.ac.uk/tm-ukoln.php)

Published in: Education, Technology

  1. 1. UKOLN is supported by: Approaches to automated metadata extraction : FixRep Project Emma Tonkin [email_address] www.bath.ac.uk
  2. 2. Wouldn't it be nice if... <ul><li>...computers could author our metadata for us, thus saving a lot of hassle? </li></ul><ul><li>Mechanical metadata extraction vs manual metadata input </li></ul>
  3. 3. But... <ul><li>Automated tools are fallible </li></ul><ul><li>There's never quite enough information available </li></ul><ul><li>Templates change, different domains have different standards </li></ul><ul><li>In short, computers are often wrong </li></ul><ul><ul><li>and so are people </li></ul></ul>
  4. 4. <ul><li>Hybrid approach: </li></ul><ul><ul><li>Get what metadata you can </li></ul></ul><ul><ul><li>Ask the user to check and clean it if necessary </li></ul></ul><ul><li>Philosophy: </li></ul><ul><ul><li>If the computer gets it wrong, we can fix it later </li></ul></ul>The 'half a loaf' hypothesis
  5. 5. Wouldn’t it be nice if… <ul><li>… computers could fix our metadata for us? </li></ul><ul><li>Or, more realistically, help us do this work for ourselves. </li></ul>
  6. 6. <ul><li>All about ‘fixing it later’, doing what we can with what we have </li></ul><ul><li>Automated metadata extraction + metadata consistency assessment </li></ul><ul><li>Metadata generation, evaluation, characterisation: enabling metadata triage </li></ul>
  7. 7. <ul><li>Challenges in automated metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul><ul><li>Towards metadata triage </li></ul>
  8. 8. Whatever can go wrong... <ul><li>PDFs can be: </li></ul><ul><ul><li>Encrypted </li></ul></ul><ul><ul><li>Corrupted </li></ul></ul><ul><ul><li>Oddly encoded </li></ul></ul><ul><ul><li>An image file without embedded text </li></ul></ul><ul><ul><li>Occurrence: ~3-6% </li></ul></ul>
  9. 9. Character sets <ul><li>Ligatures, </li></ul><ul><li>Accents, </li></ul><ul><li>Symbols - may not always be extractable from PDFs </li></ul><ul><li>Image © Daniel Ullrich </li></ul>
  10. 10. Document formats/layouts <ul><li>Many possible formats </li></ul><ul><li>Some formats not widely supported </li></ul><ul><li>Document layouts vary widely, esp. by discipline </li></ul>
  11. 11. <ul><li>Challenges in metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul><ul><li>Towards metadata triage </li></ul>
  12. 12. Whatever can go wrong... (II) <ul><li>Function following form – interface </li></ul><ul><li>Model adapted to suit unique user needs </li></ul><ul><li>Data model incompletely supported </li></ul><ul><li>Input validation issues </li></ul><ul><li>Systematic error; typos; localisation; encoding; etc. </li></ul><ul><li>Lots of past work in characterising manual input errors </li></ul>
  13. 13. <ul><li>Challenges in metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul>
  14. 14. Image segmentation, templating & OCR
  15. 15. Working from text <ul><li>There are a number of possible states (i.e. title, author, email, affiliation, abstract) </li></ul><ul><li>Directed graph with probabilities </li></ul><ul><ul><li>Markov chain: for example, Title → Author → Email → Affiliation </li></ul></ul>
  16. 16. Hidden Markov Model <ul><li>We cannot directly see these states – only the words </li></ul><ul><li>But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented </li></ul><ul><li>This may be expressed in terms of an HMM </li></ul><ul><li>Bayesian statistics used across term appearance </li></ul>
  17. 17. Example parse <ul><li>Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE </li></ul><ul><li>Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE </li></ul><ul><li>... </li></ul><ul><li>Confirmation-Guided Discovery of First-Order Rules , PETER A. FLACH, NICOLAS LACHICHE </li></ul><ul><li>Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection </li></ul>
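
The segmentation idea on slides 15 to 17 can be sketched as Viterbi decoding over a small HMM. This is a minimal illustration only: the state names come from the slides (the abstract state is left out for brevity), but every probability, the coarse `feature` function, and all identifiers are invented for the example and are not the FixRep model or its training data.

```python
import math

STATES = ["title", "author", "email", "affiliation"]

# Hypothetical probabilities, chosen by hand for illustration only.
START = {"title": 0.9, "author": 0.1, "email": 0.0, "affiliation": 0.0}
TRANS = {
    "title": {"title": 0.7, "author": 0.3, "email": 0.0, "affiliation": 0.0},
    "author": {"title": 0.0, "author": 0.6, "email": 0.2, "affiliation": 0.2},
    "email": {"title": 0.0, "author": 0.1, "email": 0.4, "affiliation": 0.5},
    "affiliation": {"title": 0.0, "author": 0.1, "email": 0.2, "affiliation": 0.7},
}
# Emissions are defined over coarse token features rather than raw words.
EMIT = {
    "title": {"word": 0.8, "caps": 0.15, "at": 0.0, "other": 0.05},
    "author": {"word": 0.1, "caps": 0.8, "at": 0.0, "other": 0.1},
    "email": {"word": 0.05, "caps": 0.0, "at": 0.9, "other": 0.05},
    "affiliation": {"word": 0.6, "caps": 0.2, "at": 0.0, "other": 0.2},
}


def feature(token):
    """Map a token to a coarse observable feature (an assumed scheme)."""
    if "@" in token:
        return "at"
    stripped = token.strip(".,")
    if stripped.isalpha() and stripped.isupper():
        return "caps"
    if stripped.replace("-", "").isalpha():
        return "word"
    return "other"


def log(p):
    """Log-probability that tolerates zero entries."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens):
    """Return the most likely hidden state for each token."""
    obs = [feature(t) for t in tokens]
    scores = [{s: log(START[s]) + log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this position.
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            row[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][o])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))


tokens = ("Confirmation-Guided Discovery of First-Order Rules, "
          "PETER A. FLACH, NICOLAS LACHICHE").split()
labels = viterbi(tokens)
```

With these hand-picked numbers the first five tokens are labelled as title and the remaining five as author. In a real system the tables would be estimated from labelled documents, which is where the growing knowledge base mentioned on slide 17 would come in.
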
  18. 18. <ul><li>Challenges in metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul><ul><li>Towards metadata triage </li></ul>
  19. 19. Aims <ul><li>Adaptation of existing interfaces </li></ul><ul><li>Enhancing rather than rewriting </li></ul><ul><li>Cross-platform, accessible interface </li></ul><ul><li>Simple reusable REST API, metadata as DC/XML </li></ul>
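
As an illustration of the "metadata as DC/XML" aim, the sketch below serialises an extracted record as Dublin Core elements using Python's standard library. The Dublin Core element namespace is the standard one; the `metadata` wrapper element, the `to_dc_xml` helper, and the record layout are assumptions made for this example, not the format returned by the FixRep service.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record):
    """Serialise {dc_element_name: [values]} as a simple DC/XML string."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, values in record.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Example record, using the paper from the "Example parse" slide.
record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["Peter A. Flach", "Nicolas Lachiche"],
}
xml_text = to_dc_xml(record)
```

A deposit interface could pre-fill its form fields from a response of this shape and let the user correct them, which is the hybrid workflow the slides describe.
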
  20. 20. Sample interfaces
  21. 21. Sample interfaces
  22. 22. Architecture
  23. 23. Using what we know...
  24. 24. <ul><li>Challenges in metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul><ul><li>Towards metadata triage </li></ul>
  25. 25. Question: <ul><li>“ Do people accept ‘hybrid’ interfaces?” </li></ul><ul><li>Here’s one we did earlier… </li></ul>
  26. 26. Hypotheses <ul><li>Correcting extracted metadata is faster than entering or cutting-and-pasting metadata. </li></ul><ul><li>The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct. </li></ul><ul><li>User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails. </li></ul><ul><li>Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction </li></ul>
  27. 27. Results: Timing <ul><li>Hybrid faster under both conditions </li></ul><ul><li>(Summary of median times) </li></ul>
  28. 28. Results: Accuracy <ul><li>Tested against ground-truth </li></ul><ul><li>Keyword accuracy: First keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords. </li></ul><ul><li>Manual metadata accuracy: </li></ul><ul><ul><li>Few users use cut and paste </li></ul></ul><ul><ul><li>Capitalisation and punctuation frequently differ </li></ul></ul><ul><ul><li>Synonyms are accidentally substituted </li></ul></ul><ul><li>Hybrid closer to ground-truth, and more complete, but results not clear-cut. </li></ul>
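
Keyword figures of this kind can be computed against a ground-truth set roughly as follows. This is a sketch under assumptions: the slide does not define its measures precisely, so `hit_rate_at_k` (share of documents with at least one relevant keyword in the first k suggestions) and `coverage_at_k` (share of all desired keywords found in the first k suggestions) are one plausible reading, and the toy data is invented.

```python
def hit_rate_at_k(suggested, truth, k):
    """Fraction of documents with a relevant keyword among the first k suggestions."""
    hits = sum(1 for s, t in zip(suggested, truth) if set(s[:k]) & set(t))
    return hits / len(truth)


def coverage_at_k(suggested, truth, k):
    """Fraction of all ground-truth keywords found among the first k suggestions."""
    found = sum(len(set(s[:k]) & set(t)) for s, t in zip(suggested, truth))
    total = sum(len(set(t)) for t in truth)
    return found / total


# Invented toy data: ranked suggestions and desired keywords per document.
suggested = [["hmm", "pdf", "ocr"], ["metadata", "xml", "rest"]]
truth = [["pdf", "ocr"], ["metadata", "dublin core"]]

first_keyword_rate = hit_rate_at_k(suggested, truth, 1)
top_two_rate = hit_rate_at_k(suggested, truth, 2)
top_three_coverage = coverage_at_k(suggested, truth, 3)
```
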
  29. 29. Qualitative results <ul><li>Most users preferred the hybrid mode </li></ul><ul><li>Most perceived it to be faster than manual data entry </li></ul><ul><li>Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches </li></ul><ul><li>Both were of good quality </li></ul>
  30. 30. Discussion <ul><li>Results support hypotheses </li></ul><ul><li>People prefer the hybrid interface, and found it more satisfying to use </li></ul><ul><li>Accessibility issues exist, but can be overcome </li></ul><ul><li>The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty – i.e. it did more than the subject wanted! </li></ul>
  31. 31. <ul><li>Challenges in metadata extraction </li></ul><ul><li>Manual metadata generation </li></ul><ul><li>Metadata extraction in brief </li></ul><ul><li>Practical use as part of a repository deposit workflow </li></ul><ul><li>A user study comparing manual and hybrid input </li></ul><ul><li>Towards metadata triage </li></ul>
  32. 32. MetRe prototype (2008) <ul><li>Characteristic classes of individual/systematic error highlighted </li></ul><ul><li>N.B. local and general best practice. Uses: ranking, browsing, correcting systematic error </li></ul><ul><li>Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences </li></ul>
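
Ranking occurrences and co-occurrences over harvested records can be sketched with simple counting. This is an illustration only: the record layout, the `subject` field, and both helper functions are hypothetical and do not describe how the MetRe prototype is implemented.

```python
from collections import Counter
from itertools import combinations


def rank_values(records, field):
    """Rank the values of one field by how often they occur."""
    counts = Counter(v for r in records for v in r.get(field, []))
    return counts.most_common()


def rank_cooccurrences(records, field):
    """Rank pairs of values that appear together in the same record."""
    pairs = Counter()
    for r in records:
        for a, b in combinations(sorted(set(r.get(field, []))), 2):
            pairs[(a, b)] += 1
    return pairs.most_common()


# Invented harvested records; the last one contains a stray variant spelling.
records = [
    {"subject": ["metadata", "repositories"]},
    {"subject": ["repositories", "metadata"]},
    {"subject": ["metadata", "Metadata "]},
]
value_ranking = rank_values(records, "subject")
pair_ranking = rank_cooccurrences(records, "subject")
```

Rare values that sit close to a frequent one, such as `"Metadata "` next to `"metadata"` here, are the kind of candidate systematic error a triage view could surface for review.
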
  34. 35. Issues <ul><li>Discipline/domain-specific issues </li></ul><ul><li>Lots of information required to do this right (see metadata schema/terminology registry) </li></ul><ul><li>Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’) </li></ul>
  35. 36. Approach <ul><li>Generally dependent on heuristics over available data </li></ul><ul><li>Powered by very specific functions (classifiers, validation, etc…) </li></ul><ul><li>Potentially expensive, not always domain-independent </li></ul>
  36. 37. Future work <ul><li>More! </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Filters (input/output formats) </li></ul></ul><ul><ul><li>Methods </li></ul></ul><ul><ul><li>Evaluation </li></ul></ul><ul><ul><li>Service availability (mail me for announcements!) </li></ul></ul>
  37. 38. Conclusion <ul><li>Metadata creation can be supported through software </li></ul><ul><li>Specific problem sets in metadata triage </li></ul><ul><li>Work continues in the FixRep project </li></ul>
  38. 39. Conclusion (II) <ul><li>Formal Metadata Extraction/evaluation </li></ul><ul><li>Metadata review process </li></ul><ul><li>Accessibility metadata </li></ul><ul><li>Entity extraction (named entities, geographical, temporal [k-int!]) </li></ul><ul><li>Repository integration </li></ul>
  39. 40. <ul><li>Thanks! </li></ul><ul><li>Comments/Questions? </li></ul><ul><li>www.ukoln.ac.uk/projects/fixrep </li></ul>
