
Bibliographic metadata (including citation)

A talk given at the automatic metadata extraction workshop by Intrallect and Jisc. It covers bibliographic metadata extraction in the context of automated metadata extraction.

Published in: Education, Technology
Bibliographic metadata (including citation)
Tuesday 7th July 2009, AMG 2nd workshop, University of Leicester, Leicester
Alexey Strelnikov, Research Officer, UKOLN
Contributions from Emma Tonkin
www.bath.ac.uk
UKOLN is supported by: [supporter logos]

Agenda
  • Introduction
  • What and why
  • Use cases
  • Key points
  • Issues
  • Recommendations

Introduction
  • Metadata extraction is the process of describing extrinsic and intrinsic qualities of a resource

Bibliographic metadata
  • Bibliographic metadata extraction is a particular case of metadata extraction
  • For example:
      • Title
      • Authors
      • Emails
      • Citations

What and why
  • General metadata extraction tends to involve machine learning
  • Citation and reference analysis usually involves regular expressions
  • May also involve visual structure analysis and text mining
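
The regular-expression approach to citation and reference analysis can be illustrated with a minimal sketch. It assumes references laid out as "Authors (Year). Title. Remainder", which is only one of many layouts; the pattern, the function name `parse_reference` and the sample reference are invented for illustration and do not come from the talk.

```python
import re

# Hypothetical pattern for one APA-like layout: "Authors (Year). Title. Remainder"
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+(?P<title>.+?)\.\s"
)


def parse_reference(line):
    """Return authors, year and title, or None when the layout does not match."""
    match = REFERENCE.match(line)
    if match is None:
        return None
    return {
        "authors": match.group("authors"),
        "year": int(match.group("year")),
        "title": match.group("title"),
    }


# Invented sample reference, used only to exercise the pattern
sample = "Doe, J., & Roe, R. (2008). An example paper title. Journal of Examples, 1(2), 3-4."
print(parse_reference(sample))
```

A single pattern like this breaks as soon as the reference style changes, which is one reason a real system would need one pattern per citation format or a trained model.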

What and why (2)
  • To improve long, boring manual operations on metadata:
      • Generating metadata on document deposit
      • Revision of metadata
      • Comparison and aggregation
      • <Put your own operation here>

What and why (3)
  • Automatic extraction can make a system more robust (in addition to existing approaches)
  • It is not a drop-in replacement for manual creation, but semi-automated feature extraction can make for better metadata quality overall

Use case (1)
  • Dominik is a researcher publishing his new paper
  • Instead of a fully manual deposit (typing in all values), he makes use of system suggestions, which make the process faster and simpler

Use case (2)
  • Fiona is a researcher assessing the impact made by her paper
  • How many citations of my work?
  • Network of citations (existing systems: Google Scholar, citeseer.net...)

Use case (3)
  • Bob is a repository manager checking for inconsistency in the repository's metadata
  • He makes use of system recommendations and a confidence level for each generated value
  • This makes it easier to find invalid or obsolete metadata values
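
One way to picture the repository manager's check is the sketch below. It assumes a hypothetical extractor that returns a suggested value and a confidence between 0 and 1 for each field; the `Suggestion` class, the `flag_inconsistent` function, the 0.8 threshold and the sample record are all illustrative assumptions, not part of any system named in the talk.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    field: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical extractor


def flag_inconsistent(record, suggestions, threshold=0.8):
    """Return fields where a confident suggestion disagrees with the stored value."""
    flagged = []
    for s in suggestions:
        stored = record.get(s.field, "")
        differs = stored.strip().lower() != s.value.strip().lower()
        if s.confidence >= threshold and differs:
            flagged.append((s.field, stored, s.value, s.confidence))
    return flagged


# Invented record with a typo in the title; the low-confidence creator
# suggestion is ignored because it falls below the threshold.
record = {"title": "Bibliografic metadata", "creator": "A. Strelnikov"}
suggestions = [
    Suggestion("title", "Bibliographic metadata", 0.93),
    Suggestion("creator", "Alexey Strelnikov", 0.55),
]
for field, stored, suggested, confidence in flag_inconsistent(record, suggestions):
    print(f'the metadata field "{field}" should be "{suggested}" (confidence {confidence:.2f})')
```

The threshold trades reviewer workload against missed errors, so it would need tuning per repository.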

Use case (4)
  • Edward is an application profile/standard curator checking inter-repository metadata
  • He has an application profile, but no feedback on how it is followed
  • Consistent errors:
      • Not filled
      • Systematically wrong value (might be related to research field, environment)
  • Comparison & aggregation report

Summary for use cases
  • All approaches have a manual analogue
  • Automated metadata extraction would be an improvement, but not a replacement
  • The service is invisible; it just makes suggestions, for example: 'the metadata field “title” should be “Some name”'

Key points
  • Standards: those involved in the workflow make a big impact
      • “The nice thing about standards is that there are so many of them to choose from” Andrew S. Tanenbaum
  • Tools: existing applications to extract metadata

Standards
  • Should consider a number of standards for representation and format, as well as languages and locales
      • Document encoding
      • Metadata encoding
      • Locale specifics
      • Citation formats

Document encoding
  • Important because this may affect whether a resource is read correctly
  • Document formats:
      • PDF, Doc, PPT, etc.
  • Font encoding:
      • UTF, locale specific
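
The effect of text encoding on reading a resource correctly can be shown with a small sketch. It assumes the raw bytes have already been pulled out of the document; the function name `decode_text` and the fallback list (`cp1251`, then `latin-1`) are illustrative choices, not a recommendation from the talk.

```python
def decode_text(raw, fallbacks=("cp1251", "latin-1")):
    """Decode bytes as UTF-8 first, then try locale-specific fallbacks in order."""
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text readable and mark the undecodable bytes
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# The same author name stored in two encodings
print(decode_text("Стрельников".encode("utf-8")))
print(decode_text("Стрельников".encode("cp1251")))
```

A single-byte fallback will accept almost any byte sequence, so a wrong guess yields readable-looking but incorrect text; the order of fallbacks therefore has to reflect the locale of the collection.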

Metadata encoding
  • This has a direct impact on how usable the result is in a given context
  • Examples of metadata standards:
      • OAI-DC
      • SWAP
      • LOM
      • OAI-ORE
      • MARC
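
As an illustration of one of these encodings, the sketch below serialises a few extracted fields as an OAI-DC record using Python's standard `xml.etree.ElementTree`. The two namespace URIs are the published OAI-DC and Dublin Core ones; the helper name `to_oai_dc` and the choice of fields are assumptions made for the example, and schema validation is omitted.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def to_oai_dc(fields):
    """Build an oai_dc:dc element from (Dublin Core element name, value) pairs."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC}}}{name}")
        child.text = value
    return root


# Values taken from this talk's title slide
fields = [
    ("title", "Bibliographic metadata (including citation)"),
    ("creator", "Alexey Strelnikov"),
    ("date", "2009-07-07"),
]
print(ET.tostring(to_oai_dc(fields), encoding="unicode"))
```

The same list of fields could be mapped to another standard by swapping the serialiser, which is where output plugins become useful.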

Locale specifics
  • Country- and culture-specific formats of text elements
  • For example:
      • Right-to-left languages
      • Date format:
          • dd/mm/yyyy
          • mm/dd/yyyy
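
The date-format problem can be made concrete with a short sketch that tries both readings of a slash-separated date. The function name `interpretations` and the two sample strings are illustrative only.

```python
from datetime import datetime

# The two conventions named on the slide, mapped to strptime patterns
FORMATS = {"dd/mm/yyyy": "%d/%m/%Y", "mm/dd/yyyy": "%m/%d/%Y"}


def interpretations(text):
    """Return every valid reading of a slash-separated date, keyed by format."""
    readings = {}
    for label, pattern in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, pattern).date()
        except ValueError:
            pass  # not a valid date under this convention
    return readings


print(interpretations("07/08/2009"))  # two different dates: ambiguous
print(interpretations("25/12/2009"))  # only one convention fits
```

When both readings are valid and differ, the string alone cannot settle the question, so the extractor needs the document's locale or has to lower its confidence for that field.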

Citation and reference formats
  • There are many citation/reference formats, and most research fields have their own standards
  • For example:
      • APA: social sciences
      • MLA: literature and the arts
      • AMA: biology
      • Turabian: multi-field
      • Chicago standard: publications
      • Harvard, Numerical, MHRA: multi-field
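
Because the format differs between fields, a system may first have to guess which family of style a document uses. The sketch below is a rough heuristic that only separates numeric in-text citations such as [3] from author-year ones such as (Smith, 2009); both patterns, the function name `guess_citation_style` and the sample sentences are assumptions for illustration, and the heuristic does not identify specific styles such as APA or MLA.

```python
import re

# Numeric citations, e.g. [3] or [1, 4-6]
NUMERIC = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
# Single author-year citations, e.g. (Smith, 2009) or (Doe and Roe, 2008)
AUTHOR_YEAR = re.compile(r"\([A-Z][^()\d]*,?\s\d{4}[a-z]?\)")


def guess_citation_style(text):
    """Guess 'numeric' or 'author-year' from in-text citations, else 'unknown'."""
    numeric = len(NUMERIC.findall(text))
    author_year = len(AUTHOR_YEAR.findall(text))
    if numeric == author_year:
        return "unknown"
    return "numeric" if numeric > author_year else "author-year"


print(guess_citation_style("As shown in [1] and [2, 3], extraction helps."))
print(guess_citation_style("Earlier work (Smith, 2009) and (Doe and Roe, 2008) agrees."))
```

Grouped citations such as (Smith, 2009; Doe, 2008) are not matched by this pattern, which shows how quickly hand-written rules need extending.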

Tools
  • Automated metadata extraction is a workflow that involves several interconnected software systems
  • It helps to overcome the heterogeneity of standards

Examples of Tools
  • Examples of existing tools:
      • DC-dot (variety of doc/web formats -> DC metadata)
      • DepositPlait (var. format metadata -> metadata repository)
      • DataFountains (var. format -> metadata)
      • paperBase (prototype concentrating on eprint documents)

Issues
  • Full-text resource availability
  • Readability of the text
  • Legal issues
  • Engineering constraints: machine suggestions might be imperfect
  • Language & localization: the system needs retraining for each other locale

Recommendations
  • A robust system that is easy to retrain, with customizable input and output plugins
      • A potential gain:
          • Simpler (re)extraction of metadata, faster repository operations, validation
  • Make use of the confidence level assigned to each metadata field
      • A potential gain:
          • Identifying possibly incorrect metadata records

Recommendations (2)
  • Make the full-text document available to the system
      • A potential gain:
          • Periodic re-exploration of the resource and updating of the metadata
  • Investigate the problem of analysing citations
      • A potential gain:
          • Assess the level of similarity between papers
          • Classify the nature of a paper
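
One possible reading of "level of similarity between papers" is the overlap between their reference lists. The sketch below uses the Jaccard index for that; this measure is swapped in for illustration and is not named in the talk, and the function name `citation_similarity` and the reference identifiers are hypothetical.

```python
def citation_similarity(refs_a, refs_b):
    """Jaccard index of two reference lists: shared references / all references."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # no references on either side: treat as no evidence of similarity
    return len(a & b) / len(a | b)


# Hypothetical identifiers for already-normalised references
paper_a = ["ref-1", "ref-2", "ref-3"]
paper_b = ["ref-2", "ref-3", "ref-4"]
print(citation_similarity(paper_a, paper_b))
```

The measure is only as good as the reference matching behind it: two citations of the same work must first be normalised to the same identifier, which is itself an extraction problem.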

Q&A
  • Thank you for your attention