
Published in: Education, Technology

  1. Automatic Metadata Generation through Extraction
     • LIS 688 / Spring 2011
     • Group 1: Mendy Ozan, Eunae Chae, and Vanessa Smith
  2. Outline
     • Overview of automatic metadata generation
     • Differences between harvesting and extraction
     • Examination of extraction (6 literature reviews)
     • Conclusion
  3. Background
     • Web-based information resources are increasing
     • Limited resources: labor and money
     • “For libraries to advance and take leadership in the bibliographic control of web resources, they must investigate more efficient and less costly metadata creation methods.” (Greenberg, Spurgin, and Crystal, 2005)
  4. Benefits
     • Helps serve the need for more efficient and less costly metadata creation
     • Increases consistency in records, making them more searchable and interoperable
     • Cuts down on the amount of human labor
  5. Main methods
     • Harvesting: relies on machine-enabled collection of previously tagged or populated metadata fields
     • Extraction: uses complex indexing algorithms and mining techniques to read the contents of a resource, analyze this content, and extract information for application to a metadata schema
     • Main difference: the type of content that is read and analyzed by the program
  6. More about extraction
     • Rule-based systems based on natural language processing: used to extract metadata from educational materials
     • Machine learning methods: used to extract titles from general documents
     • Weakness: many methods extract metadata mainly from the first page of a document and not from its inner pages
  7. Literature reviews (1)
     • Automatic Metadata Retrieval from Ancient Manuscripts, by Le Bourgeois, F. & Kaileh, H. (2004)
     • Aimed at automatically processing digitized manuscripts with a generic platform that can be used by non-specialists in image processing and pattern recognition
     • Finding: the source of the image was the main factor in whether the automatically generated metadata was good; image quality is the key
  8. Literature reviews (2)
     • Metadata Extraction and Harvesting: Klarity &
     • Klarity: an extraction application that takes metadata from the actual textual content of the web resource
  9. Generates five metadata elements: identifier, title, concepts, keywords, and descriptions
  10. Results: a smaller amount of text creates better metadata; character limit in the title field (>100)
  11. Literature reviews (2)
      • Uses the existing metadata that is housed in the META tags
  12. Finds the resource “identifier” from the web browser’s address bar and pulls the remaining elements from the source code metadata
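The harvesting approach described on this slide can be sketched in a few lines. This is only an illustration using Python's standard `html.parser` module, not the reviewed application's actual implementation; the class name `MetaTagHarvester`, the function `harvest`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class MetaTagHarvester(HTMLParser):
    """Collects name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Keep only tags that carry both a field name and a value.
        if name and content:
            self.metadata[name.lower()] = content


def harvest(url, html):
    """Build a record: META tag fields plus the URL as the identifier."""
    harvester = MetaTagHarvester()
    harvester.feed(html)
    record = dict(harvester.metadata)
    # The identifier comes from the address, not from the page source.
    record["identifier"] = url
    return record


# Made-up sample page for illustration only.
sample_html = """<html><head>
<meta name="DC.title" content="Sample Page">
<meta name="keywords" content="metadata, harvesting">
</head><body>Hello</body></html>"""

record = harvest("http://example.org/page", sample_html)
print(record)
```

Nothing in the body text is read here, which is the point of contrast with extraction: harvesting only reuses fields that someone has already populated.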
  13. Literature reviews (2)
      • Both still require human oversight and input
      • Scored low on their accuracy evaluations (86.2% for Klarity, 78.6% for the harvesting application)
  14. Literature reviews (3)
      • New possibilities for metadata creation in an institutional repository context
      • Institutional repository: archives, distributes, manages, and preserves research efforts
      • Focused on two methods of metadata creation: text mining and machine learning
  15. Literature reviews (3)
      • Text mining: the process by which a computer recognizes the similarities between objects based on textual content
      • Text clustering: the process by which the system looks for the frequency of specific types of words within a resource and then compares that count to other resources
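A rough sketch of the word-frequency comparison behind text clustering follows. The slide does not say how the counts are compared, so cosine similarity is used here purely as an assumed, common choice; the function names and the three sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Compare two word-count tables; 1.0 means identical proportions."""
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical resources for illustration.
docs = {
    "thesis_a": "metadata extraction from digital library documents",
    "thesis_b": "automatic metadata extraction for library resources",
    "recipe": "bake the bread with flour water and salt",
}
counts = {name: word_counts(text) for name, text in docs.items()}

sim_ab = cosine_similarity(counts["thesis_a"], counts["thesis_b"])
sim_ar = cosine_similarity(counts["thesis_a"], counts["recipe"])
print(f"thesis_a vs thesis_b: {sim_ab:.2f}")
print(f"thesis_a vs recipe:   {sim_ar:.2f}")
```

Resources whose scores are high relative to the rest would be grouped together, and metadata from one member of a group could then suggest values for the others.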
  16. Literature reviews (3)
      • Machine learning: a process where problems are given to the computer along with their solutions
      • After processing enough examples, the computer begins to use the “rules”
  17. Number of characters, nouns, verbs per line, and adjectives per line can all signal the system to identify specific elements
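To make the idea of learning from solved examples concrete, here is a toy sketch. It is not the method from the reviewed paper: it uses a nearest-neighbour comparison as an assumed stand-in, and it only computes surface features (characters, words, share of capitalised words), because counting nouns, verbs, or adjectives would need a part-of-speech tagger. All names and training lines are hypothetical.

```python
import math


def line_features(line):
    """Surface features of one line: length, word count, capitalised share."""
    words = line.split()
    capitalised = sum(1 for word in words if word[0].isupper())
    share = capitalised / len(words) if words else 0.0
    return (len(line), len(words), share)


def nearest_label(line, labelled_examples):
    """Return the label of the training line whose features are closest."""
    target = line_features(line)
    best_label, best_distance = None, math.inf
    for example, label in labelled_examples:
        distance = math.dist(target, line_features(example))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Problems (lines) given together with their solutions (labels).
training = [
    ("Automatic Metadata Generation through Extraction", "title"),
    ("Metadata Extraction and Harvesting", "title"),
    ("This paper examines how software can read the contents of a resource and describe it.", "body"),
    ("We compare two applications and report the accuracy of the records they produce.", "body"),
]

print(nearest_label("Automated Document Metadata Extraction", training))
print(nearest_label(
    "The system learns rules from records that already have good metadata attached to them.",
    training,
))
```

With only four examples the line length dominates the decision; a realistic system would need many more labelled records and scaled features.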
  18. Literature reviews (3)
      • Findings: records with good metadata can be taught to the system; from those records, similar metadata could be created for other records
  19. Literature reviews (4)
      • Automatic generation and extraction
      • Provides an overview of metadata and automatic generation
      • Gives more detail about DescribeThis
      • Using the Dublin Core format, it creates records that web administrators can use in describing their own web content
  20. Literature reviews (5)
      • Automated document metadata extraction
      • Focused on theses and dissertations
      • Nature of these documents: standard headers, titles, table of contents, abstract, acknowledgement, preface, introduction, conclusion, and references
      • These preexisting categories aid automatic generation
      • Rule-based system; extracts more metadata
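A minimal sketch of how the regular structure of theses can drive rule-based extraction is given below. The single heading rule, the section list, and the sample text are assumptions made for illustration and do not reproduce the rules of the reviewed system.

```python
import re

# Section names taken from the slide; the matching rule itself is assumed.
SECTION_NAMES = [
    "abstract", "acknowledgement", "preface", "table of contents",
    "introduction", "conclusion", "references",
]
# A heading is a line holding only a section name, optionally numbered
# ("1. Introduction") and optionally plural ("Conclusions").
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(SECTION_NAMES) + r")s?\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Map each recognised section name to the text that follows it."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


# Made-up thesis fragment for illustration.
sample = """A Study of Metadata
Abstract
This thesis studies metadata.
1. Introduction
Metadata describes resources.
Conclusions
Rules work well on regular layouts.
References
Greenberg, J. (2004).
"""

sections = split_sections(sample)
print(sections["abstract"])
print(sorted(sections))
```

Rules like this work only because the documents share a predictable layout; a resource without these headings would yield an empty result.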
  21. Literature reviews (6)
      • Application of semi-automatic metadata generation in libraries
      • Metadata creation through partial reliance on software in combination with human processing
      • Findings: the application has not yet been fully exploited; a gap still exists between experimental studies and real-world use
      • More research is needed on the development of automatic metadata generation in practical settings
  22. Conclusion
      • Automatic generation plays a critical role given the enormous volume of online and digital resources
      • It holds great promise for increasing the efficiency and consistency of metadata generation
      • But many metadata extraction programs are limited and not yet reliable in real-world settings
      • The key to solving the problem: increased standardization in formats and schemas
  23. References
      • Le Bourgeois, F., & Kaileh, H. (2004). Automatic Metadata Retrieval from Ancient Manuscripts. In A. Dengel (Ed.), Document Analysis Systems VI. Berlin: Springer Berlin / Heidelberg.
      • Greenberg, J. (2004). Metadata Extraction and Harvesting: A Comparison of Two Automatic Metadata Generation Applications. Journal of Internet Cataloging, 6(4), 59-82.
      • Al-Digeil, M., Burk, A., Forest, D., & Whitney, J. (2007). New possibilities for metadata creation in an institutional repository context. OCLC Systems & Services: International Digital Library Perspectives, 23(4), 403-410.
  24. References
      • Noufal, P. (2005). Automatic generation and extraction. In: 7th MANLIBNET Annual National Convention on Digital Libraries in Knowledge Management: Opportunities for Management Libraries, Indian Institute of Management Kozhikode (pp. 319-327). Kolkata: IIM Kolkata.
      • Ojokoh, B. A., Adewale, O. S., & Falaki, S. O. (2009). Automated document metadata extraction. Journal of Information Science, 563-570.
      • Park, J.-r., & Lu, C. (2009). Application of semi-automatic metadata generation in libraries: Types, tools and techniques. Library & Information Science Research, 225-231.