Automatic Metadata Generation using Associative Networks


In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering the effectiveness of digital information services. Many of the existing mechanisms for the automated creation of metadata rely primarily on content analysis, which can be costly and inefficient. The automatic metadata generation system proposed in this article leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. Because of its independence from content analysis, it can be applied to a wide variety of resource media types and is shown to be computationally inexpensive. The proposed method operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an associative network of repository resources leveraging existing repository metadata. Second, using the associative network as a substrate, metadata associated with metadata-rich resources is propagated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This article discusses the general framework for building associative networks, an algorithm for disseminating metadata through such networks, and the results of an experiment and validation of the proposed method using a standard bibliographic dataset.

Published in: Technology
Automatic Metadata Generation using Associative Networks

  1. Automatic Metadata Generation Using Associative Networks. Marko A. Rodriguez, CCS-3 ‘Tech Talk’, December 7, 2005
  2. Resources and Metadata
       • A resource is any digital object (e.g. manuscripts, images, video, audio).
       • A resource’s metadata record is a list of attributes describing the resource.
       • Example manuscript metadata: authors, institutions, keywords, subject categories, citations, year, publishing journal, usage data.
  3. Metadata Record (attribute values lost in extraction are shown as "…")
       <?xml version="1.0" encoding="UTF-8" ?>
       <OAI-PMH xmlns="…" xmlns:xsi="…">
         <responseDate>2005-09-07T15:25:04Z</responseDate>
         <request verb="GetRecord" identifier="…" metadataPrefix="oai_dc"></request>
         <GetRecord>
           <record>
             <header>
               <identifier></identifier>
               <datestamp>2004-12-14</datestamp>
               <setSpec>cs</setSpec>
             </header>
             <metadata>
               <oai_dc:dc xmlns:oai_dc="…">
                 <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
                 <dc:creator>Rodriguez, Marko</dc:creator>
                 <dc:creator>Steinbock, Daniel</dc:creator>
                 <dc:subject>Computers and Society</dc:subject>
                 <dc:subject>Data Structures and Algorithms</dc:subject>
                 <dc:subject>Human-Computer Interaction</dc:subject>
                 <dc:subject>H.4.2</dc:subject>
                 <dc:subject>J.7</dc:subject>
                 <dc:subject>K.4.m</dc:subject>
                 <dc:description>In societal-scale decision-making systems the collective is faced ...</dc:description>
                 <dc:description>Comment: Dynamically Distributed Democracy algorithm</dc:description>
                 <dc:date>2004-12-10</dc:date>
                 <dc:type>text</dc:type>
                 <dc:identifier></dc:identifier>
                 <dc:identifier>North American Association for Computational Social and Organizational Science Conference Proceedings 2004</dc:identifier>
               </oai_dc:dc>
             </metadata>
           </record>
         </GetRecord>
       </OAI-PMH>
  4. Problem Statement
       • Metadata is costly to generate by hand.
       • Metadata is hard to extract from raw resources (e.g. audio, video).
       • How can we automatically generate metadata for atrophied resource records?
  5. General System Overview
       • Generate resource relations from the metadata that already exists in the repository.
           • Occurrence and/or co-occurrence networks
       • Propagate metadata from metadata-rich resources to metadata-poor resources.
           • Encapsulate metadata in discrete particles and disseminate them over the generated associative network.
  6. HEP-TH 2003 Semantic Network
  7. Transforming the Semantic Network
       • Convert the multi-node network into a collection of manuscripts, each with its associated attributes (its metadata record).
       • Manuscript (the resource), with its metadata record:
           • Authors
           • Citations
           • Publication Date
           • Keywords
           • Organizations
           • Journal
  8. Occurrence/Co-Occurrence
       • Citation: two manuscripts are connected if one manuscript cites the other.
       • Co-Author: two manuscripts are connected if they share the same authors.
       • Co-Citation: two manuscripts are connected if they share the same citations.
       • Co-Keyword: two manuscripts are connected if they share the same keywords.
       • Co-Organization: two manuscripts are connected if they share the same organizations.
       • Co-Date: two manuscripts are connected if they share the same publication date.
       • Co-Journal: two manuscripts are connected if they share the same journal.
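The edge definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the identifiers `m1` to `m3`, the function names, and the choice to weight a co-occurrence edge by the number of shared values are all hypothetical and are not taken from the talk.

```python
from itertools import combinations

# Hypothetical metadata records; identifiers and values are invented for illustration.
records = {
    "m1": {"authors": {"Rodriguez", "Bollen"}, "keywords": {"metadata"}, "citations": {"m2"}},
    "m2": {"authors": {"Bollen"}, "keywords": {"usage"}, "citations": set()},
    "m3": {"authors": {"Steinbock"}, "keywords": {"metadata"}, "citations": {"m2"}},
}

def occurrence_network(records, field="citations"):
    """Connect a resource to every resource its own record references directly."""
    edges = set()
    for rid, rec in records.items():
        for target in rec.get(field, set()):
            if target in records and target != rid:
                edges.add(frozenset((rid, target)))  # undirected edge
    return edges

def co_occurrence_network(records, field):
    """Connect two resources that share values in `field`.

    Assumption: the edge weight is the number of shared values.
    """
    weights = {}
    for a, b in combinations(sorted(records), 2):
        shared = records[a].get(field, set()) & records[b].get(field, set())
        if shared:
            weights[frozenset((a, b))] = len(shared)
    return weights

citation_edges = occurrence_network(records)
co_author_edges = co_occurrence_network(records, "authors")
co_keyword_edges = co_occurrence_network(records, "keywords")
```

Passing a different field name to `co_occurrence_network` yields the other co-occurrence networks listed above (co-citation, co-organization, and so on), provided the records carry that field.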
  9. Network Generation Running Times
       • Occurrence: O(N)
           • Each resource’s metadata record must be checked once and only once for a direct reference to another resource.
       • Co-occurrence: O((N^2 - N) / 2), which grows quadratically in N
           • Each resource’s metadata record must be checked against every other resource’s (N^2), except its own (-N), once and only once (1/2).
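As a quick arithmetic check of the co-occurrence count, the following sketch compares the (N^2 - N) / 2 expression with a direct enumeration of unordered pairs. The function names are hypothetical and chosen only for this example.

```python
from itertools import combinations

def occurrence_checks(n):
    """One pass over the records: each is inspected exactly once."""
    return n

def co_occurrence_checks(n):
    """Pairwise comparisons: all ordered pairs (n^2), minus self-pairs (n), halved."""
    return (n * n - n) // 2

n = 1000
enumerated_pairs = sum(1 for _ in combinations(range(n), 2))
print(occurrence_checks(n), co_occurrence_checks(n), enumerated_pairs)
```

For N = 1000 the pairwise count is 499,500 comparisons against 1,000 record checks for the occurrence network, which is why the co-occurrence step dominates as the repository grows.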
  10. Particle Propagation
       • Every resource is given one particle, p_i. This particle contains all the metadata associated with its resource.
       • A particle also has an energy value, e_i. The further the particle travels (in edge steps), the more its energy value decays.
           • e_i(t+1) = e_i(t) * (1 - delta)
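The decay rule can be traced step by step with a short sketch. The starting energy of 1.0 and the function names are assumptions made for this example; the slide only gives the update rule itself. Repeating the update t times is equivalent to the closed form e_i(t) = e_i(0) * (1 - delta)^t.

```python
def decay(energy, delta):
    """One edge step: e(t+1) = e(t) * (1 - delta)."""
    return energy * (1.0 - delta)

def energy_after(steps, delta, initial=1.0):
    """Energy remaining after `steps` edge steps (initial=1.0 is an assumption)."""
    energy = initial
    for _ in range(steps):
        energy = decay(energy, delta)
    return energy

# Energy along a five-step walk with delta = 0.5
trace = [energy_after(t, 0.5) for t in range(6)]
```

A larger delta confines a particle's influence to its immediate neighbourhood, while a smaller delta lets its metadata reach more distant resources at lower strength.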
  11. Particle Propagation
       • At each step, the particle takes an outgoing edge of its current node, chosen according to the probability distribution over that node’s outgoing edge set. If the resource it encounters lacks metadata of a particular type, the particle recommends its own metadata of that type to the resource, weighted by the particle’s current energy value.
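A minimal sketch of one particle's walk is given below. Several details are assumptions rather than statements from the talk: the network is stored as a dictionary of weighted neighbours, energy starts at 1.0, decay is applied on each step before the recommendation is made, and repeated visits add to the recommendation strength. The function name `propagate` and the sample data are hypothetical.

```python
import random
from collections import defaultdict

def propagate(network, metadata, source, steps, delta, rng):
    """Walk one particle from `source`; return {(node, field, value): strength}.

    network:  {node: {neighbour: edge_weight}}
    metadata: {node: {field: set_of_values}}
    """
    recommendations = defaultdict(float)
    payload = metadata[source]      # the particle carries its resource's metadata
    node, energy = source, 1.0      # assumed starting energy
    for _ in range(steps):
        neighbours = network.get(node)
        if not neighbours:
            break                   # dead end: the particle stops
        targets = list(neighbours)
        weights = [neighbours[t] for t in targets]
        # Choose the outgoing edge in proportion to its weight.
        node = rng.choices(targets, weights=weights, k=1)[0]
        energy *= (1.0 - delta)     # assumed: decay before recommending
        for field, values in payload.items():
            if not metadata.get(node, {}).get(field):  # resource lacks this type
                for value in values:
                    recommendations[(node, field, value)] += energy
    return recommendations

# Hypothetical two-node example: "a" has a journal, "b" does not.
net = {"a": {"b": 1.0}, "b": {"a": 1.0}}
meta = {"a": {"journal": {"J"}}, "b": {"journal": set()}}
recs = propagate(net, meta, "a", steps=3, delta=0.2, rng=random.Random(0))
```

In this toy network the particle bounces between the two nodes, so resource `b` receives the journal recommendation on the first and third steps, with strengths 0.8 and 0.512.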
  12. Metadata Recommendations
       • Manuscript A
           • Journal (recommendation strength in brackets)
               • Journal of Complexity [0.2457]
               • Journal of Information Science [0.1]
               • Information Processing and Management [0.001]
  13. Mini-Break
  14. Terrorist Alert
  15. System Parameters
       • Metadata Density: to validate the algorithm, we remove a percentage of the metadata in the system and see whether the algorithm can reconstruct it (d in [0,1]).
       • Metadata Percentile: only those metadata tags in the p-th percentile are accepted as valid metadata (p in [0,1]).
       • Validation is based on precision and recall values.
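The validation loop described on this slide could look roughly like the sketch below. The interpretations are assumptions: density d is treated as the fraction of records that keep the field, and the percentile filter keeps values whose strength ranks at or above the p-th percentile of a resource's recommendations. All function names, the sample strengths reused from the recommendations slide, and the "true" journal are hypothetical.

```python
import random

def hold_out(metadata, field, density, rng):
    """Blank `field` in all but a `density` fraction of records.

    Returns (damaged_copy, truth), where truth maps each blanked record
    to the values that were removed.
    """
    ids = sorted(metadata)
    keep = set(rng.sample(ids, round(density * len(ids))))
    damaged, truth = {}, {}
    for rid in ids:
        rec = dict(metadata[rid])
        if rid not in keep:
            truth[rid] = set(rec.get(field, set()))
            rec[field] = set()
        damaged[rid] = rec
    return damaged, truth

def accept(strengths, p):
    """Keep values whose strength is at or above the p-th percentile."""
    if not strengths:
        return set()
    ranked = sorted(strengths.values())
    cut = min(int(p * len(ranked)), len(ranked) - 1)
    threshold = ranked[cut]
    return {value for value, s in strengths.items() if s >= threshold}

def precision_recall(accepted, truth):
    """Precision and recall of accepted values against the held-out truth."""
    hits = len(accepted & truth)
    precision = hits / len(accepted) if accepted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

strengths = {
    "Journal of Complexity": 0.2457,
    "Journal of Information Science": 0.1,
    "Information Processing and Management": 0.001,
}
accepted = accept(strengths, 0.5)
# Hypothetical ground truth for the example manuscript.
scores = precision_recall(accepted, {"Journal of Complexity"})
```

Raising p accepts fewer, stronger recommendations, which would generally be expected to trade recall for precision; lowering d leaves less metadata to propagate from.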
  16. Results for Co-Author Network (Citation Metadata)
  17. Results for Co-Author Network (Organization Metadata)
  18. Results for Co-Author Network (Keyword Metadata)
  19. Results for Co-Keyword Network (Citation Metadata)
  20. Results for Co-Keyword Network (Journal Metadata)
  21. Results for Citation Network (Author Metadata)
  22. Results for Citation Network (Keyword Metadata)
  23. Results for Citation Network (Journal Metadata)
  24. Take Home Points
       • Different edge types are better at propagating different metadata types.
       • The method can work for any resource type as long as there is some preliminary vetted metadata and a way to create resource relations. (If there is pre-existing metadata, resource relations can be created automatically.)
  25. Future Work (part 1)
       • What about path types, e.g. taking a co-author edge, then a citation edge, and so on? Would they give better precision and recall?
       • Explore usage metadata, which is applicable to any resource type and allows for cross-resource relations (e.g. manuscripts connected to audio). The weight between two resources is a function of the interval between their downloads from the same IP. (Bollen, 2004)
  26. Future Work (part 2)
       • Application to social networks? Given an unknown individual, infer that individual’s attributes from his or her social relationships.
       • How does ‘work_with’ differ from ‘married_to’? People connected by the former may share income metadata, and people connected by the latter may share religious-belief metadata.
  27. Conclusion
       • Good life…
       • Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks”, [unpublished], 2005.
       • Know of a good journal venue?