Metadata Provenance

  1. DCMI Metadata Provenance
     Metadata Provenance: Two motivating scenarios for metametadata
     Kai Eckert (Mannheim University Library), Michael Panzer (OCLC)
     DCMI Metadata Provenance F2F Meeting and Workshop, October 20th, 2010, Pittsburgh, PA, USA
  2. Metametadata
     - Provenance information outside of existing data models
     - "Transparent"
     - Potential use cases:
       - Whenever you have lots of legacy data in a model that does not support provenance.
       - Whenever new applications require information that cannot be expressed in the existing data model.
  3. Need for Metametadata
     - Metadata are also data, so we need additional data about them: metametadata.
     - Metadata about a whole metadata record, not about single statements: metadata provenance.
       - Who created this metadata record?
       - When was this record created?
       - …
  4. Statements about (single) statements
     - Often proposed, but with only vague instructions on how to implement them.
     - Needed if metadata records are created by combining single statements from different sources.
     - Needed for storing arbitrary additional information about single statements that cannot easily be represented in the metadata format.
  5. Metametadata vs. model-based provenance
     - Simple statement: Peter knows Paul.
     - Provenance information: This statement is made by Mary.
     [Diagram: Peter -Knows-> Paul; on the metalevel, Mary "says" this statement.]
  6. Data model extension
     - Simple statement: Peter knows Paul.
     - Provenance information: This statement is made by Mary.
     [Diagram: nodes Peter, Paul, Mary, Relation and Knows Relation; edges Has Relation, Has Object, Has Creator and Has Type.]
  7. [Diagrams: the data model extension (nodes Peter, Paul, Mary, Relation, Knows Relation; edges hasRelation, Has Object, Has Creator, Has Type) shown next to the metalevel variant (Peter, Knows, Paul; Mary says).]
  8. Implementation in RDF
     - This should not be limited to RDF!
     - But it is a good example, and RDF currently has a high impact.
     - RDF provides no satisfying answer on how to express provenance information.
     - Different implementations are possible, e.g.:
       - Reification
       - Named Graphs
       - Extended data models
       - ...
  9. RDF Reification
     - RDF supports statements about statements by means of reification, literally "objectification" (actually a "subjectification"...).
     - "The book is written by Goethe" is said by Kai.
     How it is done in RDF:
       ex:someID rdf:type rdf:Statement .
       ex:someID rdf:subject "The book" .
       ex:someID rdf:predicate ex:isWrittenBy .
       ex:someID rdf:object "Goethe" .
       ex:someID ex:isSaidBy "Kai" .
     The rdf:subject, rdf:predicate and rdf:object lines carry the subject, predicate and object of the original statement.
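The reification pattern on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: triples are modelled as (subject, predicate, object) tuples of strings, no RDF library is used, and the helper name `reify` is hypothetical.

```python
def reify(statement_id, triple):
    """Return the four reification triples that describe `triple`."""
    subject, predicate, obj = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


book_triple = ("The book", "ex:isWrittenBy", "Goethe")
reified = reify("ex:someID", book_triple)

# Provenance is attached to the statement identifier, not to the book itself.
reified.append(("ex:someID", "ex:isSaidBy", "Kai"))

for triple in reified:
    print(triple)
```

The point of the sketch is that the provenance triple has `ex:someID` as its subject, so it talks about the statement rather than about "The book".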
  10. Simplified Presentation
      Example 1: A simple RDF example

          Subject   Predicate    Object
        1 ex:p123   rdf:type     ex:person
        2 ex:p123   ex:hasName   "Kai Eckert"
        3 ex:p123   ex:worksFor  ex:unima

      - Based on Notation 3 (RDF/N3)
      - Identification of statements by the line number:
          4 #1 dc:creator "Kai Eckert"
      - The subject of a statement is a reference to another statement. With this notation, we imply a reification.
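One way to read the line-number notation is as shorthand that expands into reification triples. The sketch below shows that expansion in plain Python; the function name `expand` and the `_:sN` naming of statement nodes are assumptions made for illustration, and nested references (a "#n" line that points at another "#m" line) are deliberately not handled.

```python
def expand(lines):
    """Expand numbered-line notation into plain triples.

    `lines` is a list of (subject, predicate, object) tuples; line numbers
    start at 1. A subject of the form "#n" refers to the statement on line n.
    """
    triples = []
    reified = set()
    for subject, predicate, obj in lines:
        if subject.startswith("#"):
            n = int(subject[1:])
            node = f"_:s{n}"  # assumed naming scheme for statement nodes
            if n not in reified:
                s, p, o = lines[n - 1]
                triples += [
                    (node, "rdf:type", "rdf:Statement"),
                    (node, "rdf:subject", s),
                    (node, "rdf:predicate", p),
                    (node, "rdf:object", o),
                ]
                reified.add(n)
            # The metametadata statement is about the statement node.
            triples.append((node, predicate, obj))
        else:
            triples.append((subject, predicate, obj))
    return triples


lines = [
    ("ex:p123", "rdf:type", "ex:person"),
    ("ex:p123", "ex:hasName", "Kai Eckert"),
    ("ex:p123", "ex:worksFor", "ex:unima"),
    ("#1", "dc:creator", "Kai Eckert"),
]
result = expand(lines)
```

Lines 1 to 3 pass through unchanged, while line 4 produces the four reification triples for line 1 plus the `dc:creator` statement about it.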
  11. Scenario 1: Crosswalks
      - Crosswalks define rules for how metadata from one schema are represented in a different schema.
      - Problems:
        - Loss of information
        - Erroneous crosswalks

      MARC field                                           Dublin Core element
      260$c (Date of publication, distribution, etc.)  →  Date.Created
      522 (Geographic Coverage Note)                   →  Coverage.Spatial
      300$a (Physical Description)                     →  Format.Extent
  12. Possibilities for Metametadata
      - Storage of additional information that would be lost in the target format.
      - Identification of the crosswalk, with its version and the specific rule, for every generated statement:
        - Which statements are generated by a specific rule?
        - Which rule is responsible for a specific (erroneous) statement?
        - Which data in the originating format was used to generate a specific statement?
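A crosswalk that records this information while it runs might look roughly like the following. The rule table is hypothetical: the deck does not define what rule 16 actually does, so mapping MARC field 245 to `dc:title` is an assumption chosen to match the example data, and `apply_crosswalk` is an illustrative name.

```python
# Hypothetical crosswalk: rule number -> (source MARC field, target predicate).
CROSSWALK_ID = 3
RULES = {16: ("245", "dc:title")}


def apply_crosswalk(doc, marc_record):
    """Return (triple, metametadata) pairs for every rule that matches."""
    generated = []
    for rule_id, (field, predicate) in RULES.items():
        if field in marc_record:
            triple = (doc, predicate, marc_record[field])
            meta = {
                "ex:rule": rule_id,            # which rule produced the triple
                "ex:crosswalk": CROSSWALK_ID,  # which crosswalk version
                "ex:origin": f"MARC:{field}",  # which source data was used
            }
            generated.append((triple, meta))
    return generated


result = apply_crosswalk("ex:docbase/doc1", {"245": "Example title"})
```

Each generated statement carries enough metametadata to answer the three questions above without touching the target data model.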
  13. Example 1: Crosswalk Data

           Subject          Predicate     Object
         1 ex:docbase/doc1  dc:title      "Example title"
         2 #1               ex:rule       16
         3 #1               ex:crosswalk  3
         4 #1               ex:origin     MARC:245
         5 ex:docbase/doc2  dc:title      "About finding a title"
         6 #5               ex:rule       16
         7 #5               ex:crosswalk  3
         8 #5               ex:origin     MARC:245
         9 ex:docbase/doc3  dc:title      "Lorem ipsum dolor"
        10 #9               ex:rule       18
        11 #9               ex:crosswalk  3
        12 #9               ex:origin     MARC:245
        13 #9               ex:origin     MARC:246
        14 ex:docbase/doc4  dc:title      "Consetetur Sadipscing"
        15 #14              ex:rule       19
        16 #14              ex:crosswalk  6
        17 #14              ex:origin     xml:/record/description

      Example 4: Resulting RDF statements with additional Metametadata
  14. Crosswalk Updates
      - Which statements are generated by a given rule and need to be regenerated after an update?

        SELECT ?document ?field ?value
        WHERE {
          ?t rdf:subject ?document .
          ?t rdf:predicate ?field .
          ?t rdf:object ?value .
          ?t ex:rule 16 .
          ?t ex:crosswalk 3 .
        }

      document         field                            value
      ex:docbase/doc1  http://www.example.org/dc#title  "Example title"
      ex:docbase/doc2  http://www.example.org/dc#title  "About finding a title"
  15. Crosswalk Debugging
      - Which rule is responsible for a given statement and what was the original data?

        SELECT ?crosswalk ?rule ?origin
        WHERE {
          ?t rdf:subject <ex:docbase/doc1> .
          ?t rdf:predicate dc:title .
          ?t rdf:object "Example title" .
          ?t ex:rule ?rule .
          ?t ex:crosswalk ?crosswalk .
          ?t ex:origin ?origin .
        }

      crosswalk  rule  origin
      3          16    "MARC:245"
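The two crosswalk questions (which statements a rule generated, and which rule generated a statement) can also be mimicked without a triple store. The following plain-Python sketch uses the data from the crosswalk example; the record layout and the function names `generated_by` and `explain` are assumptions for illustration and are not a replacement for the SPARQL queries.

```python
# Statements from the crosswalk example, each with its metametadata.
STATEMENTS = [
    {"triple": ("ex:docbase/doc1", "dc:title", "Example title"),
     "rule": 16, "crosswalk": 3, "origins": ["MARC:245"]},
    {"triple": ("ex:docbase/doc2", "dc:title", "About finding a title"),
     "rule": 16, "crosswalk": 3, "origins": ["MARC:245"]},
    {"triple": ("ex:docbase/doc3", "dc:title", "Lorem ipsum dolor"),
     "rule": 18, "crosswalk": 3, "origins": ["MARC:245", "MARC:246"]},
    {"triple": ("ex:docbase/doc4", "dc:title", "Consetetur Sadipscing"),
     "rule": 19, "crosswalk": 6, "origins": ["xml:/record/description"]},
]


def generated_by(rule, crosswalk):
    """Update case: statements to regenerate after a rule changes."""
    return [s["triple"] for s in STATEMENTS
            if s["rule"] == rule and s["crosswalk"] == crosswalk]


def explain(triple):
    """Debugging case: (crosswalk, rule, origin) rows for one statement."""
    return [(s["crosswalk"], s["rule"], origin)
            for s in STATEMENTS if s["triple"] == triple
            for origin in s["origins"]]


to_regenerate = generated_by(16, 3)
blame = explain(("ex:docbase/doc1", "dc:title", "Example title"))
```

A statement with several origins, such as the one for doc3, yields one row per origin, which matches how the SPARQL query would bind `?origin` once per `ex:origin` triple.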
  16. Scenario 2: Different Sources for Metadata
      - Manual indexing is costly.
      - Many documents are not indexed at all or are not searchable:
        - Journal articles
        - Externally owned documents
        - Working papers
        - Webpages
      - New sources for metadata?
  17. New ways for document indexing
      - Automatic processes
      - Tagging
      - (Automatic) mapping of metadata from external sources
      - Problem: Lack of quality
        - How do you integrate these data from different sources without compromising the retrieval quality?
  18. Possibilities for Metametadata
      - Storage of the source of single statements.
      - Storage of further source-specific information:
        - Weighting for automatically generated subject headings.
        - Number of users who tagged a document with a given tag.
        - The original subject heading in case of an automatic translation or mapping.
      - Can we use this additional information to improve document retrieval?
  19. Example 2: Subject indexing

           Subject                Predicate   Object
         1 ex:docbase/doc1        dc:subject  ex:thes/sub20
         2 #1                     ex:source   ex:sources/autoindex1
         3 #1                     ex:rank     0.55
         4 ex:docbase/doc1        dc:subject  ex:thes/sub30
         5 #4                     ex:source   ex:sources/autoindex1
         6 #4                     ex:rank     0.8
         7 ex:docbase/doc1        dc:subject  ex:thes/sub30
         8 #7                     ex:source   ex:sources/pfeffer
         9 #7                     ex:rank     1.0
        10 ex:docbase/doc1        dc:subject  ex:thes/sub40
        11 #10                    ex:source   ex:sources/pfeffer
        12 #10                    ex:rank     1.0
        13 ex:sources/autoindex1  ex:type     ex:types/auto
        14 ex:sources/pfeffer     ex:type     ex:types/manual

      Example 7: Subject assignments by different sources
  20. Backward compatibility
      - While there are four assignments for subject headings, the statement
          ex:docbase/doc1 dc:subject ex:thes/sub30
        is still one statement, regardless of the number of times you put it into your RDF store.
      - Important for applications that access the RDF data but do not handle RDF reification.
      - Your metadata remains valid; in particular, there are no duplicates.
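The backward-compatibility argument rests on an RDF graph being a set of triples. The sketch below models the store as a Python `set` to show the effect; it assumes the plain triple is stored alongside its reifications, and the helper `assert_with_source` and the `_:sN` node names are hypothetical.

```python
store = set()  # an RDF graph is a set of triples; a Python set models that


def assert_with_source(node, triple, source):
    """Store the plain triple plus a reified copy that carries the source."""
    s, p, o = triple
    store.add(triple)  # adding the same plain triple twice has no effect
    store.update({
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
        (node, "ex:source", source),
    })


t = ("ex:docbase/doc1", "dc:subject", "ex:thes/sub30")
assert_with_source("_:s4", t, "ex:sources/autoindex1")
assert_with_source("_:s7", t, "ex:sources/pfeffer")

# What an application that ignores reification would see:
plain = [x for x in store if x[1] == "dc:subject"]
# What a provenance-aware application can additionally recover:
sources = sorted(o for (_, p, o) in store if p == "ex:source")
```

The subject assignment appears once in `plain`, while both sources remain recoverable through the two statement nodes.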
  21. Separating the sources
      - Which statements are made by a specific source (here: Pfeffer)?

        SELECT ?document ?value
        WHERE {
          ?t rdf:subject ?document .
          ?t rdf:predicate dc:subject .
          ?t rdf:object ?value .
          ?t ex:source <ex:sources/pfeffer> .
        }

      document         subject
      ex:docbase/doc1  ex:thes/sub30
      ex:docbase/doc1  ex:thes/sub40
  22. Extended queries
      - Use all manually created subject headings.
      - Use all subject headings with a rank > 0.7.

        SELECT DISTINCT ?document ?subject
        WHERE {
          ?t rdf:subject ?document .
          ?t rdf:predicate dc:subject .
          ?t rdf:object ?subject .
          ?t ex:source ?source .
          ?source ex:type ?type .
          ?t ex:rank ?rank .
          FILTER ( ?type = <ex:types/manual> || ?rank > 0.7 )
        }

      document         subject
      ex:docbase/doc1  ex:thes/sub30
      ex:docbase/doc1  ex:thes/sub40
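The filter logic of this query can be reproduced in plain Python over the subject-indexing example data. This is a sketch for illustration only: the flattened tuple layout and the function name `trusted_subjects` are assumptions, and the set comprehension stands in for `SELECT DISTINCT`.

```python
# (document, subject, source, rank) rows from the subject-indexing example.
ASSIGNMENTS = [
    ("ex:docbase/doc1", "ex:thes/sub20", "ex:sources/autoindex1", 0.55),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/autoindex1", 0.8),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/pfeffer", 1.0),
    ("ex:docbase/doc1", "ex:thes/sub40", "ex:sources/pfeffer", 1.0),
]
SOURCE_TYPE = {
    "ex:sources/autoindex1": "ex:types/auto",
    "ex:sources/pfeffer": "ex:types/manual",
}


def trusted_subjects(min_rank=0.7):
    """Keep manual assignments, plus any assignment ranked above min_rank."""
    hits = {
        (doc, subject)  # a set removes duplicates, like SELECT DISTINCT
        for doc, subject, source, rank in ASSIGNMENTS
        if SOURCE_TYPE[source] == "ex:types/manual" or rank > min_rank
    }
    return sorted(hits)


selected = trusted_subjects()
```

With the default threshold, `ex:thes/sub20` is dropped because it is automatic and ranked 0.55, and `ex:thes/sub30` is returned once even though two sources assigned it.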
  23. Conclusion
      - Many applications in the library field can be realized with metametadata.
      - No change to the underlying data models is needed.
      - But:
        - Reification is not well accepted in the community.
        - Named graphs are not (yet) part of the RDF standard.
        - ...
      - Existing approaches are usable, but users need more guidance on how to implement them.
      - Metametadata is not always the appropriate solution (meta-level complexity vs. data model complexity).