Significant characteristics in Planets


            Manfred Thaller
           Universität zu* Köln

          *Universit...
What are “significant characteristics”?

Those properties of a digital file which have to
  be known to enable the process...
Why extract them by software?

To create technical metadata as required by
  organizational models for long term
  preserv...
Within Planets …

… served by solutions to identify formats:
  formats registry / PRONOM / DROID.

… and a solution for ex...
A Vision
            Extractor                   tiff XCDL
  tiff

                                  93%
Migrator         ...
A Vision
 Extractor




                    Comparator
Appropriate XCELs

                        C-Set
Why automate?

1 million objects: use one second for each.

== 16666.7 minutes == 277.8 hours

== 11.57 working days of a ...
Why automate?

1 million objects: use five minutes for each.

== 83 333.3 hours

== 10 416.7 8-hour days for a Human
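
The arithmetic on the two "Why automate?" slides can be checked with a few lines of Python. This is only a sketch: the function name `workload` and the dictionary keys are illustrative and do not come from the slides. It assumes a computer works 24 hours a day and a human 8 hours a day, as the slides do.

```python
# Workload estimate for handling one million objects one at a time.
OBJECTS = 1_000_000


def workload(seconds_per_object: float) -> dict:
    """Return the total effort in several units for a given time per object."""
    total_seconds = OBJECTS * seconds_per_object
    hours = total_seconds / 3600
    return {
        "minutes": total_seconds / 60,
        "hours": hours,
        "computer_days_24h": hours / 24,  # a machine runs around the clock
        "human_days_8h": hours / 8,       # a person works 8-hour days
    }


if __name__ == "__main__":
    for label, seconds in (("one second", 1), ("five minutes", 5 * 60)):
        result = workload(seconds)
        print(label, {unit: round(value, 1) for unit, value in result.items()})
```

For the one-second case this reproduces the figures on the slide (277.8 hours, 34.7 eight-hour days); the same function can be applied to the five-minute case.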
Why automate?

Assumption: Preservation is only feasible if the
content of two digital objects can be compared
without hu...
Demo
Abstract solution I
(1) Language to represent the complete content of a digital object.
    XCDL
(2) Language to describe ...
<XCELDocument...>           ...                       <xcdl>
<formatDescription>....                                 <obje...
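
The XCEL/XCDL slide pairs a raw value (`0 0 1 ad`) with a lab value (`429`) of type `uint32` for the `height` property. A minimal sketch of that interpretation step follows. It assumes the raw value is a list of hexadecimal bytes in big-endian order, which is how PNG stores integers; the slide itself does not state the byte order, and the function name `interpret_uint32` is illustrative.

```python
def interpret_uint32(raw_value: str) -> int:
    """Turn a space-separated list of hex bytes into an unsigned 32-bit integer.

    Assumption: big-endian byte order (as in PNG); not stated on the slide.
    """
    data = bytes(int(token, 16) for token in raw_value.split())
    if len(data) != 4:
        raise ValueError(f"expected 4 bytes for uint32, got {len(data)}")
    return int.from_bytes(data, byteorder="big")


if __name__ == "__main__":
    # <rawValue> of the "height" property as shown on the slide
    print(interpret_uint32("0 0 1 ad"))
```

Under this assumption the raw bytes `00 00 01 ad` yield 429, matching the `<val>` shown in the `<labValue>` element.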
<request2>                                      <property id="2"
    <measurementRequest>                         ...
Abstract solution I
(1) Language to represent the complete content of a digital object.
    XCDL
(2) Language to describe ...
Are the following two items equal:




          VIII  8
eight   eight


VIII  8
otto

       eight   eight


otto   VIII  8
otto                   acht

       eight   eight


otto   VIII  8       acht
8.0
otto                         acht

       eight         eight


otto   VIII  8             acht
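
One way to read this sequence of slides: tokens that look different ("VIII", "8", "eight", "otto", "acht", "8.0") can only be compared once they are mapped onto a shared abstract value. The sketch below illustrates the idea in Python. The word table covers only the words shown on the slides, and the names `normalise` and `roman_to_int` are illustrative, not part of XCL.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Only the number words that appear on the slides (English, Italian, German).
WORDS = {"eight": 8, "otto": 8, "acht": 8}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule."""
    total = 0
    for current, following in zip(text, text[1:] + " "):
        value = ROMAN[current]
        # A smaller digit before a larger one is subtracted (e.g. IV = 4).
        total += -value if ROMAN.get(following, 0) > value else value
    return total


def normalise(token: str) -> float:
    """Map different surface representations onto one numeric value."""
    token = token.strip()
    if token.lower() in WORDS:
        return float(WORDS[token.lower()])
    if token and all(ch in ROMAN for ch in token.upper()):
        return float(roman_to_int(token.upper()))
    return float(token)  # plain decimal notation such as "8" or "8.0"


if __name__ == "__main__":
    tokens = ["VIII", "8", "eight", "otto", "acht", "8.0"]
    print({token: normalise(token) for token in tokens})
```

Once every token is expressed in the same model, the question "are these two items equal?" reduces to comparing the normalised values.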
Information model: „an image“

otto                                   acht

           eight           eight


otto     VI...
information model: „an image“

 format ontology: „what terms are
used in formats to describe image
            properties“...
Information model: „what is an image“

  Format ontology: „what terms are
  used in formats to describe image
            ...
Abstract solution II
(1) A theoretical model of information (not: data) types – “image”,
    “text”, “audio” ...

(2) Onto...
XCDL

eXtensible Characterisation Definition
  Language

Purpose: Describe the contents of a file in
  terms of an abstrac...
XCDL: text model (1)

A text (= <object>) is composed of
data (= <normData>) plus
interpretations of data according to t...
XCDL: text model (2)

Or, one level of abstraction higher, a text
is composed of content carrying tokens,
accompanied by r...
This text   is a

<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
<name>fontsize<...
This text   is a

<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
<name>fontsize<...
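
The `<refData>` element on this slide holds the text as hexadecimal byte values, and each `<ref>` points a property at a part of it. The following sketch decodes the bytes and resolves the two references shown in the transcript (`start="0" end="3"` and `start="10" end="12"`). It assumes ASCII encoding and that `start` and `end` are zero-based, inclusive offsets; the slides do not state either, and the function name `referenced_spans` is illustrative.

```python
REF_DATA_HEX = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"

# (start, end) pairs taken from the <ref> elements on the slide.
REFS = [(0, 3), (10, 12)]


def decode_ref_data(hex_bytes: str) -> str:
    """Decode space-separated hex bytes as ASCII text (assumed encoding)."""
    return bytes.fromhex(hex_bytes).decode("ascii")


def referenced_spans(text: str, refs) -> list:
    """Return the substrings addressed by inclusive (start, end) offsets."""
    return [text[start:end + 1] for start, end in refs]


if __name__ == "__main__":
    text = decode_ref_data(REF_DATA_HEX)
    print(repr(text))
    print(referenced_spans(text, REFS))
```

The bytes decode to "This is a text". Under the inclusive-offset assumption the first reference covers "This" and the second covers "tex", so either the offsets on the slide are only illustrative or the second range follows a different convention.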
Thank you!

     Questions?

Manfred.thaller@uni-koeln.de

3rd Annual WePreserve Conference Nice 2008


Significant Characteristics In Planets Manfred Thaller

  1. 1. Significant characteristics in Planets Manfred Thaller Universität zu* Köln *University at, not of, Cologne
  2. 2. What are “significant characteristics”? Those properties of a digital file which have to be known to enable the processing of the file within a specific setup.
  3. 3. Why extract them by software? To create technical metadata as required by organizational models for long-term preservation. (NLNZ)
  4. 4. Within Planets … … served by solutions to identify formats: formats registry / PRONOM / DROID. … and a solution for extracting and processing such characteristics: XCL.
  5. 5. A Vision Extractor tiff XCDL tiff 93% Migrator Comparator png tiff XCEL png XCEL png XCDL
  6. 6. A Vision Extractor Comparator Appropriate XCELs C-Set
  7. 7. Why automate? 1 million objects: use one second for each. == 16666.7 minutes == 277.8 hours == 11.57 working days of a computer == 34.7 8-hour days for a Human == 7 working weeks
  8. 8. Why automate? 1 million objects: use five minutes for each. == 83 333.3 hours == 10 416.7 8-hour days for a Human
  9. 9. Why automate? Assumption: Preservation is only feasible if the content of two digital objects can be compared without human intervention, giving a numerical estimate of their degree of similarity.
  10. 10. Demo
  11. 11. Abstract solution I (1) Language to represent the complete content of a digital object. XCDL (2) Language to describe any machine readable format in a formal language. XCEL (3) Software to extract the content of a file based upon a description as under (2) and express it in the language as specified under (1). “extractor” (4) Software to compare two such content descriptions. “comparator”
  12. 12. <XCELDocument...> ...
      <formatDescription>....
      <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
        <range><startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">4</length></range>
        <name>height</name>
      </symbol>
      <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
        <range><startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">1</length></range>
        <valueInterpretation>
          <valueLabel>greyscale</valueLabel>
          <value>0</value></valueInterpretation>
        <name>imageType</name>
      </symbol>
      <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
        <range><startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">1</length></range>
        <valueInterpretation>
          <valueLabel>zlibDeflateInflate</valueLabel>
          <value>0</value></valueInterpretation>
        <name>compression</name>
      </symbol>...

      <xcdl>
      <object id="o1">
      <normData id="nd1"> ... </normData>
      <property id="p1" source="raw" cat="descr">
        <name>compression</name>
        <valueSet id="i_i1_s6">
          <rawValue>0 </rawValue>
          <labValue>...</labValue>
          <dataRef ind="normAll" />
          <propRel/>
        </valueSet>
      </property>
      <property id="p2" source="raw" cat="descr">
        <name>height</name>
        <valueSet id="i_i1_s3">
          <rawValue>0 0 1 ad </rawValue>
          <labValue>
            <val>429</val>
            <type>uint32</type>
          </labValue>
          <dataRef ind="normAll" />
          <propRel/>
        </valueSet>
      </property>
      <property id="p3" source="raw" cat="descr">
        <name>imageType</name>
        .....
  13. 13. <request2>
      <measurementRequest>
        <source name="XCDL1.xml"/>
        <target name="XCDL2.xml"/>
        <property id="45" name="rgbPalette">
          <metric id="10" name="hammingDistance"/>
        </property>
        <property id="300" name="normData">
          <metric id="10" name="hammingDistance"/>
          <metric id="50" name="RMSE"/>
        </property>
        <property id="2" name="imageHeight" unit="pixel">
          <metric id="200" name="equal"/>
          <metric id="201" name="intDiff"/>
          <metric id="210" name="percDev"/>
        </property>
        <property id="30" name="imageWidth" unit="pixel">
          <metric id="200" name="equal"/>
          <metric id="201" name="intDiff"/>
          <metric id="210" name="percDev"/>
        </property>

      <property id="2" name="imageHeight" unit="pixel" compStatus="complete">
        <values type="int">
          <src>32</src>
          <tar>32</tar>
        </values>
        <metric id="200" name="equal" result="true"/>
        <metric id="201" name="intDiff" result="0"/>
        <metric id="210" name="percDev" result="0.000000"/>
      </property>
  14. 14. Abstract solution I (1) Language to represent the complete content of a digital object. XCDL (2) Language to describe any machine readable format in a formal language. XCEL (3) Software to extract the content of a file based upon a description as under (2) and express it in the language as specified under (1). “extractor” (4) Software to compare two such content descriptions. “comparator”
  15. 15. Are the following two items equal: VIII  8
  16. 16. eight eight VIII  8
  17. 17. otto eight eight otto VIII  8
  18. 18. otto acht eight eight otto VIII  8 acht
  19. 19. 8.0 otto acht eight eight otto VIII  8 acht
  20. 20. Information model: „an image“ otto acht eight eight otto VIII  8 acht
  21. 21. information model: „an image“ format ontology: „what terms are used in formats to describe image properties“ VIII  8
  22. 22. Information model: „what is an image“ Format ontology: „what terms are used in formats to describe image properties“ Extraction language: “how to get the terms describing an image out of a file”
  23. 23. Abstract solution II (1) A theoretical model of information (not: data) types – “image”, “text”, “audio” ... (2) Ontologies which map existing file format terminologies onto this model. (3) A language – XCDL – which allows the content of files in different formats to be expressed using the vocabulary of the ontologies and the “grammar” of the information model.
  24. 24. XCDL eXtensible Characterisation Definition Language Purpose: Describe the contents of a file in terms of an abstract model.
  25. 25. XCDL: text model (1) A text (= <object>) is composed of data (= <normData>) plus interpretations of data according to the underlying format specification (= <property>).
  26. 26. XCDL: text model (2) Or, one level of abstraction higher, a text is composed of content-carrying tokens, accompanied by rendering info plus deployment info plus historical info.
  27. 27. This text is a
      <refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
      …
      <property>
        <name>fontsize</name>
        <rawVal>
          <val>48</val>
          <type>unsignedInt8</type>
        </rawVal>
        <dataRef>
          <!-- property refers to discrete part of reference data-->
          <ref id="1" start="0" end="3"/>
          <ref id="1" start="10" end="12"/>
        </dataRef>
      </property>
  28. 28. This text is a
      <refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
      …
      <property>
        <name>fontsize</name>
        <rawVal>
          <val>48</val>
          <type>unsignedInt8</type>
        </rawVal>
        <dataRef>
          <!-- property refers to discrete part of reference data-->
          <ref id="1" start="0" end="3"/>
          <ref id="1" start="10" end="12"/>
        </dataRef>
      </property>
  29. 29. Thank you! Questions? Manfred.thaller@uni-koeln.de
