Trm 02 10 07vilnius
DPE Training materials

Trm 02 10 07vilnius Presentation Transcript

  • File formats and registries. Manfred Thaller, University at Cologne, October 2nd, 2007
    • PART I – Formats and Registries
    • EXERCISE I – Evaluate some
    • PART II – Formats in PLANETS
    • EXERCISE II – A bit of modelling
  • An image
  • An image 6 rows 5 columns
  • 5 rows 6 columns
  • An image 1 == yellow 0 == red 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image 1 == violet 0 == green 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1 Uncompressed 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1 (Compressed)Run Length Encoded 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
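The run-length encoded sequence on this slide can be reproduced with a short sketch. The function names `rle_encode` and `rle_decode` are illustrative and not part of the slides; the pair order (run length first, then pixel value) is read off the stored sequence above.

```python
from itertools import groupby


def rle_encode(pixels):
    """Encode a flat pixel sequence as alternating run length, value numbers."""
    encoded = []
    for value, run in groupby(pixels):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Expand alternating run length, value numbers back into pixels."""
    pixels = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        pixels.extend([value] * count)
    return pixels


# The 30 pixel values of the example image (6 rows of 5 columns).
PIXELS = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

ENCODED = rle_encode(PIXELS)
print(ENCODED)
print(len(PIXELS), "values uncompressed,", len(ENCODED), "values run-length encoded")
```

Decoding the encoded list gives back the original 30 values, so this compression loses nothing.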
  • An image Store: SetSize: 5 by 6 SetBackgroundColor: Blue SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T 1,1 2,1 3,1 4,1 5,1 1,2 2,2 3,2 4,2 5,2 1,3 2,3 3,3 4,3 5,3 1,4 2,4 3,4 4,4 5,4 1,5 2,5 3,5 4,5 5,5 1,6 2,6 3,6 4,6 5,6
  • An image 6 rows 5 columns 1 == yellow 0 == red Uncompressed
  • An image dimensions 1 == yellow 0 == red Uncompressed
  • An image dimensions photometric interpretation Uncompressed
  • An image dimensions photometric interpretation compression
  • An image <basic information> <rendering information> <storage information>
  • An image <basic information> (implicit / explicit) <rendering information> (implicit / explicit) <storage information> (implicit / explicit) … and the data?
  • An image Data either as data stream 1,1,1,1,1,1, 0,0,0,1,1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • An image Data either as data stream or as processing instructions SetSize: 5 by 6 SetBackgroundColor: Yellow SetForegroundColor: Red SetLetterHeight: 4 MoveTo: 3,5 DrawLetter: T 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1
  • File format <basic information> <rendering information> <storage information> <data>
  • File format <basic information> What to do? <rendering information> <storage information> <data>
  • File format <basic information> What to do? <rendering information> How to do it? <storage information> <data>
  • File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data>
  • File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy?
  • File format <basic information> Mandatory <rendering information> Useful <storage information> Historical <data> Mandatory
  • File format A deterministic specification of how the properties of a digital object can be reversibly converted into a linear bytestream (bitstream).
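A minimal sketch of what "reversibly converted into a linear bytestream" can mean in practice. The toy layout used here (one byte for rows, one byte for columns, then one byte per pixel) is a hypothetical format invented for this illustration and does not correspond to any real file format.

```python
import struct


def to_bytestream(rows, columns, pixels):
    """Serialise a tiny image: 2-byte header (rows, columns) followed by pixel bytes."""
    if len(pixels) != rows * columns:
        raise ValueError("pixel count does not match the dimensions")
    header = struct.pack(">BB", rows, columns)
    return header + bytes(pixels)


def from_bytestream(stream):
    """Parse the toy format back into (rows, columns, pixels)."""
    rows, columns = struct.unpack(">BB", stream[:2])
    pixels = list(stream[2:2 + rows * columns])
    return rows, columns, pixels


PIXELS = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

STREAM = to_bytestream(6, 5, PIXELS)
ROUND_TRIP = from_bytestream(STREAM)
print(len(STREAM), "bytes;", "round trip ok:", ROUND_TRIP == (6, 5, PIXELS))
```

The header carries the basic information (dimensions); the colour mapping stays implicit, which is exactly the kind of knowledge a format specification has to record.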
  • File format: TIFF
  • File format: PDF 1 0 obj << /Type /Page /Parent 281 0 R /Resources 2 0 R /Contents 3 0 R /StructParents 2 /MediaBox [ 0 0 612 792 ] /CropBox [ 0 0 612 792 ] /Rotate 0 >> endobj
  • File format: PDF 2 0 obj << /ProcSet [ /PDF /Text ] /Font << /TT2 292 0 R /TT4 288 0 R >> /ExtGState << /GS1 300 0 R >> /ColorSpace << /Cs6 289 0 R >> >> endobj
  • File format: PDF 3 0 obj << /Length 4605 /Filter /FlateDecode >> stream H‰„WÛŽÛÈ}×Wô#Œ4jR”¨`±Àø ™Í&quot; ¶(²5j›&quot;¹lräý‘|oêÖ-j —‹ udTÙÂ…fPnˆ¿ìþ>Ó›Ež²ÝÕ˽âä”uª2i*<<v ú[Óžk9Q‰¼‡x»XTP{ ‹ ±/[i²½Ö)}ÔÏö&ªÙH;<Cµ … and about 4000 bytes more ŠøL&quot;È÷ےƐ¬JYØÂm]j¥Ýqõ¥ÏººÕ™·²ôÒ·Ûº¤–÷.u-kP0 4“øTxM<é識9uôøˆòLi¦ØoTÖ m–;ǯ÷¤ÿlÕºvéU—Ë ±¤Lm°gŸˆu1Åëu5l3¯’¢O %òËTîü7?ìNdh endstream endobj
  • File format: XML (here: SVG) <?xml version="1.0" encoding="UTF-16"?> <svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ... <svg:rect x="0" y="0" width="800" height="1000" fill="white" /> <svg:g transform="translate(-140,0)"> <svg:line x1="600" y1="20" x2="500" y2="20" stroke="black" … <svg:text x="600" y="28.8" font-size="6" fill="black" … </svg:g> <svg:g transform="translate(-140,0)"> <svg:text x="500" y="24.4"> <svg:tspan font-size="4" fill="black">Leiste</svg:tspan> </svg:text> </svg:g> <svg:defs> <svg:g id="halbeSaeuleLeiste0">
  • File format: XML (here SVG)
  • File format: XML (ETH: “column XML”) <?xml version="1.0" encoding="UTF-8"?> <Autor name="Vitruv"> <Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn="" … <Element name="Gebaelk" original="" THz="" THn="" MH="" … <Element name="Gesims" original="corona" THz="" THn="" MH="" … <Element name="Leiste" original="" THz="" THn="" MH="0.03" … <Element name="Kyma" original="sima" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <Element name="Platte" original="corona" THz="" THn="" … <Element name="Leiste" original="" THz="" THn="" MH="0.017" … <Element name="Kyma_reversa" original="cymatium" THz="" … <hElement name="Band" typ="1" dx="0.048" r="0.019"/> <hElement name="Band" typ="1" dx="0.048" r="0.019"/> </Element>
  • Files and Preservation
    • Bit rot.
    • Obsolescence of software.
  • Bit rot An image file before …
  • Bit rot ... and after one byte is changed.
  • Bit rot ... and after one byte is changed. Undetectable by software.
  • Bit rot Processing dictionary Payload 002 004 234 123 234 156 127 178 221 221
  • Bit rot One byte is damaged, one byte cannot be displayed correctly. 002 004 234 123 234 156 127 xxx 221 221
  • Bit rot One byte is damaged, ten bytes cannot be displayed correctly. 002 xxx 234 123 234 156 127 178 221 221
  • Result: http://www.cflr.beniculturali.it/Progetti/Fixit.php www.cflr.beniculturali.it Franco Liberati [email_address] Università di Roma “La Sapienza” Dipartimento Informatica Centro Fotoriproduzione Legatoria e Restauro Paolo Buonora [email_address]
  • Paolo on JPEG JPEG2000 is more robust against bit rot than TIFF.
  • Paolo on JPEG JPEG2000 is more robust against bit rot than TIFF. So, to stimulate more empiricism …
  • Obsolescence
    • Software able to read the format no longer exists.
    • Format specification lost.
    • Implied algorithm lost.
    • Required object lost.
  • Recommended formats: text (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • Plain text (encoding: ISO8859-1 - 9 , UTF-8, UTF-16 with BOM)
    • XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)
    • PDF/A-1 (ISO 19005-1)
    • Cascading Style Sheets (*.css)
    • DTD (*.dtd)
    • PDF (*.pdf) (embedded fonts)
    • Rich Text Format 1.x (*.rtf)
    • HTML 4.x (include a DOCTYPE declaration)
    • SGML (*.sgml)
    • Open Office (*.sxw/*.odt)
    • Office Open XML (*.docx)
    • PDF (*.pdf) (encrypted)
    • Microsoft Word (*.doc)
    • WordPerfect (*.wpd)
    • DVI (*.dvi)
    • All other text formats not listed here
  • Recommended formats: bitmap / raster image (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • TIFF (uncompressed)
    • PNG (*.png)
    • BMP (*.bmp)
    • JPEG/JFIF (*.jpg)
    • JPEG2000 (prefer lossless or uncompressed) (*.jp2)
    • TIFF (compressed)
    • GIF (*.gif)
    • MrSID (*.sid)
    • TIFF (in Planar format)
    • FlashPix (*.fpx)
    • PhotoShop (*.psd)
    • All other raster image formats not listed here
  • Recommended formats: vector graphics (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • SVG 1.1 (no Java binding) (*.svg)
    • Computer Graphic Metafile (CGM, WebCGM) (*.cgm)
    • Encapsulated Postscript (EPS)
    • Macromedia Flash (*.swf)
    • All other vector image formats not listed here
  • Recommended formats: audio (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • AIFF (PCM) (*.aif, *.aiff)
    • WAV (PCM) (*.wav)
    • SUN Audio (uncompressed) (*.au)
    • Standard MIDI (*.mid, *.midi)
    • Ogg Vorbis (*.ogg)
    • Free Lossless Audio Codec (*.flac)
    • Advanced Audio Coding (*.mp4, *.m4a, *.aac)
    • MP3 (MPEG-1/2, Layer 3) (*.mp3)
    • AIFC (compressed) (*.aifc)
    • NeXT SND (*.snd)
    • RealNetworks 'Real Audio' (*.ra, *.rm, *.ram)
    • Windows Media Audio (*.wma)
    • WAV (compressed) (*.wav)
    • All other audio formats not listed here
  • Recommended formats: video (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • Motion JPEG 2000 (ISO/IEC 15444-4) (*.mj2)
    • AVI (uncompressed) (*.avi)
    • QuickTime Movie (uncompressed) (*.mov)
    • Motion JPEG (*.avi, *.mov)
    • Ogg Theora (*.ogg)
    • MPEG-1, MPEG-2 (*.mpg, *.mpeg)
    • MPEG-4 (*.mp4)
    • AVI (compressed) (*.avi)
    • QuickTime Movie (compressed) (*.mov)
    • RealNetworks 'Real Video' (*.rv)
    • Windows Media Video (*.wmv)
    • All other video formats not listed here
  • Recommended formats: “data base” (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • Delimited Text (*.txt, *.csv)
    • SQL DDL
    • DBF (*.dbf)
    • OpenOffice (*.sxc/*.ods)
    • Office Open XML (*.xlsx)
    • Excel (*.xls)
    • All other spreadsheet/ database formats not listed here
  • Recommended formats: 3D (“virtual reality”) (source: http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf; columns: High confidence, Medium confidence, Low confidence)
    • X3D (*.x3d)
    • VRML (*.wrl, *.vrml)
    • U3D (Universal 3D file format)
    • All other virtual reality formats not listed here
  • What kind of file is this?
    • Two ways to identify a file:
    • By extension.
    • By internal characteristics („magic number“, „signature“).
  • What kind of file is this?
    • Two ways to identify a file:
    • By extension.
    • „Each file ending with *.doc is an MS Word document“
  • What kind of file is this? Two ways to identify a file: (b) By internal characteristics („magic number“, „signature“). A TIFF file begins with: Bytes 0-1: the byte order used within the file; legal values are “II” (4949.H) / “MM” (4D4D.H). Bytes 2-3: an arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.
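The TIFF signature described on this slide can be checked in a few lines. This is a sketch only: the function name `looks_like_tiff` is illustrative, and a matching signature says nothing about whether the rest of the file is valid.

```python
import struct


def looks_like_tiff(header: bytes) -> bool:
    """Check bytes 0-3: byte-order mark 'II' or 'MM', then the number 42."""
    if len(header) < 4:
        return False
    byte_order = header[:2]
    if byte_order == b"II":      # 4949.H: least significant byte first
        fmt = "<H"
    elif byte_order == b"MM":    # 4D4D.H: most significant byte first
        fmt = ">H"
    else:
        return False
    (magic,) = struct.unpack(fmt, header[2:4])
    return magic == 42


LITTLE_ENDIAN_TIFF = b"II\x2a\x00"
BIG_ENDIAN_TIFF = b"MM\x00\x2a"
PDF_START = b"%PDF"

for sample in (LITTLE_ENDIAN_TIFF, BIG_ENDIAN_TIFF, PDF_START):
    print(sample, looks_like_tiff(sample))
```

In practice the four bytes would come from something like `open(path, "rb").read(4)`. Unlike the extension, this check looks at the file content itself.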
  • What kind of file is this?
    • The need to identify files led to two developments:
    • „Clever software“ – inspects files to decide how to process them.
    • MIME Types.
    • FORMAT registries.
  • What kind of file is this? The following 4 transparencies are a quotation from http://hul.harvard.edu/gdfr (see below).
  • Why Do We Need a Registry?
    • Repository functions are performed on a format-specific basis
    • Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
    • Interchange requires mutual agreement of format syntax and semantics
    Global Digital Format Registry DSpace User Group, March 2004
  • Potential Use Cases
    • Identification
      • “I have a digital object; what format is it?”
    • Validation
      • “I have an object purportedly of format F; is it?”
    • Transformation
      • “I have an object of format F, but need G; how can I produce it?”
    • Characterization
      • “I have an object of format F; what are its significant properties?”
    • Risk assessment
      • “I have an object of format F; is it at risk of obsolescence?”
    • Delivery
      • “I have an object of format F; how can I render it?”
  • Repository Format Dependencies Using the OAIS Reference Model
  • What’s Wrong with MIME Types?
    • Insufficient depth of detail
      • No requirements regarding syntax and semantic description
      • No requirement for complete disclosure, especially of proprietary formats
    • Insufficient granularity
      • Both tiled RGB GeoTIFF with LZW and striped bi-tonal TIFF-FX with Group 4 are typed as “image/tiff”
      • All of PDF 1.0 – 1.4, PDF/X-1, X-2, X-3, and PDF/A are typed as “application/pdf”
      • These variants might require radically different workflows
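The granularity problem is easy to observe with Python's standard `mimetypes` module, which maps file names to MIME types. The file names below are hypothetical and only stand in for the variants named above; no files are read.

```python
import mimetypes

# Hypothetical file names for very different format variants.
FILES = [
    "geotiff_rgb_tiled_lzw.tif",
    "tiff_fx_bitonal_group4.tif",
    "report_pdf_1_4.pdf",
    "report_pdf_a.pdf",
]

# guess_type returns (type, encoding); only the type is of interest here.
MIME_TYPES = {name: mimetypes.guess_type(name)[0] for name in FILES}

for name, mime_type in MIME_TYPES.items():
    print(f"{name:32} {mime_type}")
```

Both TIFF variants come out with one label and both PDF variants with another, so a workflow that depends on the variant needs more information than the MIME type provides.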
  • File format registries - URLs. PRONOM: http://www.nationalarchives.gov.uk/pronom/ (does not rely only on extensions). Global Digital Format Registry: http://hul.harvard.edu/gdfr (predominantly project description). FileExt: http://filext.com (predominantly links to software).
  • Exercise I: A few experiments. Group 1: Aistė Abromaitytė, Tomasz Jablonski, Aadi Kaljuvee, Juratė Kuprienė, Violeta Meiliūnaitė
  • Exercise I: A few experiments. Group 2: Libor Coufal, Edvardas Germanas, Hamid Rofoogaran, Laima Šiudikiene, Eglė Žvinytė
  • Exercise I: A few experiments. Group 3: Renata Balandienė, Thomas Guignard, Edgars Jekabsons, Elona Malaiškienė, Bjorn Ragnolf Ronning
  • Exercise I: A few experiments. Group 4: Gražina Deveikytė, Raimondas Malaiška, Filip Kwiatek, Marija Prokopčik, Piret Randmae, Jelena Saikovič
  • PART II – Formats in PLANETS: File characteristics
    • Based on two formal languages:
    • eXtensible Characterisation Extraction Language (= XCEL)
    • eXtensible Characterisation Description Language (= XCDL)
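    • [Diagram: a file as of 2007 and the same file as of 2017, after the “Tooth of Time”, each pass through an Extractor that uses the format specified in XCEL; a Comparer evaluates the resulting XCDL 2007 and XCDL 2017 descriptions. Figure shown: 0,99%.]
    • [Diagram: a tiff file and the png file produced from it by a Migrator each pass through an Extractor (tiff XCEL, png XCEL, ... XCEL); a Comparer evaluates the tiff XCDL and png XCDL descriptions. Figure shown: 0,93%.]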
    • <XCELDocument ...> ...
    • <formatDescription>....
    • <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
    • <range>
    • <startposition xsi:type="sequential"> </startposition>
    • <length xsi:type="fixed">4</length></range>
    • <name> height </name>
    • </symbol>
    • <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
    • <range>
    • <startposition xsi:type="sequential"> </startposition>
    • <length xsi:type="fixed">1</length></range>
    • <valueInterpretation>
    • <valueLabel>greyscale</valueLabel>
    • <value>0</value>...
    • <name> imageType </name>
    • </symbol>
    • <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
    • <range>
    • <startposition xsi:type="sequential"> </startposition>
    • <length xsi:type="fixed">1</length></range>
    • <valueInterpretation>
    • <valueLabel>zlibDeflateInflate</valueLabel>
    • <value>0</value></valueInterpretation>
    • <name> compression </name>
    • </symbol>...
    <xcdl> <object id="o1" > <normData id="nd1" > ... </normData> <property id="p1" source="raw" cat="descr" > <name> compression </name> <valueSet id="i_i1_s6" > <rawValue>0 </rawValue> <labValue>...</labValue> <dataRef ind="normAll" /> <propRel/> </valueSet> </property> <property id="p2" source="raw" cat="descr" > <name> height </name> <valueSet id="i_i1_s3" > <rawValue>0 0 1 ad </rawValue> <labValue> <val>429</val> <type>uint32</type> </labValue> <dataRef ind="normAll" /> <propRel/> </valueSet> </property> <property id="p3" source="raw" cat="descr" > <name> imageType </name> .....
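How the raw value `0 0 1 ad` of the `height` property relates to its labelled value 429 can be sketched as follows. The sketch assumes that the raw value lists the four bytes in hexadecimal, most significant byte first; that reading is consistent with the uint32 value 429 in the example but is not stated on the slide. The function name is illustrative.

```python
def raw_to_uint(raw_value: str) -> int:
    """Interpret space-separated hexadecimal bytes as a big-endian unsigned integer."""
    data = bytes(int(token, 16) for token in raw_value.split())
    return int.from_bytes(data, byteorder="big")


HEIGHT = raw_to_uint("0 0 1 ad")
print(HEIGHT)
```

Keeping both values side by side lets a comparer work on the interpreted value while the raw bytes remain available for checking.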
  • Confession
  • Confession Computer science does not really know what information is.
  • Computer science does not really know what information is. It is pretty good at representing and processing it, though.
  • Representations & migrations III == 3 == γ΄ == ●●● Four representations of the idea / concept / model “three”.
  • Representations & migrations I divided by III == 1 / 3 == 0.3333? I divided by III == 1 / 3 == 0.3 periodic. Some ideas are handled more precisely by some thinkers than by others.
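The same point can be made with two number representations from the Python standard library. This is an illustration only; the four-digit precision is an arbitrary choice to mirror the truncated decimal on the slide.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# A rational representation holds one third exactly.
EXACT_THIRD = Fraction(1, 3)

# A decimal representation limited to four significant digits cannot.
with localcontext() as ctx:
    ctx.prec = 4
    ROUNDED_THIRD = Decimal(1) / Decimal(3)
    ROUNDED_SUM = ROUNDED_THIRD * 3

print(EXACT_THIRD, "* 3 =", EXACT_THIRD * 3)
print(ROUNDED_THIRD, "* 3 =", ROUNDED_SUM)
```

Multiplying back by three recovers 1 only in the representation that could hold the value precisely in the first place.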
  • Representations & migrations 48 bit images on 24 and on 48 bit graphics cards. Some data is processed more adequately by some equipment than by others.
  • Representations & migrations A model for information before and after a migration must therefore potentially represent all information there, irrespective of the possibility to process it in a given environment.
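A small sketch of why the model has to keep the information that a given environment cannot process. It assumes a 48 bit pixel made of three 16 bit channels that is reduced to 8 bits per channel for a 24 bit display; the helper names and sample values are illustrative.

```python
def to_8bit(channel16: int) -> int:
    """Keep only the most significant 8 bits of a 16 bit channel."""
    return channel16 >> 8


def to_16bit(channel8: int) -> int:
    """Stretch an 8 bit channel back to the 16 bit range (0..65535)."""
    return channel8 * 257


PIXEL_48 = (0x1234, 0xABCD, 0xFFFF)
PIXEL_24 = tuple(to_8bit(c) for c in PIXEL_48)
RESTORED_48 = tuple(to_16bit(c) for c in PIXEL_24)

print([hex(c) for c in PIXEL_48])
print([hex(c) for c in PIXEL_24])
print([hex(c) for c in RESTORED_48])
```

Once only the 24 bit rendering is stored, the original 48 bit values cannot be recovered, even though they were part of the object before the migration.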
  • XCEL / XCDL Languages are being processed … … development focus currently: dynamic handling of format specific algorithms.
  • XCEL / XCDL: image model (1) A pixel cube … Each pixel: MSB (channel 1), … LSB (channel 1), … MSB (channel n), … LSB (channel n), MSB (aux 1), … LSB (aux 1), … MSB (aux m), … LSB (aux m)
  • XCEL / XCDL: image model (2) A pixel cube … Accompanied by rendering info plus deployment info.
  • XCEL / XCDL: image model - example <property id="p4" source="raw" cat="descr" > <name>imageType</name> <valueSet id="i_i1_s5" > <rawValue>2</rawValue> <labValue> <val>truecolour</val> <type>fixedLabel</type> </labValue> <dataRef ind="normAll" /> <propRel/> </valueSet> </property>
  • XCEL / XCDL: text model A text (= <object>) is composed of - data (<normData>) plus - interpretations of data according to the underlying format specification (=properties; <property>).
  • XCEL / XCDL: text model - example This is a text <refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData> … <property> <name>fontsize</name> <rawVal> <val>00 18</val> <type>unsignedInt8</type> </rawVal> <dataRef> <!-- property refers to discrete part of reference data --> <ref id="1" start="0" end="3"/> <ref id="1" start="10" end="12"/> </dataRef> </property>
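How a property points into the reference data can be sketched as follows. The sketch assumes that `start` and `end` are zero-based character positions and that `end` is inclusive; the slide does not state this, and the function names are illustrative.

```python
REF_DATA = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"


def decode_ref_data(hex_string: str) -> str:
    """Decode the space-separated hexadecimal bytes of <refData> as ASCII text."""
    return bytes.fromhex(hex_string).decode("ascii")


def resolve_ref(text: str, start: int, end: int) -> str:
    """Return the part of the text a <ref> points to (end assumed inclusive)."""
    return text[start:end + 1]


TEXT = decode_ref_data(REF_DATA)
FIRST_REF = resolve_ref(TEXT, 0, 3)
SECOND_REF = resolve_ref(TEXT, 10, 12)

print(repr(TEXT))
print(repr(FIRST_REF), repr(SECOND_REF))
```

The property (here the font size) is stored once and applies to every referenced stretch of the data, so the data itself stays free of format-specific markup.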
  • Exercise II: Abstract modelling. Group 1: maps. Group 2: music. Group 3: Excel sheets. Group 4: „books“ … ever heard of FRBR?