Characterisation - 101. An introduction to the identification and characterisation of file formats – SCAPE Training event, Guimarães 2012

This is an introduction to the identification and characterisation of file formats and to the tools that can be used for it. The introduction was given by Carl Wilson of the Open Planets Foundation at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

  1. SCAPE Characterisation - 101
     An introduction to the identification and characterisation of file formats.
     Carl Wilson, Open Planets Foundation
     SCAPE Training, Guimarães
     This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  2. About Us
     • Carl Wilson
       Open Planets Foundation
       carl@openplanetsfoundation.org
       http://www.openplanetsfoundation.org
     • SCAPE Project
       EU-funded research project
       SCAlable Preservation Environments
       http://www.scape-project.eu
  3. About You
     • Once Around The Room
       • Name
       • Where you work
       • What you do
       • Why you’re here
     • DO Ask Questions
       • Or tell me to slow down…
       • Or ask me to repeat something…
  4. File Formats
     • What is a File Format?
       • A “standard” method of encoding data for storage.
       • May follow an open specification
       • OR a proprietary one (open is preferred)
       • Or simply follow a loosely documented convention
  5. Who Cares About Formats?
     • Operating Systems: in order to open a file with an application that can interpret/render it.
     • Web Servers: to negotiate Content-Type in HTTP requests
     • Memory Institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date.
     • More Generally: everyone with digital content, whether they know it or not.
  6. Some Uses of Format Information
     • Format Information:
       • Associates a file with software that can interpret and/or render its contents
       • Can be used to find documentation / specifications to help interpret a file’s contents
       • Is a first step in preservation planning: knowing what you have…
  7. File Name Extension
     • A file name suffix, separated from the file base name by a dot “.”.
     • Examples: .pdf, .txt, .jpg, .doc, .docx
     • This has worked for a number of years BUT
       • Any user with the right permissions can change a file extension
       • Bytes aren’t always transferred with a name
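
The weakness of extension-based identification can be sketched in a few lines of Python. The lookup table and the function name `guess_by_extension` are hypothetical, chosen for illustration only; they are not taken from any tool covered in this talk.

```python
import os.path

# Hypothetical lookup table for illustration; real systems use far larger ones.
EXTENSION_FORMATS = {
    ".pdf": "PDF document",
    ".txt": "Plain text",
    ".jpg": "JPEG image",
    ".doc": "Microsoft Word document",
    ".docx": "Microsoft Word (Office Open XML) document",
}


def guess_by_extension(file_name):
    """Return a format label based only on the file name suffix, or None."""
    _base, ext = os.path.splitext(file_name)
    return EXTENSION_FORMATS.get(ext.lower())


# Renaming the file changes the answer although the bytes are unchanged,
# and a nameless byte stream gives no answer at all.
print(guess_by_extension("report.pdf"))
print(guess_by_extension("report.txt"))
print(guess_by_extension("report"))
```

Nothing in this approach ever reads the file's contents, which is why a rename or a missing name defeats it.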
  8. Internet Media (MIME) Types
     • The format identifiers used by the web
     • Examples:
       • text/plain
       • text/html
       • image/jpeg
     • Don’t readily hold extra information such as format version, but may be extended.
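
As a small illustration, Python's standard `mimetypes` module maps file names to MIME types. Note that it works from the file name alone, so it inherits the extension problems from the previous slide, and that the returned type carries no format version. The file names below are made up for the example.

```python
import mimetypes

# guess_type() inspects only the name; it never opens the file.
for name in ("notes.txt", "index.html", "photo.jpg", "paper.pdf"):
    mime_type, _encoding = mimetypes.guess_type(name)
    print(name, "->", mime_type)
```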
  9. Apple’s Alternatives
     • Pre-OS X versions of Mac OS used Creator and Type codes
       • Creator: The software that created the file
       • Type: The type of information, e.g. TEXT
       • More flexible than extensions, but no longer used
     • Recent OS X versions also use Uniform Type Identifiers
  10. PRONOM Unique Identifiers or PUIDs
      • PRONOM is a web-based registry of file format information
      • Created and hosted by the National Archives of the UK in 2002
      • Uses PUIDs to identify file formats:
        • fmt/15 == Acrobat PDF 1.1
        • fmt/16 == Acrobat PDF 1.2
        • fmt/17 == Acrobat PDF 1.3
  11. The Unix File Utility
      • A standard Unix program for identifying the data in a file.
      • First released in 1973; written in C, so it must be compiled for each operating system
      • The open source version used in Linux distributions was written in 1986
      • Identification is based upon compiled “magic” files
      • Provides text information about files, or MIME types with the right options
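
A hedged sketch of calling the utility from Python, assuming the open source `file` command is installed and on the PATH. The `--brief` and `--mime-type` options are the “right options” for getting a bare MIME type; the helper names `file_command` and `run_file` are hypothetical.

```python
import shutil
import subprocess
import sys


def file_command(path, mime=False):
    """Build the argument list for the `file` utility."""
    args = ["file", "--brief"]        # --brief: do not prefix output with the file name
    if mime:
        args.append("--mime-type")    # print a MIME type instead of a text description
    args.append(path)
    return args


def run_file(path, mime=False):
    """Run `file` if it is installed; return its output, or None if it is missing."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        file_command(path, mime), capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


# Describe the running Python interpreter, first as text, then as a MIME type.
print(run_file(sys.executable))
print(run_file(sys.executable, mime=True))
```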
  12. FIDO
      • Format Identification of Digital Objects
      • Open source format identification tool
      • Based upon the PRONOM signature data, compiled to regular expressions
      • Written in Python, so it can be run on different operating systems
      • Richer command line syntax than DROID
  13. Apache Tika
      • Open source toolkit for detecting and extracting metadata and structured text from files
      • Performs format identification and deeper characterisation (more on that later).
      • Java based, so it will run on different platforms.
      • Returns MIME types as format identifiers
  14. How Do These Tools Identify Formats?
      • They exploit “common features” of the format.
      • PDF start of file:
        • %PDF-1.1 PDF Version 1.1
        • %PDF-1.2 PDF Version 1.2
        • %PDF-1.6 PDF Version 1.6
      • Tika and File simply look for files starting with the string %PDF- and return the MIME type
      • FIDO, however…
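
The “starts with %PDF-” check can be sketched as follows. This is a simplified illustration of the idea, not the actual code of Tika or File; the function names are hypothetical, and the three-character version slice assumes a single-digit major and minor version as in the examples above.

```python
PDF_MAGIC = b"%PDF-"


def sniff_mime_type(data):
    """Return 'application/pdf' if the bytes start with the PDF magic string."""
    if data.startswith(PDF_MAGIC):
        return "application/pdf"
    return None


def pdf_header_version(data):
    """Return the version text that follows '%PDF-', e.g. '1.6', or None."""
    if not data.startswith(PDF_MAGIC):
        return None
    start = len(PDF_MAGIC)
    return data[start:start + 3].decode("ascii", errors="replace")


sample = b"%PDF-1.6\n...rest of file..."
print(sniff_mime_type(sample))       # application/pdf
print(pdf_header_version(sample))    # 1.6
print(sniff_mime_type(b"GIF89a"))    # None
```

The MIME type is the same for every PDF version; the version only becomes visible if the tool reads past the magic string.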
  15. FIDO & PDF Identification
      • FIDO identifies the different PDF versions, each of which has a PUID
      • FIDO also looks for an END OF FILE marker for PDFs: .%%EOF.
      • This could be a problem…
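
A minimal sketch of version-aware identification that also requires an end-of-file marker. The two regular expressions are illustrative assumptions and are NOT FIDO's real PRONOM-derived signatures; the PUID table holds only the three entries listed on the PRONOM slide. The last call shows one plausible reading of the problem hinted at above: bytes after the marker make the match fail.

```python
import re

# PUIDs as listed on the PRONOM slide; other PDF versions are omitted here.
PDF_PUIDS = {"1.1": "fmt/15", "1.2": "fmt/16", "1.3": "fmt/17"}

# Illustrative patterns only; these are not FIDO's actual signatures.
HEADER = re.compile(rb"\A%PDF-(\d\.\d)")
# The marker must be the last thing in the file, apart from line-ending bytes.
TRAILER = re.compile(rb"%%EOF[\r\n]*\Z")


def identify_pdf(data):
    """Return a PUID if both the header and the end-of-file marker match."""
    header = HEADER.search(data)
    if header is None:
        return None
    if TRAILER.search(data) is None:
        return None  # header matched, but no marker at the very end
    return PDF_PUIDS.get(header.group(1).decode("ascii"))


good = b"%PDF-1.2\n...body...\n%%EOF\n"
padded = good + b"\x00" * 16  # same PDF with trailing padding bytes

print(identify_pdf(good))     # fmt/16
print(identify_pdf(padded))   # None
```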
