This presentation by Anna-Grit Eggers (University of Goettingen) offers some guidance on the use of Information Encapsulation techniques.
PERICLES is a four-year Integrated Project (2013-2017) funded by the European Union under its Seventh Framework Programme (ICT Call 9).
http://pericles-project.eu/
PERICLES - Choice of Information Encapsulation (IE) Technique
1. GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics
[Digital Preservation]
“This project has received funding from the European Union’s Seventh
Framework Programme for research, technological development and
demonstration under grant agreement no. 601138”.
Choice of IE Technique
Anna-Grit Eggers (University of Goettingen)
3. • IE techniques cover a wide range of uses which differ regarding the:
•processing velocity
•required disk space
•location of storage
•accessibility and perceptibility of the payload (by human / by machine analysis)
•preservation level of the carrier / digital object
•processability of digital object and payload file formats and file sizes
•provided compression mechanisms.
Range of uses
4. • We identified criteria to distinguish between the techniques.
Criteria
• An encapsulation technique that fits a specific use
scenario can be chosen based on the technique-specific
characteristics of these criteria.
• Definition of criterion in this context:
A property or feature of information encapsulation techniques that can be
used to compare different techniques on the basis of the criterion
characteristics.
• For example:
•Robustness of encapsulated information after encapsulation with an algorithm
towards processing of the carrier.
•Perceptibility of the encapsulated information by an observation of the carrier.
5. Examples of characteristics for this example:
• The characteristic of a technique for the criterion “Robustness” can be:
• “robust” (“true”)
• “not robust” (“false”)
• The characteristic of a technique for the criterion “Perceptibility” can be:
• “visible”
• “not perceptible by humans”
• “detectable by computers”
• “not perceptible at all”.
Criterion characteristics
• The assignment for this criterion is harder, because the threshold for
“human perceptibility” or “computer detectability” is blurred and not a
“true/false” value.
8. • IE techniques are used for a specific purpose in the context of a use
scenario.
• This usage scenario defines features and characteristics that are desired to
be fulfilled by a potential IE technique.
• The overall task is to find the best IE technique by capturing and evaluating
the scenario defined by the user.
• The user could aim for encapsulating messages or metadata with digital
files.
• The aim could also be to add legal information, ownership information,
corporate designs, or information to ensure the authenticity of a digital
object.
Creating a scenario
9. • Some need a visible payload, others prefer to hide it.
• The most valuable information might be the digital object or the payload
itself (this is often the case with steganographic messages).
Creating a scenario (cont.)
10. • Three procedures are required to implement the scenario:
•scenario capturing
•weighting
•decision calculation mechanism
Implementing a scenario
• Capturing:
a questionnaire which requests the importance of a set of scenario criteria
for the given scenario. The criteria are chosen in a way that they can be
mapped to the features of the IE techniques.
• The number of investigated criteria correlates with the number of available
IE techniques: it should be high enough to distinguish between
all main techniques, but low enough that the user is not overwhelmed
while filling in the questionnaire.
11. • Weighting:
Another crucial aspect is to ask the user
• which of the criteria are important for the scenario
• how important they are,
• which should be excluded because they are not pertinent.
Implementing a scenario
• The Analytic Hierarchy Process is a sophisticated but complex method:
criteria are compared to each other and the user has to decide for each
comparison which criterion is the more important one.
• A simpler approach is to include an option to exclude unimportant
criteria and add a weighting mechanism for the user to indicate how
desirable a characteristic of a criterion is, or how important it is for the
scenario.
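The simpler weighting approach can be sketched as a weighted average over the criteria the user did not exclude. This is only an illustration: the technique names, the criterion names, and every score in `TECHNIQUES` are hypothetical placeholders, and `rank_techniques` is an assumed helper, not part of any PERICLES tool.

```python
from typing import Optional

# Hypothetical characteristic scores per technique and criterion, in [0, 1].
# The numbers are illustrative placeholders, not measured values.
TECHNIQUES = {
    "packaging": {"robustness": 0.5, "perceptibility": 1.0, "disk_space": 0.4},
    "metadata_fields": {"robustness": 0.9, "perceptibility": 0.8, "disk_space": 0.9},
    "steganography": {"robustness": 0.8, "perceptibility": 0.0, "disk_space": 1.0},
}


def rank_techniques(weights: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Rank techniques by weighted average; a weight of None excludes a criterion."""
    active = {c: w for c, w in weights.items() if w is not None and w > 0}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one criterion needs a positive weight")
    scores = {
        name: sum(w * chars[c] for c, w in active.items()) / total
        for name, chars in TECHNIQUES.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Scenario: a visible payload matters most, robustness matters somewhat,
# and disk space is excluded as not pertinent.
scenario = {"perceptibility": 3.0, "robustness": 1.0, "disk_space": None}
ranking = rank_techniques(scenario)
```

With these placeholder values `ranking` lists packaging first. Changing the weights or excluding other criteria changes the order, which is the point of capturing the scenario before deciding.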
12. ● There are two types of criteria to consider:
Decision criteria
• Technical criteria must be fulfilled to be able to use a
specific algorithm. File formats are an example of a
technical criterion, because some algorithms can only be
used for specific file formats.
• Scenario criteria depend on a usage scenario or the
user preferences.
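Because technical criteria must be fulfilled before an algorithm can be used at all, they can be treated as a hard filter that runs before any scenario-based ranking. The sketch below assumes a hypothetical catalogue: the technique names, the supported formats, and the payload limits are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Technique:
    name: str
    carrier_formats: frozenset[str]   # empty set = any carrier format accepted
    max_payload_bytes: Optional[int]  # None = no payload size limit


# Hypothetical catalogue; formats and limits are placeholders.
CATALOGUE = [
    Technique("packaging", frozenset(), None),
    Technique("image_lsb_embedding", frozenset({"png", "bmp"}), 4096),
    Technique("audio_embedding", frozenset({"wav"}), 1024),
]


def usable(techniques: list[Technique], carrier_format: str, payload_bytes: int) -> list[str]:
    """Return the names of techniques whose technical criteria are all fulfilled."""
    fmt = carrier_format.lower()
    return [
        t.name
        for t in techniques
        if (not t.carrier_formats or fmt in t.carrier_formats)
        and (t.max_payload_bytes is None or payload_bytes <= t.max_payload_bytes)
    ]


candidates = usable(CATALOGUE, "PNG", 2000)
```

Only the techniques returned by `usable` would then be compared on the scenario criteria.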
13. Technical criteria:
◦ File formats (for carrier files as well as the payload files)
◦ Number of files
◦ Capacity
Technical decision criteria
14. File formats
• Embedding algorithms usually support a set of file formats and cannot
be used on files in other formats.
• For packaging, the metadata has to be mapped to one of the standard
XML packaging formats. The created packet reduces the risk of data
loss.
Technical decision criteria
Number of files
• A growing number of files increases the risk of losing one of them.
• With more than one file, the files can help to identify which files
belong together by providing indications of the formats used. That
reduces the impact of file format obsolescence.
15. Capacity
• Capacity is the message size constraint: the number of payload bits that can be
embedded by an embedding algorithm into a specific digital object.
Technical criteria (cont.)
• It is influenced by the data format and used method. Some methods increase
the risk of damaging the carrier file, or the payload becomes visible if the
message size is too big.
• While packaging methods have no limit for the size of the payload files,
embedding methods mostly have a maximum payload size.
• Invisible watermarking and steganography embedding methods not only
become visible if the payload size is too high; the cost of the algorithmic
calculations also increases strongly with the size of the payload.
• The use of an information frame can, in theory, scale well to large payload
files. It might be unproductive, though, if the data frame outsizes the original
digital object. In such a case the use of a packaging method can be considered.
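As a worked example of the capacity criterion, the sketch below estimates the number of payload bits for one assumed embedding method: least-significant-bit replacement in an uncompressed image, where each colour sample carries a fixed number of payload bits. The method, the function names, and the image dimensions are assumptions chosen for illustration; other embedding algorithms have different capacity formulas.

```python
def lsb_capacity_bits(width: int, height: int, channels: int = 3, bits_per_sample: int = 1) -> int:
    """Payload bits available when each colour sample carries `bits_per_sample` bits."""
    return width * height * channels * bits_per_sample


def fits(payload: bytes, capacity_bits: int) -> bool:
    """Check the message size constraint for a given payload."""
    return len(payload) * 8 <= capacity_bits


# A 640 x 480 RGB image with one payload bit per sample.
capacity = lsb_capacity_bits(640, 480)   # 921600 bits, i.e. 115200 bytes
ok = fits(b"x" * 115200, capacity)       # exactly at the limit
too_big = fits(b"x" * 115201, capacity)  # one byte over the limit
```

Raising `bits_per_sample` increases the capacity, but it also makes the change to the carrier easier to perceive, which matches the trade-off described above.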
16. ● Processability and robustness
● Complexity (space/time) of the algorithm
● Used disk space of the output
● Restorability of the carrier
● Risk of data loss
● Perceptibility
● Location of the encapsulated information
● Spreading, standardization
● Security, confidentiality
● Authenticity
Scenario criteria
17. • To be re-usable, digital objects need to be processable normally by
applications unhindered by their encapsulated information.
Processability
• Packaging techniques might require unpacking before processing
the digital object, thus consuming additional calculation power.
• Embedding techniques do not change the file format of a digital
object which therefore can usually be processed directly.
• A method that allows for encapsulated information to survive
processing steps is considered “robust”.
• In an ideal case the embedded metadata survives even file format
conversions.
18. • Robustness can be strong or weak. Weakness implies an additional
extraction step to keep the metadata safe.
Robustness
• In a scenario where the digital object is frequently viewed and
processed, the metadata has to be embedded with a robust method.
• Steganographic methods are often very robust:
• they take an attacker into account.
• the digital objects can be processed normally, because a usage restriction would betray the
hidden messages.
• Visible digital watermarks can be very robust.
• Imperceptible watermarks are often fragile or semi-fragile,
to allow the recognition of authenticity violations.
• The use of available metadata fields is a very robust method that allows
even object conversions. This can be valid for information frames, too.
19. Parameters for calculating costs for algorithmic calculations and
resources for the encapsulation process:
Complexity of algorithms
● The time and space requirements related to the complexity of the algorithm
used for the encapsulation and recovery of the original digital object and
the environment information
● This includes costs for validation calculations and for the unpacking
algorithms of packaging strategies.
● Big O notation can help to express the algorithm's behaviour in relation to
the embedding payload size:
http://web.mit.edu/16.070/www/lecture/big_o.pdf
● Frequent use of a digital object and its metadata requires a faster
restoration time.
20. • The time for decompression has to be added if the data was compressed.
• Packaging mostly needs a lot more time for this than embedding.
• With robust methods, an extraction is not necessary before the reuse of the
object.
• Editing available metadata fields is integrated in many programs, and is
therefore not very time intensive.
• The extension of an information frame in itself is not necessarily time
intensive, whereas the embedding method used on the frame might be.
Complexity of algorithms (cont)
21. • The difference of disk space needed for the enriched digital object in
contrast to the original digital object can be an important parameter for
preserving a large amount of data.
Disk space requirements
• Some methods offer the possibility of compressing the data, so that disk
space can be saved.
• Packaging containers compress both payload and digital object.
• Embedding methods offering compression only compress the embedded
metadata and not the carrier.
• Packaging can save more disk space than embedding with compression.
• An integrity check is needed to detect possible damage during compression.
• Compression, decompression and validation require extra calculation time.
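The difference between the two compression strategies can be illustrated with a rough size comparison. The sketch uses `zlib` as a stand-in for whatever compression a real packaging container or embedding tool applies, and the sample data is synthetic, so the numbers only show the direction of the effect.

```python
import zlib


def packaged_size(carrier: bytes, payload: bytes) -> int:
    """Container compresses digital object and payload together."""
    return len(zlib.compress(carrier + payload, 9))


def embedded_size(carrier: bytes, payload: bytes) -> int:
    """Carrier is stored unchanged; only the embedded metadata is compressed."""
    return len(carrier) + len(zlib.compress(payload, 9))


# Synthetic, highly compressible sample data (an assumption for illustration).
carrier = b"uncompressed sample data " * 4000
payload = b"<creator>example</creator>\n" * 200

size_packaged = packaged_size(carrier, payload)
size_embedded = embedded_size(carrier, payload)
```

For a carrier that is already compressed, such as a JPEG image, the gain from packaging would be much smaller than in this synthetic case.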
22. • Embedding methods mostly do not need much extra disk space.
Disk space requirements (cont.)
• With steganography algorithms changing only single bits, the size of
the digital object remains constant. This method, however, limits the
capacity.
• The use of available metadata fields doesn’t need much disk space.
Compression can extend the capacity for these methods.
• Information frames need additional disk space proportional to the size of
the metadata files that should be stored.
23. • The encapsulation method has to ensure that the digital object and the
metadata can be restored.
• The digital object has to be restored in its original state unscathed.
• The integrity has to be verified by checksum comparisons.
• There are different levels of integrity, e.g. just to ensure that the significant
properties survive, or a bit exact replica.
• The significant properties have to survive in any case.
• A validation of the whole object is often simpler than the validation of the
significant properties.
• It is highly improbable that packaging damages the digital object. A
validation is easy if the checksum was added to the metadata file.
Restoration
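A bit-exact integrity check by checksum comparison can be sketched with the Python standard library. The choice of SHA-256 and the function names are assumptions; the slides do not prescribe a particular checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_bit_exact(restored: Path, recorded_checksum: str) -> bool:
    """Compare a restored object against the checksum recorded before encapsulation."""
    return sha256_of_file(restored) == recorded_checksum.lower()
```

In use, the checksum would be recorded before encapsulation (for example in the metadata file) and passed to `is_bit_exact` after restoration. This only covers the bit-exact level of integrity; checking that the significant properties survive needs a format-specific comparison.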
24. • Not all algorithms that are used for embedding are completely
reversible.
• Reversible embedding algorithms often embed the information of
how to reverse them into a defined location of the digital object.
• If metadata is converted in the encapsulation, for example by
compression, encryption, or format conversion, it might be
necessary to validate the metadata.
• The methods using available metadata fields or information frames
offer easy restoration.
Restoration (cont.)
25. • The following factors increase the risk of damage for the digital object or the
metadata:
•encryption usage
•information hiding
•compression
•processing
•conversion of the digital object
Risk of data loss
• Packaging stores the metadata in separate files. This guarantees access to the
encapsulated information, which in turn may help identify the related digital data.
• Data containers used for packaging mostly have standard formats that are
not as vulnerable as non-standardised formats.
• At the same time the risk of data loss is higher for separated files when
unpacking.
• For some embedding methods object modifications are inevitable.
26. • The term ’Data Hiding’ describes methods to embed information in such a
way that it is not perceptible by humans.
• Steganography and invisible digital watermarks are mostly detectable by
machines.
• For most preservation scenarios it is necessary to be able to detect the
encapsulated information.
• Data hiding increases the risk of losing the knowledge about the existence
of this data.
• To avoid this, the carrier can be tagged with a visible method.
• Packaging is always visible, whereas steganographic methods are usually
invisible.
Visibility
27. • The location where the metadata is encapsulated can be a decision
criterion:
•a separate file
•an exact location in a file
•the time dimension of an audio file
•the background noise.
Location
• The location of storage is the main difference between packaging
and embedding methods:
•Packaging stores the information in a separated file, mostly in a standardised XML
format.
•Embedding stores the environment information directly in the digital object.
28. • The embedding into the background noise has no influence on the
significant properties and doesn't need additional disk space.
• Therefore, the noise has to be clearly identifiable, to prevent damage of
the digital object.
• Using available metadata fields, or an information frame, does not
influence the significant properties of the digital object directly.
• Some embedding methods store information by changing elements of
the object, e.g. by inverting single bits of an image pixel, or by using
an imperceptible frequency of audio files.
• Some data formats offer extra space for the storage of additional
information.
Location (cont.)
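Embedding by changing single bits can be illustrated with a minimal least-significant-bit sketch. It works on a plain byte sequence that stands in for raw image samples; real tools operate on decoded pixel data and usually also store the payload length, which this sketch leaves to the caller. The function names are hypothetical.

```python
def embed_lsb(carrier: bytes, payload: bytes) -> bytes:
    """Write the payload bits into the least significant bit of each carrier byte."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("payload exceeds carrier capacity")
    out = bytearray(carrier)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & 0xFE) | bit  # clear the lowest bit, then set it
    return bytes(out)


def extract_lsb(carrier: bytes, payload_len: int) -> bytes:
    """Read `payload_len` bytes back from the least significant bits."""
    if payload_len * 8 > len(carrier):
        raise ValueError("carrier too small for the requested payload length")
    result = bytearray()
    for i in range(payload_len):
        value = 0
        for sample in carrier[i * 8:(i + 1) * 8]:
            value = (value << 1) | (sample & 1)
        result.append(value)
    return bytes(result)


samples = bytes(range(256))  # stand-in for raw pixel samples
marked = embed_lsb(samples, b"PERICLES")
recovered = extract_lsb(marked, 8)
```

The marked carrier has the same length as the original and each sample changes by at most one, but the original lowest bits are overwritten, so this particular variant is not reversible.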
29. • Some encapsulation tools offer security features, like encryption.
Security
• If encryption requires a secret key for accessing the data, there is a
high risk of losing the data by losing the key.
• The preservation and re-use of confidential objects or encapsulated
information requires adequate prudence. For this purpose an
encryption makes sense.
• The confidentiality of steganographic methods is based on the
retention of knowledge or authorisation, if no additional encryption is
used. As such, this constitutes a very weak kind of confidentiality.
30. • Authenticity and integrity of the digital object and its environment
information are paramount for many usages.
• Authenticity can be important if the digital object has special legal
requirements.
• Fragile or semi-fragile digital watermarks can be used in some
cases to ensure the integrity of a delivery copy of an object. The
object is changed slightly by the application of the watermarking
algorithm.
• The mark would be destroyed if the file were altered, so an
intact mark can ensure that no third party changed the object.
• Authenticity plays a major role in the archive context in which also
the provenance and chain of custody of an object are important.
Authenticity
31. • To guarantee the integrity of a digital object, it is often kept apart in
its original state and context; all changes to the original are avoided.
Integrity
• The BagIt directory structure can be used without applying an
additional packaging or compression method to prevent object
alterations.
• Metadata is added into other defined directories of the structure, so
that the digital object remains untouched, even when information is
added at a future date.
• The integrity of the encapsulated information can be verified by
adding the checksums of the originals to the restoration metadata.
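A minimal BagIt-style layout can be written with the standard library alone. The sketch follows the basic structure of the BagIt specification (a `bagit.txt` declaration, a `data/` payload directory and a `manifest-sha256.txt` payload manifest). The `metadata` tag directory name and the helper functions are assumptions for illustration, and a production workflow would more likely rely on an existing BagIt library.

```python
import hashlib
from pathlib import Path


def write_minimal_bag(bag_dir: Path, object_name: str, object_bytes: bytes,
                      metadata_name: str, metadata_bytes: bytes) -> Path:
    """Store the digital object under data/ and the metadata in a separate directory."""
    data_dir = bag_dir / "data"
    tag_dir = bag_dir / "metadata"  # hypothetical tag directory name
    data_dir.mkdir(parents=True, exist_ok=True)
    tag_dir.mkdir(exist_ok=True)

    (data_dir / object_name).write_bytes(object_bytes)    # object stays unmodified
    (tag_dir / metadata_name).write_bytes(metadata_bytes)

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    digest = hashlib.sha256(object_bytes).hexdigest()
    (bag_dir / "manifest-sha256.txt").write_text(
        f"{digest}  data/{object_name}\n", encoding="utf-8")
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest entry."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        if hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True
```

Adding further metadata later only touches the tag directory, so the file under `data/` and its manifest entry stay valid.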