1. GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving
Semantics [Digital Preservation]
Fabio Corubolo, University of Liverpool
11 February, IDCC 2015, London
2. Ensure long term usability of DOs
Observation:
Use of DO ⇒ access to DO’s environment
4. Define a broad set of information
Consider its significance and purposes
Explore pragmatic methods to collect such
information
5. Technical system information
DO metadata
User, policy, process information
Information necessary to use the DO:
◦ Auxiliary data (e.g. calibration data)
◦ External documentation (e.g. related documents)
◦ Implicit knowledge (e.g. user knowledge about
relevance in relation to purpose)
◦ …
6. All the entities that have some relationship to
a DO through its lifecycle
Entities: DOs, metadata, policies, rights,
services, users, etc.
Refinement:
Information about the set of relationships
from the DO to any related objects
7. DOs are preserved for different uses and
purposes
Purposes give scope to the dependent
environment information
Weights can be expressed based on purpose
(definition)
SEI is the set of relationships between a DO and
its environment information qualified with
purpose and weights
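The SEI definition above can be sketched as a simple data structure: a minimal, hypothetical record type (names invented for illustration, not PET's actual data model) in which each relationship from a DO to an environment item is qualified with a purpose and a significance weight.

```python
from dataclasses import dataclass

# Hypothetical sketch of one SEI relationship record.
# Field names are illustrative, not taken from the PET codebase.
@dataclass
class SEIRelation:
    digital_object: str    # identifier of the DO
    environment_item: str  # related entity: metadata, policy, user, service, ...
    purpose: str           # the use purpose that scopes this dependency
    weight: float          # significance of the dependency for that purpose

# An SEI would then be a set of such relations for one DO.
rel = SEIRelation("report.pdf", "calibration-data.csv", "reuse", 0.8)
```

The point of qualifying each edge with a purpose is that the same DO can carry different dependency sets, with different weights, for different preservation purposes.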
8. Observe the use of DOs throughout the lifecycle
◦ Curation doesn’t start at the archive but continues
throughout the DO’s life
Collect dependencies for use (SEI)
Measure significance
Sheer curation:
◦ curation activities integrated in the use workflow;
◦ lightweight and transparent
9. Open source* framework - builds on the SEI
Sheer curation – at the right time and place
Generic, modular, domain agnostic
Flexible configuration and profiles
Monitoring changes in time
Snapshot of the system environment
User is in full control of the app and data
To observe unstructured workflows
* Apache 2.0 licensed, on GitHub
11. Install PET, configure, leave it monitoring
Profile is use case specific
User interacts with DOs; PET collects in the background:
◦ Environment information,
◦ DO events
◦ Changes
12. 1. Collect EI: the user works on a machine with
PET installed and running in the background
--- We are now here ---
2. SEI graph: PET data analyzed, relationships
between DOs discovered.
3. Weighted SEI graph: assign weights to
relationships (with purpose and significance)
4. Graphs can help:
1. understand inter-document relationships
2. appraisal of documents; defining collections
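Steps 2 and 3 above can be sketched in a few lines: turn the relationships PET discovered into a graph, then weight each edge. The event data and the weighting rule (plain observation frequency) are invented for this example; PET's actual analysis is richer.

```python
from collections import defaultdict

# Invented sample data: (digital object, related environment item)
# pairs as PET might observe them during use.
events = [
    ("anomaly-42.txt", "telemetry.log"),
    ("anomaly-42.txt", "telemetry.log"),
    ("anomaly-42.txt", "console-notes.md"),
]

# Step 2: SEI graph - one edge per discovered relationship,
# counting how often it was observed.
graph = defaultdict(float)
for do, item in events:
    graph[(do, item)] += 1.0

# Step 3: weighted SEI graph - normalise counts into weights.
total = sum(graph.values())
weighted = {edge: count / total for edge, count in graph.items()}
```

With weights attached, the graph supports the uses listed in step 4: heavily weighted edges point at the relationships that matter most for appraisal and collection building.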
16. Operator’s task: resolve anomalies
Process: extensive search in the archived data
Issue: preserve implicit information, help with
overload
PET task: record SEI for a specific anomaly
◦ monitor environment, record significant events,
infer documentation useful to solve the anomaly
SEI: to identify and debug a specific anomaly,
that is the implicit operator knowledge
17. An anomaly is reported in a handover sheet
The operator proceeds with
documentation search and
consultation, all tracked by
PET
18. Improve: filtering, dependency inference
Semantics for SEI and significance weights
Explore weighted dependency graphs to
support appraisal
19. Can you think of other situations where PET
could be useful in your practice?
20. Get involved! This is open source (-:
https://github.com/pericles-project/pet
Editor's Notes
WE want to collect important information that could be lost if not gathered at the right time.
Aim: Ensure long term usability of Digital Objects
Observation: Usability of Digital Object can require access to parts of its environment
Define a broad set of information (Environment information)
Consider its significance (Significant environment information)
Explore and test pragmatic methods to collect such information (PET)
Technical system information (OS, system architecture, etc.)
DO metadata (descriptive, structural, technical)
User, policy, process information (user background knowledge, interaction with the system and document collections, use data, etc.)
Information necessary to make use of the object:
Auxiliary data (e.g. calibration data support sensor data)
External documentation (e.g. specifications, related documents)
Implicit knowledge about what data is useful to use the DO (e.g. the user’s knowledge about what is relevant and what is not in the collection)
…
Observe the use of DOs throughout the lifecycle
Collect dependencies for use (SEI)
Measure significance
E.g. based on frequency of use
Different semantics and factors for significance weights
Weights will change in time
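The notes above suggest frequency of use as one factor for significance weights, and that weights change over time. A hedged sketch of one possible weighting rule (the exponential decay and its half-life parameter are assumptions for illustration, not PET's actual formula):

```python
# One possible significance weight: use frequency discounted by
# recency. The half-life parameter is an invented default.
def significance(use_count, days_since_last_use, half_life_days=30.0):
    decay = 0.5 ** (days_since_last_use / half_life_days)
    return use_count * decay
```

Under this rule a dependency used ten times today scores 10.0, but the same dependency last touched a month ago scores only 5.0, capturing the idea that weights decay as evidence of use ages.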
Sheer curation: curation activities integrated in the use workflow; lightweight and transparent
Extracts information that is usually ignored by current metadata extractors.
Visualizes information change over time.
Information snapshot extractions allow getting a quick overview of extractable information.
Modules:
Available and used system resources;
File format identification and checksums;
Currently running processes;
Event information (file and network) from processes;
Graphic configuration information;
MS Office and PDF font dependencies.
Native commands
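As a concrete illustration of what the "file format identification and checksums" module in the list above records, here is a minimal sketch. The magic-byte table is a tiny invented subset and the function name is hypothetical; PET's real module is more complete.

```python
import hashlib
from pathlib import Path

# Tiny illustrative magic-byte table (not PET's actual format registry).
MAGIC = {b"%PDF": "application/pdf", b"\x89PNG": "image/png"}

def identify(path):
    """Return a hypothetical extraction record: format guess + checksum."""
    data = Path(path).read_bytes()
    fmt = next((m for sig, m in MAGIC.items() if data.startswith(sig)),
               "unknown")
    return {"format": fmt, "sha256": hashlib.sha256(data).hexdigest()}
```

Each module contributes a record like this to the environment snapshot; the checksum lets later runs detect that a DO has changed.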
PET is installed, configured, started on the machine where the DOs are used – stays in monitoring mode
The profile (modules and configuration) is use-case specific
The user interacts normally with the DOs while PET collects in the background
Collects environment information, DO events and changes for future use and analysis
(for future use and analysis)
The user works on a machine where PET runs in the background, observing the use of documents --- We are now here ---
Collected data is analyzed and relationships between DOs are derived; this forms an SEI graph
Assign weights to relationships based on the purpose and significance – weighted graph
SEI graphs can help in understanding inter-document relationships and the appraisal of documents, as well as collection building and analysis
Please note: this is one example, based on one scenario. I prefer to give a complete example within a single scenario, but there are many possible scenarios that PET can address with proper configuration and modules.
I will now briefly introduce a synthetic (fictional) scenario inspired by the BUSOC mission operators use case.
- BUSOC operators sometimes face the task of resolving anomalies, such as when an instrument does not respond as expected.
The process they follow is guided by their knowledge of the domain and involves research on the archived documentation and operation data, which can include, for example, solutions from previous anomalies, telemetry, console logs, meeting notes, emails, etc.
Such data, although present in storage, requires experience to select; this selection demands specific knowledge that is usually passed from operator to operator.
- The issue we want to address is preserving the useful information embedded in the use of specific documents from the large collection, and helping the operators with the information overload.
The task the PET tool is trying to accomplish is to record the SEI for this use case, for a specific anomaly. This is done by monitoring the environment and recording significant events (via a PET profile), and from there inferring new dependencies
between anomalies and mission documentation, in order to preserve useful information that is otherwise not captured.
The SEI in this case is the EI that will help to identify and debug a specific anomaly.
We set up a specific PET profile that tracks the use of relevant software on specific files, using the PET software monitor; this gives us a trace of the documents that have been used at a given moment in time.
At the same time, it is possible to observe the ‘handover sheet’ and track the reported start and end times of an anomaly.
The connection between the documentation trace and the ‘handover sheet’ tracking allows us to infer the ‘anomaly solving time span’ (indicated with a red line in Figure 4) and to assume a dependency between the solution to the anomaly and the documentation that was used between the start and end of the anomaly.
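The inference described above can be sketched in a few lines: documents whose tracked use falls inside the anomaly's start/end window (taken from the handover sheet) become candidate dependencies of the anomaly. Timestamps, document names, and the function name are invented for illustration.

```python
# Hedged sketch of the time-span inference; not PET's actual analysis code.
def infer_dependencies(anomaly_start, anomaly_end, doc_events):
    """doc_events: (timestamp, document) tuples from PET's software monitor.

    Returns the documents used within the anomaly-solving time span,
    which we treat as candidate dependencies of the anomaly.
    """
    return sorted({doc for t, doc in doc_events
                   if anomaly_start <= t <= anomaly_end})

# Invented trace: only events inside the window [10, 20] qualify.
events = [(5, "old-report.pdf"), (12, "telemetry.log"),
          (14, "manual.pdf"), (30, "email.txt")]
deps = infer_dependencies(10, 20, events)
```

This simple overlap rule is exactly where the "noise" discussed next enters: multitasking puts unrelated documents inside the window, so the raw candidate set needs further filtering.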
In future work we will consider more complex issues that we ignored in this simplified example, such as the ‘noise’ that event tracking can report. This noise can arise, for example, because users often multitask, so unrelated documentation may have been used that is not relevant to the anomaly solution; documentation that was quickly opened and closed may also indicate in some cases that the document was not relevant. We will also explore ways to obtain fine-grained tracking, for example recording which pages of a document have been consulted. We plan to dedicate effort to a more careful analysis of the collected data in the next phases.