This presentation was prepared by George Papastefanatos (Athena Research and Innovation Center) for the PERICLES final project conference 'Acting on Change: New Approaches and Future Practices in LTDP' (Wellcome Collection Conference Centre, London, 30 Nov - 1 Dec 2016).
George Papastefanatos joined a panel discussion on 'Preparing for Change' facilitated by Natalie Harrower (Digital Repository of Ireland). The panel comprised an exciting group of experts including Natasa Milic-Frayling (Intact Digital/PERSIST); Jean-Yves Vion-Dury (Xerox/PERICLES); Neil Beagrie (Charles Beagrie Ltd); and Nancy McGovern (MIT).
There is a growing awareness of the broader scope of change in digital preservation, but has this awareness yet led to understanding? And understanding to action? Our question to experts in the field of digital preservation was this: how well prepared are we to deal with the multifaceted aspects of change in our digital environments?
http://pericles-project.eu/
Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016
1. Digital Preservation in the era of Big Data
The DIACHRON Platform, archiving and querying linked open data
George Papastefanatos
gpapas@imis.athena-innovation.gr
“Athena” Research & Innovation Center
Panel discussion: Preparing for change
Acting on Change Conference: New Approaches and Future Practices in LTDP
London, Dec 2016
3. Data on the Web
• Global data space
• Connecting data from diverse domains and sources
• Primary objects:
– (description of) “entities”
– Links between “entities”
• Info granularity: from entire data collections to atomic data
• Interrelated, Heterogeneous
Adapted from Chris Bizer, Richard Cyganiak, Tom Heath,
available at http://linkeddata.org/guides-and-tutorials
[Figure: Web of Data, conceptual representation. Diagram labels: entity, typed links, URIs, "represent". Source formats shown: Spreadsheets, HTML, XML, RDFa. Data kinds shown: semi-structured, triples, statistical.]
Web of world things described by Web of data
4. Data Web Evolution
• Explosion of data volume published on web and diversity of sources
– Government
– Scientific
– Corporate
– Crowd-sourced
• Linked Open Data (LOD) continuously published
• Currently data.gouv.fr lists 350,000 datasets, data.gov.uk has 8,200 datasets.
Current Status
[Figure: snapshots for 2007, 2009 and 2011]
5. Statistics
Datasets#: 1014
Social web 51.28%
Government 18.05%
Publications 9.47%
Life sciences 8.19%
User-generated content 4.73%
Cross-domain 4.04%
Media 2.17%
Geographic 2.07%
Rapidly Evolving Ecosystem
Mid 2014
http://lod-cloud.net/
6. Big Data Preservation is Challenging
Emerging Application Domains
2020: digital data production > 40 zettabytes = 5,200 Gbytes for every person on the planet
WIRED – 09/10/2014
Effective & efficient techniques to manage the data lifecycle
[Figure: data lifecycle stages shown: Appraisal, Integration, Archiving, Producing, Publishing, Linking]
9. Dataset Model
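[Figure: DIACHRON dataset model (slide label: [M1-M12], Task 1.5). The diagram is organised along two axes: a time-agnostic space versus a time-aware space, and a data space versus a curated information space.
Data space: Diachronic Dataset D1 (time-agnostic) has time-aware versions D1(t1), D1(t2), D1(t3), …, D1(tm) along a time axis t1, t2, t3, t4, …, with change sets such as Change Set D1(t1,t2) between versions. A version is shown with a RecordSet(tm) and a Schema(tm). Records (Record_1, Record_2, …, Record_i) have a subject and record attributes (Record Atts), each with a predicate and an object, e.g. predicate “vcard:hasAddress” with object “6 Artemidos st.” and predicate foaf:name with object “John Doe”.
Curated information space: Resource_a (D1,tm), Resource_a (D1,tn) and Resource_a (D2,tk) appear alongside Diachronic Resource a; Resource Changes (D1,tm,tn) are labelled "Record and Schema changes"; Diachronic Resource a and Diachronic Resource b are connected by owl:sameAs. A second dataset, Diachronic Dataset D2, is also shown.]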
15. DIACHRON Query Language: Requirements
• Queries on archive catalog
– Lists of datasets
– Lists of versions of a given dataset
– Filtered based on temporal, provenance or other metadata criteria
• Queries on Data
– Retrieve part(s) of a dataset that match certain criteria.
• Longitudinal queries
– Retrieve part(s) of a dataset across multiple versions.
– Temporal (version based) criteria can be applied.
• Queries on Changes
– Retrieve changes between two concurrent versions.
– Limit results for specific type of changes (schema, data, etc.).
• Mixed Queries on Changes and Data
– Retrieve datasets or parts of datasets affected by specific changes
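As a rough illustration of the "Queries on Changes" requirement, the sketch below computes the added and deleted triples between two versions and then limits the result to one predicate. It is not the DIACHRON implementation: the `Triple` type, the function names, and the second address value are assumptions made up for this example, and only plain set differences are used.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """A single RDF-like statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def change_set(old: Iterable[Triple], new: Iterable[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (added, deleted) triples when moving from `old` to `new`."""
    old_set, new_set = set(old), set(new)
    return new_set - old_set, old_set - new_set


def limit_to_predicate(changes: Iterable[Triple], predicate: str) -> Set[Triple]:
    """Keep only the changed triples that use the given predicate."""
    return {t for t in changes if t.predicate == predicate}


# Two versions of the same resource; the new address is invented for the example.
v1 = {
    Triple("Resource_a", "foaf:name", "John Doe"),
    Triple("Resource_a", "vcard:hasAddress", "6 Artemidos st."),
}
v2 = {
    Triple("Resource_a", "foaf:name", "John Doe"),
    Triple("Resource_a", "vcard:hasAddress", "8 Artemidos st."),
}

added, deleted = change_set(v1, v2)
address_changes = limit_to_predicate(added | deleted, "vcard:hasAddress")
```

A mixed query on changes and data could then look up the subjects that occur in `address_changes` and retrieve their records from either version.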
16. DIACHRON Query Language: Overview
• Extension of SPARQL
– SPARQL queries are valid DIACHRON queries
• DIACHRON graph model
– basis of the query language, e.g.
– <FROM DATASET>, <FROM CHANGES>, …
• Specific versions
– AT VERSION, AFTER VERSION, BEFORE VERSION, BETWEEN VERSIONS
• Syntactic Sugar for graph patterns, e.g.
– RECORD (e.g. for record variable)
– RECATT
• Query results dereified
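To make the version keywords more concrete, here is a small sketch of how the four selectors could pick versions out of an ordered version list. The slides do not define the exact semantics, so the choices below are assumptions: AFTER and BEFORE are treated as exclusive, BETWEEN as inclusive on both ends, and `select_versions` is a hypothetical helper, not part of the DIACHRON query engine.

```python
from typing import List, Optional


def select_versions(versions: List[str], mode: str, a: str, b: Optional[str] = None) -> List[str]:
    """Select versions from an ordered list (oldest first).

    Assumed semantics (not taken from the slides):
      AT      -> exactly version a
      AFTER   -> versions strictly newer than a
      BEFORE  -> versions strictly older than a
      BETWEEN -> versions from a to b, both included
    """
    i = versions.index(a)  # raises ValueError for an unknown version
    if mode == "AT":
        return [versions[i]]
    if mode == "AFTER":
        return versions[i + 1:]
    if mode == "BEFORE":
        return versions[:i]
    if mode == "BETWEEN":
        if b is None:
            raise ValueError("BETWEEN needs two versions")
        j = versions.index(b)
        low, high = sorted((i, j))
        return versions[low:high + 1]
    raise ValueError(f"unknown selector: {mode}")


history = ["t1", "t2", "t3", "t4"]
newer = select_versions(history, "AFTER", "t2")
window = select_versions(history, "BETWEEN", "t2", "t3")
```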
18. Archiving Strategies: 2nd approach
• Hybrid Materialization
– Only major versions and all changes (delta) are stored
– Balance between query performance & storage space
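The hybrid strategy can be sketched as follows: every change (delta) is kept, but full snapshots are materialized only for some versions, and any other version is rebuilt from the nearest earlier snapshot plus the deltas after it. This is a minimal sketch under stated assumptions, not the DIACHRON archive: the `HybridArchive` class is hypothetical, and "major version" is approximated here as every n-th commit.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


class HybridArchive:
    """Stores all deltas, plus a full snapshot every `snapshot_every` versions."""

    def __init__(self, snapshot_every: int = 3) -> None:
        self.snapshot_every = snapshot_every
        self.snapshots: Dict[int, FrozenSet[Triple]] = {}
        # deltas[i] = (added, deleted) going from version i-1 to version i
        self.deltas: List[Tuple[FrozenSet[Triple], FrozenSet[Triple]]] = []
        self._latest: FrozenSet[Triple] = frozenset()

    def commit(self, triples: Iterable[Triple]) -> int:
        """Archive a new version and return its index."""
        new = frozenset(triples)
        index = len(self.deltas)
        self.deltas.append((new - self._latest, self._latest - new))
        if index % self.snapshot_every == 0:  # assumed rule for "major" versions
            self.snapshots[index] = new
        self._latest = new
        return index

    def checkout(self, index: int) -> Set[Triple]:
        """Rebuild a version from the closest earlier snapshot and later deltas."""
        if not 0 <= index < len(self.deltas):
            raise IndexError(f"no such version: {index}")
        base = max(i for i in self.snapshots if i <= index)
        state = set(self.snapshots[base])
        for i in range(base + 1, index + 1):
            added, deleted = self.deltas[i]
            state -= deleted
            state |= added
        return state


archive = HybridArchive(snapshot_every=2)
a, b, c = ("s", "p", "a"), ("s", "p", "b"), ("s", "p", "c")
for version in ({a}, {a, b}, {b, c}, {c}):
    archive.commit(version)
```

A smaller `snapshot_every` means fewer deltas to replay per checkout at the cost of more stored snapshots, which is the trade-off between query performance and storage space named on the slide.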
Web data is only one source of big data. The use and exploitation of big data in general has come to generate new value not only within the ICT sector, but in widely varying financial sectors, ranging from ….(mention here above sectors)
Data published on the web is one of the main sources of information. Following the Linked Data paradigm, data coming from many diverse domains are published on the web. The primary objects are not HTML documents but entities, each uniquely identified by a URI and connected to other entities with typed links. This forms a global interconnected dataspace that is independent of the data domain, the data formats, and the granularity of the data (entire data collections vs atomic data).
In this context, big data preservation and archiving is far from trivial.
Archiving associated with queries, linking, … almost as good as the most recent version.
At the core, there is a unified diachronic model for incorporating various data models and their evolution. It is structured along two dimensions: the time dimension and the information dimension. We distinguish between time-aware and time-agnostic objects. Time-aware objects incorporate evolution (changes) and temporal information, whereas time-agnostic objects represent unchangeable (diachronic) objects. In the information dimension, we have the data space, where we capture datasets, and the curated space, where resources capture semantically rich notions within datasets.
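The distinction between time-agnostic and time-aware objects in the data space can be sketched with a few plain data classes. The class and field names below are assumptions chosen to mirror the labels on the dataset model slide (record, record attribute, subject, predicate, object); they are not the project's actual schema, and the curated space is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RecordAttribute:
    """One predicate/object pair attached to a record."""
    predicate: str
    obj: str


@dataclass(frozen=True)
class Record:
    """A record groups the attributes stated about one subject."""
    subject: str
    attributes: Tuple[RecordAttribute, ...]


@dataclass
class DatasetVersion:
    """Time-aware object: the state of a dataset at one point in time, e.g. D1(tm)."""
    timestamp: str
    records: List[Record]


@dataclass
class DiachronicDataset:
    """Time-agnostic object: the stable identity that all versions belong to, e.g. D1."""
    name: str
    versions: Dict[str, DatasetVersion] = field(default_factory=dict)

    def add_version(self, version: DatasetVersion) -> None:
        self.versions[version.timestamp] = version

    def at(self, timestamp: str) -> DatasetVersion:
        return self.versions[timestamp]


d1 = DiachronicDataset("D1")
d1.add_version(DatasetVersion("tm", [
    Record("Resource_a", (RecordAttribute("vcard:hasAddress", "6 Artemidos st."),)),
    Record("Resource_a", (RecordAttribute("foaf:name", "John Doe"),)),
]))
```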
DIACHRON is a pilot-motivated project. Its primary focus is to deliver services tailored to the real preservation needs of big data providers. Our three use cases are: an open-data use case, dealing with governmental and statistical multidimensional data; an enterprise use case, dealing with closed-world enterprise data; and a scientific use case, dealing with biological data.