These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools of Today
1. Dan Gillean – Artefactual Systems
ACA Conference 2016 – Montréal, QC
Technologie Proche: Envisioning the Archival Systems of Tomorrow with the Tools of Today
https://en.wikipedia.org/wiki/Hermes_%28missile_program%29#/media/File:Hermes_A-1_Test_Rockets_-_GPN-2000-000063.jpg
2. This is going to be a bit of an experiment…
https://en.wikipedia.org/wiki/Nikola_Tesla#/media/File:Nikola_Tesla,_with_his_equipment_Wellcome_M0014782.jpg
3. Group Brainstorm: What projects, technologies, or initiatives from the last couple of years have inspired you around archives and technology? What are you anticipating?
Google doc: http://bit.ly/tech-Proche
Twitter: #techProche
aboutmodafinil.com via https://commons.wikimedia.org/wiki/File:Brain_01.jpg
6. We can’t future-proof… so let’s be strategic
Building systems, data, and metadata to be:
• Open
• Collaborative
• Decentralized
• Agnostic
• Interoperable
https://commons.wikimedia.org/wiki/File:Go_Board,_Hoge_Rielen,_BelgiumEdit_Fcb981.jpg
9. The Tools of Today
Rusty tools; Biser Todorov via https://commons.wikimedia.org/wiki/File:Rusty_tools.JPG
10. Distributed version control and git
Version control: A system for managing changes to source data (documents, code, etc.) over time.
Version control systems help you track who made what change, when, and why. They also allow you to roll your data back to a previous state.
https://git-scm.com/
11. Local version control
• Changes are tracked on a local computer
• Simple – great for working alone
• Not effective for collaboration
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
12. Centralized version control
• Single server with all versioned files; users “check out” files
• Allows for collaboration
• Single point of failure
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
13. Distributed version control
• Full mirroring of the source with each user
• Broadest potential for collaboration
• Complex merging, branching, etc. possible
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
15. Git
• Hashes all files in the repository
• Produces a diff of changes
• Commit messages explain who did what, when, and why
• Supports parallel workflows with a method for resolving conflicts
https://github.com/artefactual/atom-docs/commit/6d8fe4ab9803ec3421b04bcabe215e3dbb4a77dc
16. Git
• Essentially a (NoSQL) database – a key/value store
  • Objects DB and Refs DB
• libgit2: git implementation in C
  • Supports different backends
• Might git provide a model for auditable, versioned data management?
https://github.com/artefactual/atom-docs/commit/6d8fe4ab9803ec3421b04bcabe215e3dbb4a77dc
Vincent Driessen via http://nvie.com/posts/a-successful-git-branching-model/
17. Git for Archives?
• Description systems: maintain transparent provenance and manage collaboration and concurrency – e.g. authority records
• Collaborative preservation environments: track transfer/ingest, preservation actions
• Distributed, version-controlled backups: reconcile changes via branch management
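To make the description-system idea above concrete, here is a minimal sketch (in Python, shelling out to the git command line) of committing an authority record to a local repository so that each revision carries an author, timestamp, and message. The repository path, record fields, and commit message are hypothetical and not part of any existing AtoM or Archivematica workflow.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical sketch: keep an authority record as a JSON file in a git repository,
# so that every revision records who changed what, when, and why.
repo = Path("authority-records")
repo.mkdir(exist_ok=True)
subprocess.run(["git", "init"], cwd=repo, check=True)

record = {
    "authorized_form_of_name": "Gillean, Dan",
    "entity_type": "person",
    "dates_of_existence": "fl. 2016",
}
record_file = repo / "gillean-dan.json"
record_file.write_text(json.dumps(record, indent=2))

subprocess.run(["git", "add", record_file.name], cwd=repo, check=True)
subprocess.run(
    ["git", "-c", "user.name=Example Archivist", "-c", "user.email=archivist@example.org",
     "commit", "-m", "Add authority record for Dan Gillean"],
    cwd=repo, check=True,
)

# `git log` now shows the record's provenance: author, date, and commit message.
log = subprocess.run(["git", "log", "--oneline"], cwd=repo,
                     capture_output=True, text=True, check=True)
print(log.stdout)
```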
18. Peer-to-peer computing and BitTorrent
P2P: A distributed application architecture that directly links users as peers with equal privileges, instead of relying on a centralized server for communication and exchange.
19. Early mainframe computing
Lawrence Livermore National Laboratory. IBM 704 Mainframe, via https://en.wikipedia.org/wiki/Mainframe_computer#/media/File:IBM_704_mainframe.gif
20. The promise of the web
ARPANET MARCH 1977 via https://commons.wikimedia.org/wiki/World_Wide_Web#/media/File:Arpanet_logical_map,_march_1977.png
21. The rise of the cloud
https://pixabay.com/en/network-iot-internet-of-things-782707/
25. BitTorrent Protocol
• A communications protocol for peer-to-peer file sharing
• Works with files of any kind and size
• Speed increases as the number of users increases – each user becomes a potential download source
• Files are segmented into pieces and each piece is cryptographically hashed, so different pieces can be downloaded from different users and verified against the hashes in the torrent metainfo
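As a rough sketch of the piece-hashing idea (assuming Python 3.8+), the snippet below splits a file into fixed-size pieces, hashes each with SHA-1, and verifies a received piece against the expected hash for its index. Real torrents store these piece hashes in the bencoded .torrent metainfo; the piece size and file name here are illustrative only.

```python
import hashlib
from pathlib import Path

PIECE_SIZE = 256 * 1024  # 256 KiB; real torrents commonly use pieces of 256 KiB to a few MiB

def piece_hashes(path: Path, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Split a file into fixed-size pieces and return the SHA-1 digest of each piece."""
    hashes = []
    with path.open("rb") as f:
        while piece := f.read(piece_size):
            hashes.append(hashlib.sha1(piece).digest())
    return hashes

def verify_piece(piece: bytes, index: int, hashes: list[bytes]) -> bool:
    """Check a piece received from any peer against the expected hash for that index."""
    return hashlib.sha1(piece).digest() == hashes[index]

# Usage (hypothetical file name):
# expected = piece_hashes(Path("fonds-accession-2016-042.tar"))
# ok = verify_piece(downloaded_bytes, 17, expected)
```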
27. BitTorrent Protocol
• Torrent users who share content are called peers
• A group of peers is called a swarm
• Users who download without also sharing are called leeches
• Early BitTorrent implementations used a centralized tracker to discover and connect peers before they could start sharing directly
28. BitTorrent Protocol
DHT: Distributed Hash Table. A decentralized lookup system (similar to a hash table) with key/value pairs.
• Pairs are stored in the DHT
• Any peer can look up the value associated with a key
• Responsibility for maintaining the mapping from keys to values is algorithmically distributed among the nodes to minimize disruption when nodes appear, depart, or fail
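A toy sketch of how responsibility for keys can be distributed algorithmically: each key is owned by the node whose hashed identifier follows the key's hash on a ring (a simplified consistent-hashing scheme, not a real Kademlia-style DHT of the kind BitTorrent uses).

```python
import hashlib
from bisect import bisect

def h(value: str) -> int:
    """Map a node id or key onto a fixed-size integer ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")

class ToyDHT:
    """Simplified consistent-hashing ring: each key is owned by the first node
    clockwise from the key's position on the ring."""
    def __init__(self, node_ids):
        self.ring = sorted((h(n), n) for n in node_ids)

    def node_for(self, key: str) -> str:
        positions = [p for p, _ in self.ring]
        i = bisect(positions, h(key)) % len(self.ring)
        return self.ring[i][1]

dht = ToyDHT(["node-a", "node-b", "node-c"])
print(dht.node_for("infohash:abc123"))  # which peer is responsible for this key
# If a node departs, only the keys it owned move to its neighbour on the ring.
```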
29. BitTorrent Protocol
• Central trackers and DHT can be used together with other methods to increase efficiency
• Other features (e.g. encryption) can be added on top
30. Archival uses for BitTorrent?
https://archive.org/details/bittorrent
35. Archival uses for P2P and data?
• Description and access: a networking and data exchange protocol
  • Exchanging authority records between repositories
  • Updating portal sites
  • Offering content to the public
• Preservation: sharing supporting technology images
  • Operating systems, codecs, tools
  • LOCKSS-style approach to preservation and real-time access despite service interruptions at any one point – distributed copies
40. Linked data
A method of expressing and publishing structured data so that it can be interlinked and queried semantically by both humans and machines.
http://linkeddata.org/
• Turning the web of documents into a machine-readable web of data
• From the World Wide Web to the Giant Global Graph
41. Linked data
Tim Berners-Lee. John S. and James L. Knight Foundation via https://commons.wikimedia.org/wiki/File:Tim_Berners-Lee-Knight-crop.jpg
4 Principles:
• Use URIs to name (identify) things.
• Use HTTP URIs so that these things can be looked up (interpreted, "dereferenced").
• Provide useful information about what a name identifies when it's looked up, using open standards such as RDF, SPARQL, etc.
• Refer to other things using their HTTP URI-based names when publishing data on the Web.
42. RDF
A standard model for data interchange on the web
• Makes statements about resources as expressions
• Uses a subject–predicate–object structure to form triples
43. RDF
A standard model for data interchange on the web
• Uses URIs to express each element of the triple
• Can be serialized into many different formats
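A small sketch of the triple model using the Python rdflib library (assuming rdflib 6+ is installed); the example URIs, namespace, and property names are invented for illustration. The same graph can be serialized to Turtle, N-Triples, RDF/XML, and other formats.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

# Hypothetical URIs for an archival fonds and its creator
EX = Namespace("http://example.org/archives/")
fonds = EX["fonds/f0042"]
creator = EX["agent/gillean-dan"]

g = Graph()
g.bind("foaf", FOAF)
g.bind("ex", EX)

# subject - predicate - object triples, each element identified by a URI
g.add((creator, RDF.type, FOAF.Person))
g.add((creator, FOAF.name, Literal("Dan Gillean")))
g.add((fonds, EX.createdBy, creator))

# The same graph can be serialized into many different formats
print(g.serialize(format="turtle"))
print(g.serialize(format="nt"))
```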
45. Linked open data: the 5-star deployment scheme for open data
★ Make your stuff available on the Web (whatever format) under an open license
★★ Make it available as structured data (e.g., Excel instead of an image scan of a table)
★★★ Make it available in a non-proprietary open format (e.g., CSV instead of Excel)
★★★★ Use URIs to denote things, so that people can point at your stuff
★★★★★ Link your data to other data to provide context
http://5stardata.info/en/
46. Linked open data and archives
Lots of discussion, many existing useful ontologies, and some implementation in larger portal sites; few institutional implementers so far.
• Europeana Data Model (EU), Archives Hub (UK)
• LoC, W3C, Getty; SKOS, VIAF, GeoNames, etc.
• PREMIS creating an OWL ontology
• PRONOM ontology? Work on hold
• ICA EGAD developing an entity relationship model that can be expressed semantically?
48. Linked open data and archives
https://linkedjazz.org/tools/
Combining linked open data, machine learning, and crowdsourcing into a suite of user-friendly tools
49. Linked open data and archives
• Description and Access
  • Our systems should connect users to other related information – linked data is the way
  • Relationship visualizations, queries across institutions and domains
  • Crowd-sourced description and reconciliation from communities of interest
• Preservation
  • PREMIS ontology + PRONOM ontology + W3C PROV = building blocks for a community Format Policy Registry (FPR)
  • Ability to generate community-wide preservation stats
50. Blockchain technology
Blockchain: A distributed ledger that uses cryptographic hashing to maintain a continuously growing record of transactions, divided into linked, time-stamped blocks.
Jorge González, “Chains reloaded.” July 29, 2011. https://www.flickr.com/photos/aloriel/6292261464
51. Bitcoin
• An online, pseudonymous cryptocurrency
• Allows peer-to-peer transactions without an intermediary
• Created 2008, released open source in January 2009
• Relies on the blockchain:
  • Transactions are verified by nodes on the network
  • Nodes are rewarded for verification
  • Once verified, transactions are added to a new block roughly every 10 minutes
https://www.flickr.com/photos/jason_benjamin/12795733984
52. Bitcoin
• A wallet can have as many BTC addresses as desired
• An address is created via public/private key cryptography
The user installs a Bitcoin client (a wallet) and uses it to create a bitcoin address for a transaction
55. Blockchain
• Miners (peers) try to solve the cryptographic puzzle of the nonce
• The first to solve it is rewarded with bitcoin
• Other miners verify the solution
• When there is consensus, the block is added to the blockchain
• New blocks are created roughly every 10 minutes
https://medium.com/@nik5ter/explain-bitcoin-like-im-five-73b4257ac833
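A minimal sketch of what “solving the puzzle of the nonce” means: search for a nonce that makes the block’s hash fall below a difficulty target, shown here as a required number of leading zero hex digits. Real Bitcoin block headers, difficulty targets, and the roughly-10-minute adjustment are considerably more involved than this.

```python
import hashlib
import json

def mine(block: dict, difficulty: int = 4) -> tuple[int, str]:
    """Search for a nonce so the block's SHA-256 hash starts with `difficulty` zero hex digits."""
    nonce = 0
    while True:
        payload = json.dumps(block, sort_keys=True) + str(nonce)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1

block = {
    "previous_hash": "00005f3a...",   # hash of the previous block links the chain
    "transactions": ["tx1", "tx2"],   # placeholder transaction ids
    "timestamp": "2016-06-04T10:00:00Z",
}
nonce, digest = mine(block)
print(nonce, digest)
# Any peer can verify the solution cheaply by re-hashing once with the winning nonce.
```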
56. Blockchain
• An immutable, distributed, public ledger
• No single point of failure
• Removes central mediators from transactions
• Errors and fakes are resolved by consensus across all copies of the ledger
• The entire history of transactions is available to all
Graph by Rocchini – own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=19859789
59. Blockchain
Smart contracts
• Contracts are essentially agreements – much like traditional law – often with economic impacts
• Terms and conditions, time frames, and consequences all lend themselves well to code
• Ethereum provides a digital currency and a programming language to create distributed, enforceable agreements
http://etherscripter.com/what_is_ethereum.html
61. Archival blockchain uses?
Chain of custody
• An archival blockchain could support the maintenance of provenance and authenticity
• Register acquisitions, donor agreements, transfers, preservation actions
• LOCKSS-style P2P backups – use the ledger + smart contracts to track who is preserving what, where, and under what terms, with no need for centralized coordination
Archival Services Network
• Institutions donate storage space, compute power, or services
• They are rewarded with “ArchCoin” for use on the network
• These tokens are used to cover network costs and the use of other services
• DAOs and smart contracts allow automation of decentralized preservation services
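A rough sketch of the chain-of-custody idea above: each custody or preservation event commits to the hash of the previous entry, so altering any earlier entry invalidates everything after it. A real archival blockchain would also distribute the ledger and add consensus; the event fields and repository names here are hypothetical.

```python
import hashlib
import json

def append_event(log: list[dict], event: dict) -> list[dict]:
    """Append an event that commits to the hash of the previous entry."""
    previous = log[-1]["entry_hash"] if log else "0" * 64
    entry = {"event": event, "previous_hash": previous}
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return log + [entry]

def verify(log: list[dict]) -> bool:
    """Re-compute every hash; any altered entry invalidates the rest of the chain."""
    previous = "0" * 64
    for entry in log:
        expected = hashlib.sha256(json.dumps(
            {"event": entry["event"], "previous_hash": entry["previous_hash"]},
            sort_keys=True).encode()).hexdigest()
        if entry["previous_hash"] != previous or entry["entry_hash"] != expected:
            return False
        previous = entry["entry_hash"]
    return True

log: list[dict] = []
log = append_event(log, {"action": "accession", "object": "F0042", "agent": "Repository A"})
log = append_event(log, {"action": "fixity-check", "object": "F0042", "agent": "Repository B"})
print(verify(log))  # True until any entry is modified
```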
62. Bringing it all together
Thomas Hawk via https://commons.wikimedia.org/wiki/File:Fireworks_4.jpg
66. Bringing it all together
https://medium.com/on-archivy/decentralized-autonomous-collections-ff256267cbd6
“…I believe that the emergence of blockchain technology and the concept of Decentralized Autonomous Organizations, alongside the maturation of peer-to-peer networks, open library technology architectures, and open-source software practices offers a new approach to the issues of control, privilege, and sustainability that are inherent to many centralized information collections.”
– Peter Van Garderen, Decentralized Autonomous Collections
“A Decentralized Autonomous Collection is a set of digital information objects stored for ongoing re-use with the means and incentives for independent parties to participate in the contribution, presentation, and curation of the information objects outside the control of an exclusive custodian.”
67. Will we be a part of this future?
https://commons.wikimedia.org/wiki/File:GT056-Antigua_C-ruins.jpeg
68. Group questions
• What is the most interesting part of this technology in terms of archives? What potential do you see?
• Where do you think these ideas fall short? What might be some of the challenges?
• How might the ideas shared around this technology impact your institution if it existed now? How might it change the way you carry out these workflows/activities?
• If it doesn’t seem like it would solve a challenge you face, why not? What commonalities in challenges can you find with others in your group?
Google doc: http://bit.ly/tech-Proche
Twitter: #techProche
I’m Dan Gillean. I work for Artefactual Systems as the AtoM Program Manager; these views do not necessarily reflect my employer
This is a new session type the ACA wanted to try – a “working session” that aims to encourage interactivity. Its success depends on you. Feedback is welcome.
We spend a lot of time at conferences recapping our challenges or sharing our updates. We should also make time to think laterally and creatively, and to brainstorm together about what’s possible.
Instead of just consuming presentations, the hope is to encourage discussion and exchange in future working sessions, should this format be continued.
The conference theme is Futur Proche. Today, let’s think about ways in which we might start shaping that future with tools and technologies that already exist now.
Before we start, I want to be clear:
I am not an expert on any of the things I am about to talk about. My interests are in archives and technology, and these are some of the things that have caught my attention recently. Ask me too many specific questions and I might not be able to answer them. However, my hope in leading this session is to make the case that this should not stop us from paying attention to what’s out there and dreaming big about our own future. We may not be technology experts, but our perspective as a profession on archival theory and digital preservation should play an important role in imagining the future, if we want to be a part of it. Our conferences are often dedicated to discussing challenges or sharing particular institutional workflows – and these are important – but I think we also need blue-sky sessions, collaborative working sessions, and ways to share our excitement about what’s possible. It’s rare to get so many of us in the same room – often this is the one chance a year to see so many Canadian archivists in one place. Let’s see what happens when so many brilliant minds dream together.
We live in a constantly shifting technological landscape. Digital preservation is a moving target – no matter how well we plan, we’re not going to “solve” digital preservation. No matter how long-lived our media, some disruptive technology will inevitably come along that forces us to reassess our methods of preservation, storage, and retrieval. Additionally, we work in an extremely under-resourced sector, especially in Canada. It is highly unlikely that we are going to achieve long-term digital preservation if we are all going it alone, each of us coming up with bespoke solutions, re-inventing the wheel and confronting the same challenges in isolation.
That doesn’t mean that we can’t be strategic.
These are some of the high-level values that I believe will help us ensure that the investments we make now remain useful to us as we move into an uncertain future. Today I’m going to focus on possibilities for next-generation archival systems, but these values apply equally to all three components: systems, data, and metadata. We should aim for the archival systems we build today to be:
Open: open for study, reuse, exchange, modification, and repurposing.
Collaborative: given the under-resourced nature of our profession, we want solutions that will allow us to share what resources we have, solve problems collaboratively, and build upon the successes of our peers.
Decentralized: When human and economic resources can shift suddenly, we should be wary of solutions with a single point of control, or failure. Decentralized architectures are more resilient, and support collaboration.
Agnostic: our solutions should not be institution-specific, nor should they be platform, format, or vendor-specific.
Interoperable: Even across systems and platforms, we want underlying commonalities that allow for exchange, re-use, and reinterpretation.
I want to say a bit more about openness before moving on.
I believe that, in our under-resourced community, collaborative solutions will be critical. Open source enables this by allowing us to share the solutions we create or invest in, and build upon the work of our colleagues.
Digital preservation requires us to maintain access to data and metadata into uncertain futures – we already know that closed, proprietary formats and platforms make this difficult. Why, as a profession, would we ever choose to build our own tools in such a way that they might not be understandable or usable in the future? If a vendor started selling an archival Hollinger box that could not be opened in 10 or 100 years without them, would any of us buy these boxes?
Similarly, we want the tools and systems, the data, and the metadata we use and create to be useful even if their original authors are no longer available.
Open source and open standards allow us to take an iterative approach. One institution can build upon the work of another, or fork it if the use cases are irreconcilably different. Vendors have a role to play in this ecosystem, but we should not be dependent on them, and we should require them to use open standards – we can create service-based economies around open source projects.
For us to tackle the challenges of long-term preservation and access in an under-resourced sector, we need an ecosystem where we can all contribute to the solution as we are able, and we need our tools to be as preservation-ready as our data and metadata. Open source and open standards help move us closer to these aims.
So let’s take a look at some of the existing technologies that we might leverage in imagining the archival systems of tomorrow…
In each of the following cases, I’m going to introduce a technology, which really represents a computing paradigm – and a specific example of that computing paradigm. The specific examples might not actually be the technologies that will give us everything we need – but they solve problems which are very related to ours, and thus should serve as inspiration, and possible entry points to building the tools we need.
The first possibilities I want to consider are distributed version control systems, such as git.
Version control systems generally have 3 approaches.
A local VCS tracks changes directly on your computer. Files are checked out, and changes are tracked. This is the simplest model, and it’s great for an isolated environment, but the second I want to collaborate with you, I run into problems integrating your real-time changes with mine.
We might instead have a centralized VCS – like subversion. Here a central server manages our resources, we check them out from the VCS and conflicts are reconciled against the master repository. This allows for collaboration, but with a limited point of entry. If Alice and Bob are working on the same file, and then Bob wants to send his version to Carol, things will fall apart if Carol’s not part of the central repository. Equally, if our VCS server goes down, everyone’s work grinds to a halt, or we end up with no way to reconcile our independent changes.
Finally, there are distributed VCS’s like git. In this case, each user maintains a local mirror of the entire repository. If Bob wants to include Carol, Carol ends up with the entire repo as well, and can maintain her changes as an independent fork or reconcile them at any time with the master, or with just Bob or just Alice. If the main repository goes down, every user becomes a potential point of backup.
Git was created by the Linux community to streamline broad collaboration in distributed open-source development environments. There are many other alternative distributed VCSs out there, but git has become a de facto standard among developers.
Essentially, git applies a cryptographic SHA-1 hash to all files in a repository and produces a diff – shown here on the right in graphical form from GitHub – of the additions, deletions, and modifications made. Every change submitted by a contributor carries a commit message with a timestamp, a short, human-readable summary of the changes, and an option for a longer narrative description or justification. Platforms like GitHub, which add a web-based user interface to git repositories and workflows, also offer the ability to discuss changes on a per-line basis via comments and issues.
Artefactual uses git to manage all of its software projects – but we also use it to manage our documentation. In this image, we’re actually looking at an image from a diff on GitHub, made when I submitted changes to the Access to Memory documentation.
For data: git is traditionally used to manage code, while our data tends to be stored in databases. However, as Brandon Keepers of GitHub has pointed out, git is essentially a NoSQL database, in that it uses two different key/value stores (the objects DB and the refs DB) to track and manage changes. Git’s implementation in the C programming language (libgit2) further supports swapping these for many other backends, including relational databases – so we can consider using git itself to manage our data. There are some serious limits to scaling this approach, but it opens a wide avenue of possibilities for consideration.
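To see the objects database as content-addressed key/value storage: the key git uses for a file’s content is a SHA-1 over a small header plus the content itself. The sketch below reproduces git’s blob hashing in Python; the result can be checked against `git hash-object` on the same content.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 key git uses to store a file's content in its objects database.
    Git hashes a small header ("blob <size>\\0") followed by the raw content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# "hello\n" hashes to ce013625030ba8dba906f756967f9e9ca394464a,
# the same key `echo "hello" | git hash-object --stdin` reports.
print(git_blob_hash(b"hello\n"))
```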
Early computers relied on a central mainframe – and were accessed via thin clients, which didn’t have the computational resources (memory, disk space) to perform calculations without the mainframe.
As computers gained capacity (disk, memory, CPU) and prices fell, the web allowed for decentralized networks to arise, where users could exchange information directly.
Essentially, with the cloud we’ve moved back to the centralized client server model – only these servers are often owned by private companies with opaque practices and a strong business incentive.
Will this model really meet our needs moving forward? At Code4Lib 2014, Matt Zumwalt pointed out that according to the Cisco Global Cloud Index, by 2019 the data created by Internet of Everything devices alone will be 49 times higher than all the data that moved through data centers in 2014. (source: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html)
That’s an estimated 507.5 ZB per year of data created by this new connected reality – centralized models may not scale. More importantly, much of this data is highly personal – our phone metadata, our smart homes, our fitness trackers – do we want this data to be held in corporate data centers? We need a model where each user curates the data that matters to them, and shares what they want with other users who care about that data. We need to take out the middleman.
Of course, LOCKSS has been providing a P2P-based preservation environment for some time now.
In the LOCKSS model, ingested content is replicated to other participating institutions and packaged with usage rights. If the original source is unavailable, any other peer with the content can provide it to end-users.
LOCKSS also offers what they call “migration on access” – when a format can’t be displayed, it is migrated on the fly, based on a series of rules, to an equivalent format the end user can access. I’ll return to talking about preservation services, and how they might fit into distributed archival systems, later.
Dat is a grant-funded, open-source, decentralized data tool for versioning and distributing datasets. It incorporates best-of-breed ideas from the BitTorrent protocol and git’s model of distributed version control. The focus so far has been on scientific data: the pilot users have been in the fields of astronomy, bioinformatics, and neuroscience. However, the underlying versioning and sharing tool, Hyperdrive, is generalized for any binary data, including audio, video, Linux containers, and so forth.
By using the hashed diffs and Merkle trees (which we’ll talk about later) generated for each version, the tool also deduplicates data between versions to reduce the bandwidth required for sharing.
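A simplified sketch of the Merkle-tree idea: hash each piece, then repeatedly hash pairs of hashes until a single root remains. Versions that share pieces share subtree hashes, which is what makes deduplication and cheap verification between versions possible. This illustrates the general technique only, not Dat’s actual data structures.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(pieces: list[bytes]) -> bytes:
    """Hash each piece, then repeatedly hash pairs of hashes until one root remains."""
    level = [sha256(p) for p in pieces]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd-length levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

v1 = [b"piece-1", b"piece-2", b"piece-3", b"piece-4"]
v2 = [b"piece-1", b"piece-2", b"piece-3", b"piece-4-edited"]

# Unchanged pieces hash identically, so only the changed subtree differs between versions.
print(merkle_root(v1).hex())
print(merkle_root(v2).hex())
```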
Matt Zumwalt, one of the original creators of the Hydra repository project, has been involved in a module for Dat that focuses specifically on tabular data use cases, called dat jawn. The project works through Code4Philly, building a strong model around mentorship, group learning, and open, collaborative coding.
To a computer, the text on a web page is just a pile of keywords. Humans, on the other hand, can understand how different pieces of information relate to each other, and it is these relationships that add meaning to data. In order for computers to understand these relationships, we need to add structure to the data.
To make a transaction, you need several pieces of information.
Every transaction has a set of inputs and a set of outputs. The inputs identify which bitcoins are being spent, and the outputs assign those bitcoins to their new owners. Each input is just a digitally signed reference to some output from a previous transaction. Once an output is spent by a subsequent input, no other transaction can spend that output again. That’s what makes bitcoins impossible to copy.
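A structural sketch of the inputs/outputs model just described: each input references an unspent output of a previous transaction, and each output assigns value to a new owner. Signatures, scripts, and fees are omitted, and the field names are illustrative rather than Bitcoin’s actual wire format.

```python
from dataclasses import dataclass, field

@dataclass
class TxInput:
    prev_txid: str       # id of the transaction whose output is being spent
    output_index: int    # which output of that transaction
    signature: str       # placeholder for the spender's digital signature

@dataclass
class TxOutput:
    value_btc: float     # amount assigned
    address: str         # new owner's address

@dataclass
class Transaction:
    inputs: list[TxInput] = field(default_factory=list)
    outputs: list[TxOutput] = field(default_factory=list)

# Spend output #0 of a (hypothetical) earlier transaction and assign it to two new owners.
tx = Transaction(
    inputs=[TxInput(prev_txid="9f3c...e1", output_index=0, signature="<sig>")],
    outputs=[
        TxOutput(value_btc=0.7, address="1ExampleRecipientAddr"),
        TxOutput(value_btc=0.3, address="1ExampleChangeAddr"),
    ],
)
# Once this transaction is confirmed, no other transaction may reference the same
# (prev_txid, output_index) pair: that is what prevents double-spending.
print(tx)
```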
There are many use cases: registering documents, tracking provenance, providing proof of ownership, managing rights, and much more.
Examples:
Augur – blockchain-based, crowd-sourced predictions
Storj and MaidSafe – blockchain-managed distributed, encrypted storage
IPFS creator Juan Benet describes it as “really just an integration of old ideas”:
P2P architectures
The BitTorrent protocol (BitSwap in IPFS)
DHTs
Distributed version control
Support for blockchain marketplaces – partnership with Filecoin