Binary Trees? Automatically identifying the links between born-digital records

Australian Society of Archivists
Conference 2016, Parramatta
Session 5: Description and Innovation
Ross Spencer
Digital Preservation Analyst
Systems Strategy and Standards team

Department of Internal Affairs
How do we view the world?

Binary Trees?

But that looks like a network graph?!
• It is!
• Records (Items) connected across many recordkeeping and
archival contexts
• Across functions; People; Agency; Subject; Context; References...
Date; File Format...
• No boundaries!
@ArchivesNZ:ItemA -> references -> @DOC:ItemB

We know this...
• Continuum model (Multiple contexts over space and time)
• ICA Draft Conceptual Model (RiC)
• 73 Record Relations RiC-R1 to RiC-R73
• Three of which we might be able to (more easily) automate?
• Has Copy; Is Copy Of; Has Part
• Wherein (I suggest) lies the issue...

Archives NZ Context
In 2011 Archives New Zealand developed its new conceptual model and metadata
schema for archival description.
It was designed to accommodate description of born-digital records.
There was much discussion among archivists about the practicalities of describing relationships
between items.
It was acknowledged that, given the volumes of digital records likely to be in each
transfer, neither agency nor Archives staff were likely to examine the content of items
visually one-by-one to determine which other items they referred to...
~ Talei Masters

What then do we do?
• Mathematical properties of digital files...
• Signals ->
• Numbers ->
• Encoding Schemes (UTF8, ASCII) ->
• Data Structures ->
• File Formats -> User Content.
• Reduce again to a series of numbers that we can interpret to use
numerical properties:
• Greater than; less than; equal to; not equal to...

In the relationship between numbers we can find the
relationships between records
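
As a minimal sketch of that reduction, the fragment below prints a record's bytes as decimal numbers and then compares two records numerically. It is an illustration only: the helper name `bytes_as_numbers` is hypothetical, and the standard `od`, `tr` and `sed` utilities are assumed.

```shell
#!/bin/sh
# Sketch only: bytes_as_numbers is a hypothetical helper name.
# Assumes the standard od, tr and sed utilities.

# Read bytes on standard input and print them as space-separated decimal numbers.
bytes_as_numbers() {
    od -An -v -tu1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

# Two short "records" reduced to numbers.
record_a=$(printf 'ABC' | bytes_as_numbers)
record_b=$(printf 'ABD' | bytes_as_numbers)
echo "Record A: $record_a"
echo "Record B: $record_b"

# Once a record is a series of numbers, equality is a plain comparison.
if [ "$record_a" = "$record_b" ]; then
    echo "equal to"
else
    echo "not equal to"
fi
```

Every relation that follows builds on comparisons of this kind, applied to checksums, patterns or whole byte streams.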

Relations we might be able to create...
• Relationship One: Is Identical
• Relationship Two: Is Similar
• Relationship Three: Contains Hyperlink
• Relationship Four: Contains CMS Reference
• Relationship Five: Contains Embedded Digital Objects
• Relationship Six: Contains Intra-Item Relationships
• Relationship Seven: Contains Object References
• Relationship Eight: Item Mentions

Relationship One: Is Identical
• We often have checksums available in a digital repository
• First comparison in a digital transfer...
Does Checksum A still equal Checksum A?
• If yes, accept, continue to transfer...
• If no... reject! Inspect!
• Expose this information in the catalogue and compare; what
happens?
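
The comparison described above can be sketched in shell. This is a minimal illustration rather than the repository's actual workflow: it assumes `md5sum` (GNU coreutils) is installed, and the helper names `checksum_of` and `is_identical` and the relation label printed are hypothetical.

```shell
#!/bin/sh
# Sketch only: checksum_of and is_identical are hypothetical helper names.
# Assumes md5sum (GNU coreutils) is installed.

# Print just the hex digest of a file.
checksum_of() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Succeed (exit status 0) when two files have the same checksum.
is_identical() {
    [ "$(checksum_of "$1")" = "$(checksum_of "$2")" ]
}

# Demonstration with throwaway files.
demo_dir=$(mktemp -d)
printf 'Minutes of the meeting\n' > "$demo_dir/a.txt"
printf 'Minutes of the meeting\n' > "$demo_dir/b.txt"
printf 'Minutes of the meeting!\n' > "$demo_dir/c.txt"

if is_identical "$demo_dir/a.txt" "$demo_dir/b.txt"; then
    echo "a.txt -> isIdenticalTo -> b.txt"
fi
if ! is_identical "$demo_dir/a.txt" "$demo_dir/c.txt"; then
    echo "a.txt and c.txt differ"
fi
rm -rf "$demo_dir"
```

In a catalogue the same test would run against stored checksums rather than recomputed ones, so items held in different archival contexts could be matched without reading the files again.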

Relationship One: Is Identical
Archival Context A
Record Keeping System A
Archival Context B
Record Keeping System B

Relationship Two: Is Similar
• MD5 (Rivest, 1992):
• File A (Zero changes):
8c69dc0668c4c73092a7042df45e756adb170742
• File B (1 Byte Removed):
6b75b8f235c148efd1b03d9c113664895b5aa7cd

• SSDEEP (Kornblum, 2006):
• File A (Zero changes):
1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP
• File C (First 250 Bytes Removed*):
1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP
*Less than two tweets (140 bytes each)
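
The contrast above rests on a property of cryptographic digests: removing a single byte produces an unrelated value. The sketch below demonstrates only that half of the comparison, assuming `md5sum` (GNU coreutils) is installed. The helper name `digest_of` and the sample text are hypothetical. The fuzzy-hash half is not reproduced here because it needs a separate tool such as ssdeep.

```shell
#!/bin/sh
# Sketch only: digest_of is a hypothetical helper name.
# Assumes md5sum (GNU coreutils) is installed.

# Print the MD5 hex digest of whatever arrives on standard input.
digest_of() {
    md5sum | cut -d ' ' -f 1
}

# "File A" and "File B" differ by a single removed byte (the final "g").
digest_a=$(printf 'The quick brown fox jumps over the lazy dog\n' | digest_of)
digest_b=$(printf 'The quick brown fox jumps over the lazy do\n' | digest_of)

echo "File A: $digest_a"
echo "File B: $digest_b"

# A cryptographic digest only answers "identical or not"; it says nothing
# about how close the two inputs are.
if [ "$digest_a" != "$digest_b" ]; then
    echo "Digests differ although only one byte was removed"
fi
```

A fuzzy hash is designed for the opposite behaviour: the two SSDEEP values shown above remain almost the same after 250 bytes are removed.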

• First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.)
• Oliver et al. (2014): thresholds should be tuned for each application
• First application is item-level sentencing during transfer feasibility
investigations
• Manually sentence... 10 records per hour
• Follow links to those not of archival value...
* Trend Micro Locality Sensitive Hash!

MD5 Hash Fuzzy Hash

Relationship Two: Similar

You liked this record... you might also like...

Relationship Three: Contains HTTP://
• Burnhill et al. (2015)
• 64,000 e-theses, 46,000 pointed out to external sources
• Websites, external files, etc.

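#!/bin/bash
set -e
# Location of the files to scan
FILES='/home/digital-preservation/accessions'
# Print every line of a document's text that contains a hyperlink.
# $1 is the path of the file to analyse.
dp_analysis ()
{
   # catdoc converts the document to plain text; control characters are stripped
   echo -e $(catdoc "$1" | grep "http://") | tr -d '[:cntrl:]'
   echo
}
# Find loop: split on newlines only, so paths containing spaces survive
oIFS=$IFS
IFS=$'\n'
time (find "$FILES" -type f | while read -r file; do
   dp_analysis "$file"
done)
IFS=$oIFS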

• https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520
• ~5059 parseable born-digital records
• ~4800 lines contained hyperlinks
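
The script above prints whole lines that contain a hyperlink. As a possible next step, the sketch below pulls the individual URLs out of plain text so that each one could become a relation. It is not part of the gist: the helper name `extract_links` is hypothetical, it also matches `https://`, and it assumes a `grep` that supports `-o` (GNU, BSD and BusyBox grep do).

```shell
#!/bin/sh
# Sketch only: extract_links is a hypothetical helper name.
# Assumes a grep that supports -o (GNU, BSD and BusyBox grep do).

# Read plain text on standard input and print each distinct URL once.
# Trailing sentence punctuation is dropped before de-duplicating.
extract_links() {
    grep -oE 'https?://[^[:space:]"<>]+' |
        sed 's/[.,;:)]*$//' |
        sort -u
}

# Demonstration on an invented line of text.
printf 'See http://example.org/report.pdf, and https://example.org/.\n' |
    extract_links
```

Text extracted by `catdoc` could be piped into the function in the same way as the `printf` line in the demonstration.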

Relationship Four: Contains CMS Reference
echo -e $(catdoc "$file" | grep -E "A[0-9]{6}")
• Matches the Archway catalogue reference number, e.g.
A204050; A123456; and not AZ12345
• CMS reference could be sent alongside transfer metadata
for such searches.
• Flag existence (at least) - FYI to the end user – be that the
transfer archivist, the agency, or the researcher
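
A minimal sketch of such a search follows, with the pattern passed as a parameter so that a reference format sent alongside transfer metadata could be swapped in. The helper name `find_cms_refs` and the variable `CMS_PATTERN` are hypothetical. The sketch assumes a `grep` that supports `-o` and `-w`, and it uses whole-word matching so that longer strings such as A1234567 are not reported.

```shell
#!/bin/sh
# Sketch only: find_cms_refs and CMS_PATTERN are hypothetical names.
# Assumes a grep that supports -o and -w (GNU, BSD and BusyBox grep do).

# Default pattern: the letter A followed by exactly six digits.
CMS_PATTERN='A[0-9]{6}'

# Read plain text on standard input and print each distinct whole-word
# match of the pattern given as the first argument (default: CMS_PATTERN).
find_cms_refs() {
    grep -owE "${1:-$CMS_PATTERN}" | sort -u
}

# Demonstration on an invented line of text.
printf 'Refer to A204050 and A123456, but not AZ12345.\n' | find_cms_refs
```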

Relationship Five: Contains Embedded Object
$ java -jar tika-app-1.13.jar -z --extract-dir=<dirname> <filename>

Relationship Six: Contains Intra-Item Record

Relationship Seven: Contains Object Reference
A digital preservation risk...

Relationship Seven: Contains Object Reference
Extract files from PPT OLE2 -> Read PowerPoint Document Object ->
Look for:

Relationship Eight: Item Mentions
Dictionary:
Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
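
One way to read the dictionary above is as a list of terms to look for in each item's text. The sketch below takes that reading. It is a plain case-insensitive string match, not named-entity recognition, and the helper name `item_mentions`, the file names and the sample sentence are all hypothetical.

```shell
#!/bin/sh
# Sketch only: item_mentions is a hypothetical helper name.

# item_mentions DICTIONARY TEXTFILE
# Print every dictionary term (one per line in DICTIONARY) that occurs in
# TEXTFILE, ignoring case. Terms are matched as fixed strings, not regexes.
item_mentions() {
    while IFS= read -r term; do
        # Skip blank lines, which would otherwise match everything.
        [ -n "$term" ] || continue
        if grep -qiF -e "$term" "$2"; then
            printf '%s\n' "$term"
        fi
    done < "$1"
}

# Demonstration with part of the dictionary above and an invented sentence.
demo_dir=$(mktemp -d)
printf 'Helen Clark\nJohn Key\nUnited Nations\nLabour Party\n' > "$demo_dir/dictionary.txt"
printf 'A speech by Helen Clark to the United Nations.\n' > "$demo_dir/item.txt"
item_mentions "$demo_dir/dictionary.txt" "$demo_dir/item.txt" |
    sed 's/^/item.txt -> mentions -> /'
rm -rf "$demo_dir"
```

Because the match is a simple substring test, overlapping entries such as "Helen Clark" and "Helen Elizabeth Clark" are reported independently of one another.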

Relationship Eight: Item Mentions

Discussion
• Data structures – support needed in catalogue, and digital
preservation system...
• Extensible, flexible enough not to (need to) know what the
future holds...
• AS/NZS 5478:2015, Recordkeeping metadata property
reference set (RMPRS) states:
“The digital world is increasingly using networked
relationships”.

Discussion
• Verhoeven (2016) – Devil’s Bridges!
– Ontological, graph/network based infrastructures
– Vernacular ontologies
– Understand, Make, Improve Quality of our Connections
– redistribution of power and the possibilities of world
making (and remaking) in the archive

Providing the algorithms are transparent, what then
provides a more objective view of the world than machine
generated relations?

Discussion
• ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech):
– Still a draft if 80% content is different from published?
– Draft because it’s marked as such in metadata?
– Draft when it has been delivered in the wild?
• ICA... RiC-R4: ‘has Subject’ semantics (This Presentation):
– Graph technologies?
– Digital preservation?
– Processing of digital archives?
– Binary trees?!

“RiC-CM aspires to reflect both facets of the Principle of Provenance, as
these have traditionally been understood and practiced, and at the same
time recognize a more expansive and dynamic understanding of
provenance. It is this more expansive understanding that is embodied in
the word “Contexts.” RiC-CM is intended to enable a fuller, if forever
incomplete, description of the contexts in which records emerge and exist,
so as to enable multiple perspectives and multiple avenues of access.”

Discussion
• Impact for record keeping; transfer; digital preservation,
discovery...
• Digital preservation – linked objects, hyperlinks, embedded
objects...
• Not all geekery!
• Remember the content of these records...
• Remember the connections...
• Remember the use cases for digital preservation; it does not
operate in and of itself!

Conclusion
• “Computer forensic examiners are often overwhelmed with
data. Modern hard drives contain more information that
cannot be manually examined in a reasonable time period
creating a need for data reduction techniques.” - Kornblum
(2006)
• So how do we begin?
One relation at a time...

Links
• Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101
• SSCOMPARE: https://github.com/exponential-decay/sscompare
• TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments
• Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop
• Apache Tika: https://tika.apache.org/
• Full Paper: Hopefully in Archives and Manuscripts sometime soon!

Thank you
ross.spencer@dia.govt.nz
@beet_keeper
