Australian Society of Archivists
Conference 2016, Parramatta
Session 5: Description and Innovation
Binary Trees? Automatically Identifying
the links between born-digital records.
Ross Spencer
Digital Preservation Analyst
Systems Strategy and Standards team
Department of Internal Affairs
How do we view the world?
Binary Trees?
But that looks like a network graph?!
• It is!
• Records (Items) connected across many recordkeeping and
archival contexts
• Across functions; People; Agency; Subject; Context; References;
Date, File Format...
• No boundaries!
@ArchivesNZ:ItemA -> references -> @DOC:ItemB
We know this...
• Continuum model (Multiple contexts over space and time)
• ICA Draft Conceptual Model (RiC)
• 73 Record Relations RiC-R1 to RiC-R73
• Three of which we might be able to (more easily) automate?
• Has Copy; Is Copy Of; Has Part
• Wherein (I suggest) lies the issue...
Archives NZ Context
2011 Archives New Zealand developed its new conceptual model and metadata
schema for archival description.
Designed to accommodate description of born-digital records.
much discussion among archivists about the practicalities of describing relationships
between items.
It was acknowledged that, given the volumes of digital records likely to be in each
transfer, neither agency nor Archives staff were likely to examine the content of items
visually one-by-one to determine which other items they referred to...
~ Talei Masters
What then do we do?
• Mathematical properties of digital files...
• Signals ->
• Numbers ->
• Encoding Schemes (UTF8, ASCII) ->
• Data Structures ->
• File Formats -> User Content.
• Reduce again to a series of numbers that we can interpret to use
numerical properties:
• Greater than; less than; equal to; not equal to...
In the relationship between numbers we can find the
relationships between records
Relations we might be able to create...
• Relationship One: Is Identical
• Relationship Two: Is Similar
• Relationship Three: Contains Hyperlink
• Relationship Four: Contains CMS Reference
• Relationship Five: Contains Embedded Digital Objects
• Relationship Six: Contains Intra-Item Relationships
• Relationship Seven: Contains Object References
• Relationship Eight: Item Mentions
Relationship One: Is Identical
• We often have checksums available in a digital repository
• First comparison in a digital transfer...
Does Checksum A still equal Checksum A?
• If yes, accept, continue to transfer...
• If no... reject! Inspect!
• Expose this information in the catalogue and compare; what
happens?
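The comparison above can be sketched in shell. This is a minimal sketch, assuming GNU coreutils (`sha256sum`, `uniq`) are available. The function name `find_identical` is hypothetical, and SHA-256 is an assumed choice of checksum, not necessarily the one held in the repository.

```shell
# Hypothetical helper: list groups of byte-identical files under a directory.
# Assumes GNU coreutils; SHA-256 is an assumed checksum choice.
find_identical ()
{
   local dir="$1"
   # Checksum every file, sort so equal checksums sit next to each other,
   # then keep only lines whose first 64 characters (the hex digest) repeat.
   find "$dir" -type f -exec sha256sum {} + |
      sort |
      uniq -w 64 --all-repeated=separate
}
```

Pointed at a directory holding transfers from two recordkeeping systems, each blank-line-separated group in the output is a candidate "Is Identical" relation.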
Relationship One: Is Identical
Archival Context A
Record Keeping System
A
Archival Context B
Record Keeping System B
Relationship Two: Is Similar
• MD5 (Rivest, 1992):
• File A (Zero changes):
8c69dc0668c4c73092a7042df45e756adb170742
• File B (1 Byte Removed):
6b75b8f235c148efd1b03d9c113664895b5aa7cd
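The effect can be reproduced with a small sketch, assuming GNU coreutils `md5sum`; the helper `md5_of` and the file names are made up for illustration. Note that an MD5 digest is 128 bits (32 hexadecimal digits), whereas the values shown above are 40 digits long, the length of a SHA-1 digest, so they may have been produced with a different algorithm. The behaviour illustrated is the same either way.

```shell
# Hypothetical demonstration: removing one byte gives an unrelated MD5 digest.
# Assumes GNU coreutils md5sum; file names are made up for illustration.
md5_of ()
{
   # Print only the 32-hex-digit digest, not the file name
   md5sum "$1" | cut -d ' ' -f 1
}

demo_dir=$(mktemp -d)
printf 'The quick brown fox jumps over the lazy dog\n' > "$demo_dir/file_a"
# file_b is file_a with the final byte (the newline) removed
printf 'The quick brown fox jumps over the lazy dog' > "$demo_dir/file_b"

echo "File A: $(md5_of "$demo_dir/file_a")"
echo "File B: $(md5_of "$demo_dir/file_b")"
```

The two digests share no useful structure, which is why a cryptographic hash can say "identical" or "not identical" but nothing about "similar".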
Relationship Two: Is Similar
• SSDEEP (Kornblum, 2006):
• File A (Zero changes):
1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP
• File C (First 250 Bytes Removed*):
1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP
*Less than two tweets (140 bytes)
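The idea of comparing files piece by piece can be illustrated with a much simpler stand-in. The sketch below is not ssdeep or TLSH: it treats each distinct line of a text file as a "piece" and reports the percentage of pieces two files share. The function name `line_similarity` is hypothetical, and the approach assumes text-like input.

```shell
# Simplified stand-in for fuzzy hashing (NOT ssdeep or TLSH): treat each
# distinct line as a "piece" and report the percentage of pieces shared.
# Assumes text-like input and the standard sort, comm and wc utilities.
line_similarity ()
{
   local a_sorted b_sorted shared total
   a_sorted=$(mktemp)
   b_sorted=$(mktemp)
   sort -u "$1" > "$a_sorted"
   sort -u "$2" > "$b_sorted"
   # comm -12 keeps only the lines present in both sorted inputs
   shared=$(comm -12 "$a_sorted" "$b_sorted" | wc -l)
   # Distinct lines across both files (the union)
   total=$(sort -u "$a_sorted" "$b_sorted" | wc -l)
   rm -f "$a_sorted" "$b_sorted"
   if [ "$total" -eq 0 ]; then
      echo 0
   else
      echo $(( 100 * shared / total ))
   fi
}
```

A score above some chosen threshold would then propose an "Is Similar" relation; where to set that threshold is an application-specific decision.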
Relationship Two: Is Similar
• First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.)
• Oliver et al. (2014) Thresholds should be tuned for each application
• First application is item level sentencing during transfer feasibility
investigations
• Manually sentence... 10 records per hour
• Follow links to those not of archival value...
* Trend Micro Locality Sensitive Hash!
Relationship Two: Is Similar
[Figure panels: MD5 Hash; Fuzzy Hash]
Relationship Two: Similar
Relationship Two: Similar
Relationship Two: Is Similar
You liked this record... you might also like...
Relationship Three: Contains HTTP://
• Burnhill et al. (2015)
• 64,000 e-theses, 46,000 pointed out to external sources
• Websites, external files, etc.
Relationship Three: Contains HTTP://
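#!/bin/bash
set -e
# Location of the files to scan
FILES='/home/digital-preservation/accessions'
# Print, on one line, every line of extracted text that contains http://
dp_analysis ()
{
   local file="$1"
   # catdoc extracts the plain text; tr strips control characters
   echo -e $(catdoc "$file" | grep "http://") | tr -d '[:cntrl:]'
   echo
}
# Find loop...
oIFS=$IFS
# Split on newlines only, so paths containing spaces stay intact
IFS=$'\n'
time(find "$FILES" -type f | while read -r file; do
   dp_analysis "$file"
done)
IFS=$oIFS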
Relationship Three: Contains HTTP://
• https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520
• ~5059 parseable born-digital records
• ~4800 lines contained hyperlinks
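To turn matching lines into individual link targets, the hyperlinks themselves can be pulled out of the extracted text. A minimal sketch, assuming a grep that supports `-o` and `-E`; the function name `extract_links` and the character class used to end a URL are illustrative choices, not taken from the gist.

```shell
# Hypothetical helper: print each distinct hyperlink found in plain text on
# standard input, one per line. Assumes a grep that supports -o and -E.
extract_links ()
{
   # -o prints only the matching part; the pattern covers http and https
   # and stops at whitespace, angle brackets or a double quote
   grep -oE 'https?://[^[:space:]<>"]+' | sort -u
}
```

It could be fed from the same extraction step, for example `catdoc "$file" | extract_links`. Trailing punctuation such as a full stop is kept by this pattern and would need trimming before links are followed.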
Relationship Four: Contains CMS Reference
echo -e $(catdoc "$file" | grep -E "A[0-9]{6}")
• Matches the Archway catalogue reference number, e.g.
A204050; A123456; and not AZ12345
• CMS reference could be sent alongside transfer metadata
for such searches.
• Flag existence (at least) as an FYI to the end user, be that the
transfer archivist, the agency, or the researcher
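A slightly stricter variant of the search lists the references themselves instead of the lines that contain them. This is a sketch under assumptions: the function name `find_cms_refs` is hypothetical, and the whole-word rule (`-w`) is an added choice so that longer strings such as A1234567 are not reported.

```shell
# Hypothetical helper: list distinct catalogue-style references (an "A"
# followed by exactly six digits) in text on standard input, with a count.
find_cms_refs ()
{
   # -w requires the match to stand alone as a word, so A1234567 and
   # XA123456 are not reported; -o prints only the matched reference
   grep -owE 'A[0-9]{6}' | sort | uniq -c
}
```

As with the hyperlink search, it can be run over extracted text, for example `catdoc "$file" | find_cms_refs`.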
Relationship Five: Contains Embedded Object
$ java -jar tika-app1.13.jar -z <filename> --extract-dir=<dirname>
Relationship Six: Contains Intra-Item Record
Relationship Seven: Contains Object Reference
A digital preservation risk...
Relationship Seven: Contains Object Reference
Extract files from PPT OLE2 -> Read PowerPoint Document Object ->
Look for:
Relationship Eight: Item Mentions
Dictionary:
Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
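A dictionary like this can be matched against extracted text with fixed-string search. A minimal sketch, assuming the dictionary is stored one term per line; the function name `count_mentions` and the file arguments are hypothetical, and matching is case-sensitive.

```shell
# Hypothetical helper: count how often each dictionary term is mentioned in
# a text file. Assumes one term per line in the dictionary file.
count_mentions ()
{
   local dictionary="$1" text="$2"
   # -F treats terms as fixed strings, -f reads them from the dictionary,
   # -o prints each hit on its own line so that hits can be counted
   grep -oF -f "$dictionary" "$text" | sort | uniq -c | sort -rn
}
```

Each term with a non-zero count is a candidate "Item Mentions" relation between the record and that person, organisation or role.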
Relationship Eight: Item Mentions
Discussion
• Data structures – support needed in catalogue, and digital
preservation system...
• Extensible, flexible enough not to (need to) know what the
future holds...
• AS/NZS 5478:2015, Recordkeeping metadata property
reference set (RMPRS) states:
“The digital world is increasingly using networked
relationships”.
Discussion
• Verhoeven (2016) – Devil’s Bridges!
– Ontological, graph/network based infrastructures
– Vernacular ontologies
– Understand, Make, Improve Quality of our Connections
– redistribution of power and the possibilities of world
making (and remaking) in the archive
Providing the algorithms are transparent, what then
provides a more objective view of the world than machine
generated relations?
Discussion
• ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech):
– Still a draft if 80% content is different from published?
– Draft because it’s marked as such in metadata?
– Draft when it has been delivered in the wild?
• ICA... RiC-R4: ‘has Subject’ semantics (This Presentation):
– Graph technologies?
– Digital preservation?
– Processing of digital archives?
– Binary trees?!
“RiC-CM aspires to reflect both facets of the Principle of Provenance, as
these have traditionally been understood and practiced, and at the same
time recognize a more expansive and dynamic understanding of
provenance. It is this more expansive understanding that is embodied in
the word “Contexts.” RiC-CM is intended to enable a fuller, if forever
incomplete, description of the contexts in which records emerge and exist,
so as to enable multiple perspectives and multiple avenues of access.”
Discussion
• Impact for record keeping; transfer; digital preservation,
discovery...
• Digital preservation – linked objects, hyperlinks, embedded
objects...
• Not all geekery!
• Remember the content of these records...
• Remember the connections...
• Remember the use cases for digital preservation; it does not
operate in and of itself!
Conclusion
• “Computer forensic examiners are often overwhelmed with
data. Modern hard drives contain more information that
cannot be manually examined in a reasonable time period
creating a need for data reduction techniques.” - Kornblum
(2006)
• So how do we begin?
One relation at a time...
Links
• Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101
• SSCOMPARE: https://github.com/exponential-decay/sscompare
• TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments
• Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop
• Apache Tika: https://tika.apache.org/
• Full Paper: Hopefully in Archives and Manuscripts sometime soon!
Thank you
ross.spencer@dia.govt.nz
@beet_keeper

Binary Trees? Automatically identifying the links between born-digital records

  • 1. Australian Society of Archivists Conference 2016, Parramatta Session 5: Description and Innovation Binary Trees? Automatically Identifying the links between born-digital records. Ross Spencer Digital Preservation Analyst Systems Strategy and Standards team
  • 2. Department of Internal Affairs How do we view the world?
  • 3. Department of Internal Affairs Binary Trees?
  • 4. Department of Internal Affairs But that looks like a network graph?! • It is! • Records (Items) connected across many recordkeeping and archival contexts • Across functions; People; Agency; Subject; Context; References; Subject... Date, File Format... • No boundaries! @ArvhivesNZ:ItemA -> references -> @DOC:ItemB
  • 5. Department of Internal Affairs We know this... • Continuum model (Multiple contexts over space and time) • ICA Draft Conceptual Model (RiC) • 73 Record Relations RiC-R1 to RiC-R73 • Three of which we might be able to (more easily) automate? • Has Copy; Is Copy Of; Has Part • Wherein (I suggest) lies the issue...
  • 6. Department of Internal Affairs Archives NZ Context 2011 Archives New Zealand developed its new conceptual model and metadata schema for archival description. Designed to accommodate description of born-digital records. much discussion among archivists about the practicalities of describing relationships between items. It was acknowledged that, given the volumes of digital records likely to be in each transfer, neither agency nor Archives staff were likely to examine the content of items visually one-by-one to determine which other items they referred to... ~ Talei Masters
  • 7. Department of Internal Affairs What then do we do? • Mathematical properties of digital files... • Signals -> • Numbers -> • Encoding Schemes (UTF8, ASCII) - > • Data Structures -> • File Formats -> User Content. • Reduce again to a series of numbers that we can interpret to use numerical properties: • Greater than; less than; equal to; not equal to...
  • 8. Department of Internal Affairs In the relationship between numbers we can find the relationships between records
  • 9. Department of Internal Affairs Relations we might be able to create... • Relationship One: Is Identical • Relationship Two: Is Similar • Relationship Three: Contains Hyperlink • Relationship Four: Contains CMS Reference • Relationship Five: Contains Embedded Digital Objects • Relationship Six: Contains Intra-Item Relationships • Relationship Seven: Contains Object References • Relationship Eight: Item Mentions
  • 10. Department of Internal Affairs Relationship One: Is Identical ● We often have checksums available in digital repository ● First comparison in a digital transfer... Does Checksum A still equal Checksum A? ● If yes, accept, continue to transfer... ● If no... reject! Inspect! ● Expose this information in the catalogue and compare; what happens?
  • 11. Department of Internal Affairs Relationship One: Is Identical Archival Context A Record Keeping System A Archival Context B Record Keeping System B
  • 12. Department of Internal Affairs Relationship Two: Is Similar • MD5 (Rivest, 1992): • File A (Zero changes): 8c69dc0668c4c73092a7042df45e756adb170742 • File B (1 Byte Removed): 6b75b8f235c148efd1b03d9c113664895b5aa7cd
  • 13. Department of Internal Affairs Relationship Two: Is Similar • SSDEEP (Kornblum, 2006): • File A (Zero changes): 1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4k CK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9Z BB1U6hP • File C (First 250 Bytes Removed*): 1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4k CK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9 ZBB1U6hP *Less than two tweets (140 bytes)
  • 14. Department of Internal Affairs Relationship Two: Is Similar • First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.) • Oliver et al. (2014) Thresholds should be tuned for each application • Fiirst application is item level sentencing during transfer feasibility investigations • Manually sentence... 10 records per hour • Follow links to those not of archival value... * Trend Micro Locality Sensitivity Hash!
  • 15. Department of Internal Affairs Relationship Two: Is Similar MD5 Hash Fuzzy Hash
  • 16. Department of Internal Affairs Relationship Two: Similar
  • 17. Department of Internal Affairs Relationship Two: Similar
  • 18. Department of Internal Affairs Relationship Two: Is Similar You liked this record... you might also like...
  • 19. Department of Internal Affairs Relationship Three: Contains HTTP:// • Burnhill et al. (2015) • 64,000 e-theses, 46,000 pointed out to external sources • Websites, external files, etc.
  • 20. Department of Internal Affairs Relationship Three: Contains HTTP:// #!/bin/bash set -e #FILES LOCATION FILES='/home/digital-preservation/accessions' dp_analysis () { echo -e $(catdoc "$file" | grep "http://") | tr -d '[:cntrl:]' echo } # Find loop... oIFS=$IFS IFS=$'n' time(find "$FILES" -type f | while read -r file; do dp_analysis "$file" done) IFS=$oIFS
  • 21. Department of Internal Affairs Relationship Three: Contains HTTP:// • https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520 • ~5059 parseable born-digital records • ~4800 lines contained hyperlinks
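The script on the slides reports whole lines containing `http://`. A possible refinement, sketched here under assumptions, is to pull out only the link itself and to accept `https://` as well. `extract_links` is a hypothetical helper, and the character class that ends a URL is a simplification, so trailing punctuation such as a full stop is kept.

```bash
#!/bin/bash
# Sketch: read text on stdin and print each distinct hyperlink once.
# extract_links is a hypothetical helper name; the pattern is deliberately
# simple and treats whitespace, quotes and angle brackets as the end of a URL.
extract_links () {
    grep -Eo 'https?://[^[:space:]"<>]+' | sort -u
}

# Usage with the text extraction step from the slides:
#   catdoc "$file" | extract_links
```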
  • 22. Department of Internal Affairs Relationship Four: Contains CMS Reference echo -e $(catdoc "$file" | grep -E "A[0-9]{6}") • Matches the Archway catalogue reference number, e.g. A204050; A123456; and not AZ12345 • CMS reference could be sent alongside transfer metadata for such searches. • Flag existence (at least) - FYI to the end user – be that the transfer archivist, the agency, or the researcher
  • 23. Department of Internal Affairs Relationship Five: Contains Embedded Object $ java -jar tika-app1.13.jar -z <filename> --extract-dir=<dirname>
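To flag the references themselves, not just the lines that contain them, the same pattern can be matched as a whole word. This is a sketch assuming GNU `grep`; `extract_cms_refs` is a hypothetical helper. The `-w` option also rejects longer runs of digits such as A1234567, which the bare pattern would partially match.

```bash
#!/bin/bash
# Sketch: read text on stdin and print each distinct catalogue reference
# of the form "A" followed by exactly six digits.
# extract_cms_refs is a hypothetical helper name.
extract_cms_refs () {
    grep -Eow 'A[0-9]{6}' | sort -u
}

# Usage with the text extraction step from the slides:
#   catdoc "$file" | extract_cms_refs
```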
  • 24. Department of Internal Affairs Relationship Six: Contains Intra-Item Record
  • 25. Department of Internal Affairs Relationship Seven: Contains Object Reference A digital preservation risk...
  • 26. Department of Internal Affairs Relationship Seven: Contains Object Reference Extract files from PPT OLE2 -> Read PowerPoint Document Object -> Look for:
  • 27. Department of Internal Affairs Relationship Eight: Item Mentions Dictionary: Helen Clark Helen Elizabeth Clark John Key United Nations Prime Minister University of Auckland Jenny Shipley Labour Party
  • 28. Department of Internal Affairs Relationship Eight: Item Mentions
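A dictionary lookup of this kind can be sketched with fixed-string matching. The assumptions here are a plain-text dictionary file with one term per line, like the list on the slide, and GNU `grep`; `find_mentions` is a hypothetical helper. Terms that wrap across a line break in the extracted text are not found by this approach.

```bash
#!/bin/bash
# Sketch: read text on stdin and print each dictionary term it mentions.
# $1 is a dictionary file with one term per line (no blank lines).
# find_mentions is a hypothetical helper name. Matching is case-insensitive
# and the output shows each term as it was written in the text.
find_mentions () {
    grep -oiF -f "$1" | sort -u
}

# Usage with the text extraction step from the slides:
#   catdoc "$file" | find_mentions dictionary.txt
```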
  • 29. Department of Internal Affairs Discussion • Data structures – support needed in catalogue, and digital preservation system... • Extensible, flexible enough not to (need to) know what the future holds... • AS/NZS 5478:2015, Recordkeeping metadata property reference set (RMPRS) states: “The digital world is increasingly using networked relationships”.
  • 30. Department of Internal Affairs Discussion • Verhoeven (2016) – Devil’s Bridges! – Ontological, graph/network based infrastructures – Vernacular ontologies – Understand, Make, Improve Quality of our Connections – redistribution of power and the possibilities of world making (and remaking) in the archive
  • 31. Department of Internal Affairs Providing the algorithms are transparent, what then provides a more objective view of the world than machine generated relations?
  • 32. Department of Internal Affairs Discussion • ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech): – Still a draft if 80% content is different from published? – Draft because it’s marked as such in metadata? – Draft when it has been delivered in the wild? • ICA... RiC-R4: ‘has Subject’ semantics (This Presentation): – Graph technologies? – Digital preservation? – Processing of digital archives? – Binary trees?!
  • 33. Department of Internal Affairs “RiC-CM aspires to reflect both facets of the Principle of Provenance, as these have traditionally been understood and practiced, and at the same time recognize a more expansive and dynamic understanding of provenance. It is this more expansive understanding that is embodied in the word “Contexts.” RiC-CM is intended to enable a fuller, if forever incomplete, description of the contexts in which records emerge and exist, so as to enable multiple perspectives and multiple avenues of access.”
  • 34. Department of Internal Affairs Discussion • Impact for record keeping; transfer; digital preservation, discovery... • Digital preservation – linked objects, hyperlinks, embedded objects... • Not all geekery! • Remember the content of these records... • Remember the connections... • Remember use-cases for digital preservation, it does not operate in and of itself!
  • 35. Department of Internal Affairs Conclusion • “Computer forensic examiners are often overwhelmed with data. Modern hard drives contain more information that cannot be manually examined in a reasonable time period creating a need for data reduction techniques.” - Kornblum (2006) • So how do we begin? One relation at a time...
  • 36. Department of Internal Affairs Links • Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101 • SSCOMPARE: https://github.com/exponential-decay/sscompare • TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments • Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop • Apache Tika: https://tika.apache.org/ • Full Paper: Hopefully in Archives and Manuscripts sometime soon!