The Reality of Digital Transfer 
@ArchivesNZ 
Ross Spencer, Talei Masters 
Archives New Zealand 
Records Management Network Event, 
Tuesday November 25 2014 
Department of Internal Affairs
Background 
Born Digital and Cultural Heritage Conference 
Melbourne*: http://bit.ly/1utAqz0 
Spencer, Braden, Hutar, Masters, Crouch, Mosely:
Fly Away Home: Pilot Transfer of Born-digital Records at
Archives New Zealand
Collected our experiences from late 2013 through to early 
2014. Royal Commission work through to GDAP Closure 
and beginning of eAccessions. 
* http://playitagainproject.org/conference-report/ 
A missing piece of the jigsaw… 
• An appraisal of the technical challenges 
• The first of a much bigger puzzle? 
• We understood a minimal set of descriptive 
metadata e.g. transfer metadata file; mapping 
of EDRMS fields to that schema 
• But the collection profile was missing – 
technical implications of digital preservation… 
And the numbers were/are huge! 
Royal Commission on the Pike River Coal Mine Tragedy 
Two EDRMS:
AccessData Summation:
374,264 Files (200GB)
66,580 Directories
3,892 Unidentified Objects
15 Unidentified Extensions
87 Known Formats
55,425 Duplicates (Content)
Analysis time: 108 minutes
Lotus Notes DMS:
24,190 Files (5GB)
641 Directories
1,254 Unidentified Objects
8 Unidentified Extensions
62 Known Formats
6,200 Duplicates (Content)
Analysis time: 44 minutes
There’s more… 
The Canterbury Earthquakes Royal Commission (partial stats) 
One EDRMS: 
Lotus Notes DMS… (but a different flavour!) 
11,505 Files (57GB) 
246 Directories 
123 Unidentified Objects 
2 Unidentified Extensions 
55 Known Formats 
2,468 Duplicates (Content) 
Analysis time: stats not collected 
Performance of tools… 
Just one (fairly profound?) example for you… Pike River
metadata extraction, and checksum generation… ‘triage’
2949m21.680s
49 Hours!
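The timing above looks like the "real" figure printed by the Unix `time` command (an assumption; the slide does not say how it was measured). A minimal Python sketch of the conversion to hours, where the function name `parse_time_output` is made up for illustration:

```python
import re


def parse_time_output(value: str) -> float:
    """Convert a duration written as '<minutes>m<seconds>s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


seconds = parse_time_output("2949m21.680s")
hours = seconds / 3600
print(f"{seconds:.3f} s = {hours:.2f} hours")  # 176961.680 s = 49.16 hours
```

Recording the raw figure alongside the rounded one makes it easier to compare runs once tools or hardware change.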
Questions already forming… 
• How do we speed things up? 
• How do we make reporting consistent? 
• Where do we begin with this information? 
• Some answers already appearing: stats report is now
generated by a Python script in response to these
issues: https://github.com/exponential-decay/droid-sqlite-analysis
• Relies only on The National Archives’ DROID tool: file
listing, format ID, and checksumming utility
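To give a feel for the kind of report described in the bullets above, here is a minimal Python sketch that derives the same headline figures from a DROID-style CSV export. It is not the code from the linked repository. The column names (`TYPE`, `NAME`, `EXT`, `MD5_HASH`, `PUID`) are assumed to match a DROID export and should be checked against a real one, the sample rows are invented, and "duplicates" is assumed to mean files that share a checksum with at least one other file.

```python
import csv
import io
from collections import Counter

# Invented sample data; a real DROID export has more columns than this.
SAMPLE = """\
TYPE,NAME,EXT,MD5_HASH,PUID
Folder,accession,,,
File,report.doc,doc,aaa,fmt/40
File,copy of report.doc,doc,aaa,fmt/40
File,image.jpg,jpg,bbb,fmt/43
File,mystery.xyz,xyz,ccc,
"""


def profile(droid_csv_text: str) -> dict:
    """Summarise a DROID-style CSV export into headline collection statistics."""
    rows = list(csv.DictReader(io.StringIO(droid_csv_text)))
    folders = [r for r in rows if r["TYPE"] == "Folder"]
    files = [r for r in rows if r["TYPE"] != "Folder"]
    # Assumption: a file with no PUID was not identified.
    unidentified = [r for r in files if not r["PUID"]]
    checksums = Counter(r["MD5_HASH"] for r in files if r["MD5_HASH"])
    return {
        "files": len(files),
        "directories": len(folders),
        "unidentified_objects": len(unidentified),
        "unidentified_extensions": len({r["EXT"].lower() for r in unidentified}),
        "known_formats": len({r["PUID"] for r in files if r["PUID"]}),
        # Assumption: count every file whose checksum occurs more than once.
        "duplicates": sum(n for n in checksums.values() if n > 1),
    }


stats = profile(SAMPLE)
for label, value in stats.items():
    print(f"{label}: {value}")
```

Generating the figures from one script, rather than by hand, is what makes the reporting consistent from one accession to the next.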
eAccession One [e1] 
Legacy accessions give us the opportunity to apply lessons
learned from the Initial Digital Transfers…
175 Files (166.5 MB)
10 Directories 
0 Unidentified Objects 
0 Unidentified Extensions 
7 Known Formats 
0 Duplicates (content) 
eAccession Four [e4] 
eAccessions were seen to be the least complex and allowed 
us to focus, primarily, on the challenge of ingest… 
1295 Files (565.0 MB)
6 Directories
2 Unidentified Objects
1 Unidentified Extension
12 Known Formats 
2 Duplicates (content) 
Note: Obscured issue in original statistics…
A number of false positives! System files
identified as something more generic.
Thumbnail preview files (Thumbs.db) and Serif PagePlus
files can look like MS Office (OLE2) file-like
objects.
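The false positives in the note above arise because several unrelated file types share the same OLE2 (Compound File Binary) container signature. A minimal Python sketch of a review flag follows. The signature bytes are the standard OLE2 magic number; the function names are hypothetical, and the `.ppp` extension for Serif PagePlus is an assumption, not something the slides state.

```python
# Standard 8-byte signature at the start of an OLE2 / Compound File Binary file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical review lists; extend them from what a real collection shows.
REVIEW_NAMES = {"thumbs.db"}
REVIEW_EXTENSIONS = {".ppp"}  # assumed Serif PagePlus extension


def is_ole2(header: bytes) -> bool:
    """Return True if the leading bytes carry the OLE2 container signature."""
    return header.startswith(OLE2_MAGIC)


def needs_review(name: str, header: bytes) -> bool:
    """Flag OLE2 containers whose name suggests they are not office documents."""
    lowered = name.lower()
    named_like_system_file = lowered in REVIEW_NAMES or any(
        lowered.endswith(ext) for ext in REVIEW_EXTENSIONS
    )
    return is_ole2(header) and named_like_system_file


ole2_header = OLE2_MAGIC + bytes(8)
print(needs_review("Thumbs.db", ole2_header))    # True
print(needs_review("minutes.doc", ole2_header))  # False
```

A flag like this does not identify the file. It only marks a generic container match for an archivist to look at in context.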
Technical Challenges in e1 and e4 
• [Tools] Ability to handle multi-byte character encodings, e.g. Māori
macrons such as ‘Ā’.
• [Tools] Unidentified files and false positives. 
• [Tools] Recording of pre-conditioning actions on ingest into digital 
preservation system. 
• [Tools] Implementing CSV ingest mechanism; configuration, code, and 
workflow. 
• [Pre-conditioning / Tools] Ability of the digital preservation system
(Rosetta) to handle contiguous spaces in filenames.
• [Pre-conditioning] One invalid JPEG. Required rearrangement of 
application marker segments. 
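Two of the challenges listed above, multi-byte characters and contiguous spaces in filenames, can be illustrated with a short Python sketch. It shows one hypothetical pre-conditioning step that collapses runs of spaces and records what was changed, so the action can be documented on ingest. The slides do not say this is how the filenames were actually handled; the class and function names, and the sample filename, are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PreconditioningAction:
    """A record of one change made to a file before ingest."""
    original: str
    conditioned: str
    reason: str


def collapse_spaces(filename: str) -> Tuple[str, Optional[PreconditioningAction]]:
    """Collapse runs of spaces to one space and describe the change, if any."""
    conditioned = re.sub(r" {2,}", " ", filename)
    if conditioned == filename:
        return filename, None
    action = PreconditioningAction(filename, conditioned, "collapsed contiguous spaces")
    return conditioned, action


# A macron vowel such as 'Ā' occupies two bytes in UTF-8, which trips up
# tools that assume one byte per character.
print("Ā".encode("utf-8"))  # b'\xc4\x80'

new_name, action = collapse_spaces("Māori  land   records.doc")
print(new_name)  # Māori land records.doc
print(action)
```

Keeping the original name in the action record preserves the provenance trail even though the ingested filename differs.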
What next..? 
• One step at a time. Accessions e1 and e4; develop capability 
further with e2 and e3. 
• Incorporate metadata extraction tool JHOVE into process 
following experience with e1 and e4, possibly via FITS 
• Refine current metrics and the presentation of statistics, e.g.
make them more useful for Archivists working on the born-digital
records we already hold…
• Ideal: Archivists’ knowledge (processes, analysis, diagnosis)
becomes actuated.
What next..? 
• SCALE! 
Thank you! 



Editor's Notes

  1. ** Stats generated by a prototype analysis tool in concert with The National Archives DROID tool – work to do to improve further ** Temptation to lump both EDRMS together and look at them as a single accession, but this masks a separate issue. Extract of files, and metadata, and mapping of that metadata from two different systems is a challenge in itself… These numbers come from the initial transfers project/Government Digital Archive Project (GDAP), which was closed down. Files not ingested. Files remain in custody of DIA Records Team.
  2. ** Stats generated by a prototype analysis tool in concert with The National Archives DROID tool – work to do to improve further ** These numbers come from the initial transfers project/Government Digital Archive Project (GDAP), which was closed down. Files not ingested. Files remain in custody of DIA Records Team.
  3. Just the beginning of the data we need to collect. A triage dataset to improve decision making. JHOVE/Tika ID/File/SHA1SUM. Further analysis needed on the analysis!
  4. At this point we’re already seeing the direction we need to take things… Reporting script an output of these questions. Improving consistency / repeatability etc. Reporting script available from GitHub and DROID available from The National Archives, UK website. The output of DROID can be drilled into to understand collections, e.g. number of duplicates found across different sets of folders. Open source. Useful to agencies embarking on migration project. Collection profiling.
  5. Following the closure of GDAP the Digital Continuity team started work on legacy accessions already in our possession: eAccessions. Smaller and less complex. Still enough challenges to push our knowledge forward.
  6. Thumbs.db identified as their family file format – OLE2. Masked their true essence. Serif PagePlus also…
  7. Tool support for Unicode was found lacking through the process. Excel, DROID, our digital preservation system, our own Python script during initial prototyping. Correspondence with developers of DROID to improve tool, to get it to support Unicode. We used pre-release versions of DROID (6.1.4) (as testers) for much of our testing. Work required (collaboration, BL, TNA, SRNSW) to incorporate new identification mechanisms in DROID tool. Pre-conditioning required on JPEG to provide adequate provenance trail on ingest.
  8. Analogy: A medical doctor isn’t always referring to their books. Their knowledge is inbuilt and instinctive. Example: False positives in format ID. Format recognised, but nonsense in context. Example: Duplicates, knowledge of workflow in tools useful in making decisions. Some CERC duplicates came from repetition of email footer images across collection. Whose problem does this become? Agencies (re: management)? Ours (re: storage optimisation (store just one and link?))? A slight distraction. Management of this type of issue up-front is desirable in helping to reduce pain during technical appraisal / ingest.
  9. Simple! We can do this on smaller accessions… It just needs to be scaled!!! ;) How hard can that be?!