Presentation for the Archives New Zealand Records Management Network Event describing the reality of digital transfer. It looks at the potential scale of digital transfers, drawing on the largest collections we investigated during the initial transfers project, and compares them with the accession work we are investigating at the time of writing. It also looks at some of the challenges involved and how we are tackling them.
1. The Reality of Digital Transfer
@ArchivesNZ
Ross Spencer, Talei Masters
Archives New Zealand
Records Management Network Event,
Tuesday November 25 2014
Department of Internal Affairs
2. Background
Born Digital and Cultural Heritage Conference
Melbourne*: http://bit.ly/1utAqz0
Spencer, Braden, Hutar, Masters, Crouch, Mosely, Fly
Away Home: Pilot Transfer of Born-digital Records at
Archives New Zealand
Collected our experiences from late 2013 through to early
2014. Royal Commission work through to GDAP Closure
and beginning of eAccessions.
* http://playitagainproject.org/conference-report/
3. A missing piece of the jigsaw…
• An appraisal of the technical challenges
• The first of a much bigger puzzle?
• We understood a minimal set of descriptive
metadata e.g. transfer metadata file; mapping
of EDRMS fields to that schema
• But the collection profile was missing –
technical implications of digital preservation…
4. And the numbers were/are huge!
Royal Commission on the Pike River Coal Mine Tragedy
Two EDRMS: AccessData Summation and Lotus Notes DMS
AccessData Summation:
374,264 Files (200GB)
66,580 Directories
3,892 Unidentified Objects
15 Unidentified Extensions
87 Known Formats
55,425 Duplicates (Content)
Analysis time: 108 minutes
Lotus Notes DMS:
24,190 Files (5GB)
641 Directories
1,254 Unidentified Objects
8 Unidentified Extensions
62 Known Formats
6,200 Duplicates (Content)
Analysis time: 44 minutes
5. There’s more…
The Canterbury Earthquakes Royal Commission (partial stats)
One EDRMS:
Lotus Notes DMS… (but a different flavour!)
11,505 Files (57GB)
246 Directories
123 Unidentified Objects
2 Unidentified Extensions
55 Known Formats
2,468 Duplicates (Content)
Analysis time: stats not collected
6. Performance of tools…
Just one (fairly profound?) example for you… Pike River
metadata extraction and checksum generation… ‘triage’
2949m21.680s
49 Hours!
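As a quick check of the figure above, the elapsed time can be converted to hours. This is a minimal sketch: the function name `elapsed_hours` is hypothetical, and it assumes the `<minutes>m<seconds>s` layout shown on the slide (which resembles the output of the shell `time` built-in, though the slide does not say which tool produced it).

```python
import re


def elapsed_hours(elapsed: str) -> float:
    """Convert a '<minutes>m<seconds>s' string to hours."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", elapsed.strip())
    if match is None:
        raise ValueError(f"unrecognised elapsed time: {elapsed!r}")
    minutes = int(match.group(1))
    seconds = float(match.group(2))
    return (minutes * 60 + seconds) / 3600


# 2949 minutes and 21.68 seconds is a little over 49 hours.
print(round(elapsed_hours("2949m21.680s"), 1))
```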
7. Questions already forming…
• How do we speed things up?
• How do we make reporting consistent?
• Where do we begin with this information?
• Some answers already appearing: the stats report is now
generated by a Python script written in response to these
issues: https://github.com/exponential-decay/droid-sqlite-analysis
• Relies only on The National Archives’ DROID tool: a file
listing, format identification, and checksumming utility
8. eAccession One [e1]
Legacy accessions that give us the opportunity to apply lessons
learned from the Initial Digital Transfers…
175 Files (166.5 MB)
10 Directories
0 Unidentified Objects
0 Unidentified Extensions
7 Known Formats
0 Duplicates (content)
9. eAccession Four [e4]
eAccessions were seen to be the least complex and allowed
us to focus, primarily, on the challenge of ingest…
1295 Files (565.0 MB)
6 Directories
2 Unidentified Objects
1 Unidentified Extension
12 Known Formats
2 Duplicates (content)
Note: An issue obscured in the original statistics…
A number of false positives! System files
identified as something more generic.
Thumbnail preview files (Thumbs.db) and Serif PagePlus
files can look like MS Office (OLE2) file-like
objects.
10. Technical Challenges in e1 and e4
• [Tools] Ability to handle multi-byte character encodings, e.g. Māori macrons
‘Ā’.
• [Tools] Unidentified files and false positives.
• [Tools] Recording of pre-conditioning actions on ingest into digital
preservation system.
• [Tools] Implementing CSV ingest mechanism; configuration, code, and
workflow.
• [Pre-conditioning / Tools] Ability of the digital preservation system (Rosetta)
to handle contiguous spaces in filenames.
• [Pre-conditioning] One invalid JPEG. Required rearrangement of
application marker segments.
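The filename-related items in the list above can be sketched as a small pre-conditioning step that records what it changed, so the actions can be carried into the provenance trail. This is a minimal illustration, not the workflow actually used: the function name `precondition_filename`, the choice of NFC normalisation, and the decision to collapse runs of spaces are all assumptions.

```python
import re
import unicodedata


def precondition_filename(name: str):
    """Return (new_name, actions), where actions lists each change made."""
    actions = []
    # Normalise to NFC so 'a' + combining macron becomes the single code point.
    normalised = unicodedata.normalize("NFC", name)
    if normalised != name:
        actions.append("normalised Unicode to NFC")
    # Collapse runs of two or more spaces into a single space.
    collapsed = re.sub(r" {2,}", " ", normalised)
    if collapsed != normalised:
        actions.append("collapsed contiguous spaces")
    return collapsed, actions


# 'a' followed by U+0304 (combining macron), then two spaces.
new_name, log = precondition_filename("Nga\u0304  Tohu.pdf")
print(log)
```

Returning the list of actions alongside the new name keeps the record of pre-conditioning separate from the change itself, which is the part the slide identifies as a tooling gap.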
11. What next..?
• One step at a time. Accessions e1 and e4; develop capability
further with e2 and e3.
• Incorporate metadata extraction tool JHOVE into process
following experience with e1 and e4, possibly via FITS
• Refine current metrics and the presentation of statistics, e.g.
make them more useful for Archivists working on the born-digital
records we already hold…
• Ideal: Archivists’ knowledge (processes, analysis, diagnosis)
becomes actuated.
12. What next..?
• SCALE!
Thank you!
** Stats generated by a prototype analysis tool in concert with The National Archives DROID tool – work to do to improve further **
It is tempting to lump both EDRMS together and look at them as a single accession, but this masks a separate issue:
extracting files and metadata, and mapping that metadata, from two different systems is a challenge in itself…
These numbers come from the initial transfers project/Government Digital Archive Project (GDAP) which was closed down.
Files not ingested.
Files remain in custody of DIA Records Team
Just the beginning of the data we need to collect
A triage dataset to improve decision making
JHOVE/Tika ID/File/SHA1SUM
Further analysis needed on the analysis!
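A triage dataset of this kind starts with a path, a size, and a checksum per file. The sketch below shows one way to collect those with Python's standard library; the function names `sha1_of_file` and `triage` are hypothetical, and this is not the tooling used for the Pike River run.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute a SHA-1 digest by reading the file in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(root):
    """Yield (relative path, size in bytes, SHA-1) for every file under root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield str(path.relative_to(root)), path.stat().st_size, sha1_of_file(path)
```

Reading in chunks keeps memory use flat regardless of file size, which matters more than raw speed once a collection reaches hundreds of thousands of files.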
At this point we’re already seeing the direction we need to take things…
Reporting script an output of these questions. Improving consistency / repeatability etc.
Reporting script available from GitHub and DROID available from The National Archives, UK website
The output of DROID can be drilled into to understand collections, e.g. number of duplicates found across different sets of folders
Open source. Useful to agencies embarking on migration project. Collection profiling.
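To illustrate the kind of drilling-in described above, the sketch below groups rows from a DROID-style CSV export by checksum to find content duplicates. It is not the reporting script itself. The column names `FILE_PATH` and `HASH` are assumptions about the export layout and are left as parameters, and the function name `duplicates_by_checksum` and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def duplicates_by_checksum(rows, hash_column="HASH", path_column="FILE_PATH"):
    """Return {checksum: [paths]} for checksums shared by more than one file."""
    groups = defaultdict(list)
    for row in rows:
        checksum = (row.get(hash_column) or "").strip()
        if checksum:  # rows without a checksum (e.g. folders) are skipped
            groups[checksum].append(row[path_column])
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


# Made-up sample standing in for an export file.
sample = io.StringIO(
    "FILE_PATH,HASH\n"
    "a/logo.png,1111\n"
    "b/logo.png,1111\n"
    "a/report.doc,2222\n"
)
dupes = duplicates_by_checksum(csv.DictReader(sample))
print(dupes)
```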
Following the closure of GDAP, the Digital Continuity team started work on legacy accessions already in our possession.
eAccessions.
Smaller and less complex. Still enough challenges to push our knowledge forward.
Thumbs.db files were identified as their parent file format family, OLE2, which masked their true essence.
Serif PagePlus also…
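One simple way to surface this kind of false positive is to flag rows whose filename marks them as system files, whatever format was reported. This is a minimal sketch: the column names `NAME` and `FORMAT_NAME`, the `SUSPECT_NAMES` list, and the function name are all assumptions for illustration.

```python
# Illustrative list only; extend it as an accession reveals more system files.
SUSPECT_NAMES = {"thumbs.db"}


def flag_possible_false_positives(rows, name_column="NAME", format_column="FORMAT_NAME"):
    """Return (name, reported format) pairs for files that deserve a second look."""
    flagged = []
    for row in rows:
        if row[name_column].lower() in SUSPECT_NAMES:
            flagged.append((row[name_column], row[format_column]))
    return flagged


rows = [
    {"NAME": "Thumbs.db", "FORMAT_NAME": "OLE2 Compound Document Format"},
    {"NAME": "minutes.doc", "FORMAT_NAME": "Microsoft Word Document"},
]
print(flag_possible_false_positives(rows))
```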
Tool support for Unicode was found lacking throughout the process: in Excel, DROID, our digital preservation system, and our own Python script during initial prototyping.
Correspondence with developers of DROID to improve tool, to get it to support Unicode.
We used pre-release versions of DROID (6.1.4) (as testers) for much of our testing.
Work required (collaboration, BL, TNA, SRNSW) to incorporate new identification mechanisms in DROID tool.
Pre-conditioning required on JPEG to provide adequate provenance trail on ingest.
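Before rearranging anything in a JPEG, it helps to list the marker segments in the order they appear. The sketch below is a simplified walk of the segments up to the start of scan. It does not handle fill bytes or markers without a length field, and the source does not say which segments were out of order in the accession, so the sample bytes are made up.

```python
import struct


def list_jpeg_segments(data: bytes):
    """Return [(marker, payload_length)] for segments up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker; not a JPEG stream")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected marker at offset {pos}")
        marker = data[pos + 1]
        # The two-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((f"FF{marker:02X}", length - 2))
        if marker == 0xDA:  # SOS: compressed image data follows
            break
        pos += 2 + length
    return segments


# Made-up stream: SOI, APP1, APP0, SOS header, EOI.
sample = (
    b"\xff\xd8"
    + b"\xff\xe1" + struct.pack(">H", 4) + b"ab"
    + b"\xff\xe0" + struct.pack(">H", 3) + b"c"
    + b"\xff\xda" + struct.pack(">H", 2)
    + b"\xff\xd9"
)
print(list_jpeg_segments(sample))
```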
Analogy: A medical doctor isn’t always referring to their books. Their knowledge is inbuilt and instinctive.
Example: False positives in format ID. Format recognised, but nonsense in context.
Example: Duplicates, knowledge of workflow in tools useful in making decisions. Some CERC duplicates came from repetition of email footer images across collection.
Whose problem does this become? Agencies (re: management)? Ours (re: storage optimisation (store just one and link?))?
A slight distraction. Management of this type of issue up-front is desirable in helping to reduce pain during technical appraisal / ingest.
Simple! We can do this on smaller accessions… It just needs to be scaled!!! ;)
How hard can that be?!