Department of Internal Affairs
Time Traveling Analyst: The Things Only
a Time Machine Can Tell Me…
Ross Spencer - @beet_keeper
Archives New Zealand
#ARANZ2015
Tuesday September 7 2015
Sun image, R24685027, E4, Archway, Archives New Zealand.
http://www.archway.archives.govt.nz/ViewFullItem.do?code=24685027&digital=yes
Background
Two sets of born-digital ingest, Minister's Papers, 'code-named' E1 and E4
(first sets) and E2 and E3 (second sets).
First sets selected for simplicity.
Second sets followed numerical sequence and were used as a
learning exercise.
Complexity grew.
First sets enabled creation of CSV ingest mechanism, configuration
of Rosetta, creation of process.
Second sets enabled the proof of that method.
Approximate collection breakdowns at the
beginning of the process…
● E1~
● 175 Files
● 10 Directories
● 0 Unidentified Objects
● 0 Unidentified Extensions
● 7 Known Formats
● E4~
● 1295 Files
● 6 Directories
● 2 Unidentified Objects
● 1 Unidentified Extensions
● 12 Known Formats
N.B. E4 also contained two
identification false positives.
Approximate collection breakdowns at the
beginning of the process…
• E2~
• 2519 Files
• 177 Directories
• 5 Unidentified Objects
• 4 Unidentified Extensions
• 22 Known Formats
• 25 Extension Mismatches
• E3~
• 1748 Files
• 144 Directories
• 8 Unidentified Objects
• 5 Unidentified Extensions
• 12 Known Formats
• 37 Extension Mismatches
N.B. Both collections
contained empty folders,
empty files, and multiple-id
formats.
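The filesystem-level figures in these breakdowns (files, directories, empty folders, empty files) can be gathered with a short script. The following is a minimal sketch in Python, not the tooling used for these collections; `collection_breakdown` is a hypothetical helper name. Format identification figures (unidentified objects, extension mismatches, multiple-id formats) would still need a tool such as DROID or Siegfried.

```python
import os


def collection_breakdown(root):
    """Count files, directories, zero-byte files and empty folders under root."""
    counts = {"files": 0, "directories": 0, "empty_files": 0, "empty_folders": 0}
    for dirpath, dirnames, filenames in os.walk(root):
        counts["directories"] += len(dirnames)
        counts["files"] += len(filenames)
        # A folder with no children at all is empty; the root itself is not counted.
        if not dirnames and not filenames and dirpath != root:
            counts["empty_folders"] += 1
        for name in filenames:
            if os.path.getsize(os.path.join(dirpath, name)) == 0:
                counts["empty_files"] += 1
    return counts
```

Calling `collection_breakdown("/path/to/collection")` returns a dictionary of the four counts, which can be recorded before ingest begins.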
Let's begin with a story...
E1, the simplest... Enabled us to develop an ingest mechanism for
heterogeneous collections – and it worked!
E4, not that different, slightly larger, about as 'known', but!
An unexpected exception discovered in the relationship between
the preservation system and some of the filenames in the
collection...
Where do astronauts go for a beer?
The...
We had filenames with multiple spaces in
them...
E.g. 'A [space] [space] Filename.docx'
An innocuous enough looking problem... Our digital
preservation system couldn't handle them...
Investigate the system...
...
Confirm it's the system...
…
Ask vendor to fix the problem...
…
No fix forthcoming for next release...
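Filenames like the example above can be found before ingest with a simple scan. This is a minimal sketch in Python, assuming the problem is any run of two or more consecutive spaces in a file or folder name; `find_multi_space_names` is a hypothetical helper, not a feature of the preservation system.

```python
import os
import re

# Two or more consecutive spaces anywhere in a name.
MULTI_SPACE = re.compile(r" {2,}")


def find_multi_space_names(root):
    """Return paths under root whose file or folder name contains repeated spaces."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if MULTI_SPACE.search(name):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Running `find_multi_space_names` over a transfer folder gives a list of paths to review before anything reaches the preservation system.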
What now...?
Change filenames?
...
Serious change, this is how we received them!
…
Record provenance...
…
Mechanisms in METS metadata schema [EVENT]
…
How to implement?
We continue...
Configure CSV to handle EVENT fields...
...
Modify CSV generation tool to output blank EVENT fields...
…
Test ingest in system until configuration is perfected
…
Mechanism works so pre-condition filenames...
...
Record R-Numbers* and design provenance note controlled list...
…
Add data to CSV
…
DONE!!!!
*Dependency on listing being fixed in Archway
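The steps above (pre-condition filenames, then record the change as provenance in the CSV) might look something like the sketch below. It is an illustration in Python under stated assumptions: the column names in `FIELDS`, the event type wording, and the default note are all hypothetical, since the actual ingest CSV layout and controlled list are not given here.

```python
import csv
import io
import re

# Hypothetical column names; the real ingest CSV layout is not shown in the slides.
FIELDS = ["original_name", "ingest_name", "event_type", "event_note"]


def precondition(name):
    """Collapse each run of spaces in a filename to a single space."""
    return re.sub(r" {2,}", " ", name)


def provenance_rows(names, note="Multiple spaces in filename collapsed before ingest"):
    """Yield one row per name; event fields stay blank when nothing changed."""
    for name in names:
        new_name = precondition(name)
        changed = new_name != name
        yield {
            "original_name": name,
            "ingest_name": new_name,
            "event_type": "filename change" if changed else "",
            "event_note": note if changed else "",
        }


def write_csv(names):
    """Return the provenance rows for the given names as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(provenance_rows(names))
    return buffer.getvalue()
```

Keeping the original name alongside the ingest name means the change stays traceable even though the file on disk no longer matches what was received.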
Test in digital preservation system fails...
...
UTF-8 character encoding...
…
How to preserve in Excel?
…
…
Import using special ribbon in Excel...
…
Add notes to sheet...
…
DONE?!
…
Not even now... >.<
Nope...
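One common workaround for spreadsheet encoding trouble, offered here as an alternative and not as what was done in this project, is to write the CSV as UTF-8 with a byte order mark, which Excel generally uses to detect the encoding when a file is opened directly. A minimal sketch in Python; the function names are hypothetical.

```python
import csv


def write_csv_for_excel(path, rows):
    """Write rows as UTF-8 with a leading byte order mark (BOM)."""
    # The "utf-8-sig" codec prepends the bytes EF BB BF when writing.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv_utf8(path):
    """Read the rows back, dropping the BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        return list(csv.reader(handle))
```

Whether the downstream preservation system tolerates the BOM is a separate question and would need testing before relying on this.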
It can become exhausting...
As a speaker! And for the audience!!! ^_^;
...Time and date based data becomes a problem...
...Asking non-expert users to do the same...
...Even power tools like Open Office suffer issues...
...E4 went in after solving the UTF-8 issues...
...E2 and E3 suffered from issues with time/date information on top
But we learn and move onwards and
upwards...
The work isn't straightforward
● It pushes out time-frames...
● And the problems we're solving aren't what we expected...
● We need to develop with the problem...
But we have new tools...
Tools to create provenance information in CSV for ingest into the
digital preservation system.
Tools to identify files with this issue up front.
The digital preservation system is fixed, so this specific use-case
for us is unlikely to occur again.
We have gained new experience.
For E2 and E3, we created a mechanism for building an ingest
'mash-up' using a separate provenance spreadsheet.
For our next ingest we have a macro to automate an Excel
import!!!!! ← IN MICROSOFT?!!!!
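An ingest 'mash-up' of this kind amounts to joining two tables on a shared column. The sketch below shows one way to do that in Python; it is not the project's tool, and the `filename` key and the sample column names in the usage note are assumptions made for illustration.

```python
import csv
import io


def merge_provenance(ingest_csv, provenance_csv, key="filename"):
    """Add provenance columns to ingest rows, matching on a shared key column."""
    prov_reader = csv.DictReader(io.StringIO(provenance_csv))
    provenance = {row[key]: row for row in prov_reader}
    extra_fields = [f for f in (prov_reader.fieldnames or []) if f != key]
    merged = []
    for row in csv.DictReader(io.StringIO(ingest_csv)):
        match = provenance.get(row[key], {})
        for field in extra_fields:
            # Ingest values win on a clash; rows without a match get blank fields.
            row.setdefault(field, match.get(field, ""))
        merged.append(row)
    return merged
```

Given an ingest sheet with `filename,title` and a provenance sheet with `filename,event_note`, each merged row carries all three columns, with `event_note` left blank where no provenance was recorded.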
We have what seems like an inexhaustible
list...
● [Tools] Ability to handle multi-byte character encodings. Māori macrons,
‘Ā’, in DROID, digital preservation system, spreadsheets, etc.
● [Tools] Unidentified files and false positives - contribute to
● [Tools] Zero-byte files, empty folders
● [Tools] System files
● [Tools] Digital preservation system’s capabilities; dates, delivery,
metadata extraction, etc.
● [Files] Invalid objects
● [Files] Templates, objects with auto-fields
And we'd never have guessed these up
front...
● What are the next challenges?
● We'd be too conservative, or too O.T.T...
● WE NEED A TIME MACHINE!!!
Questions?
We don't need a time machine at all...
● We need evidence!
● We need to practice!
● We need to do!
● Time-frames will be pushed out
● In a world that loves strategy, it's
terribly detail focused.
● Can someone figure it out first?
● Definition of Leadership!
● But you will almost certainly find
new exceptions... as will we.
Ground process and policy in the real
world…
● We can reduce surprises...
● But we can't reduce them to zero...
● Find the exceptions, create rules, and encode them
in those policies...
● Move one step at a time, with modest increments.
● Flexible endpoints / reasonable / multiple goals...
● Q. HOW DID WE GET THESE FILES??
● A. It doesn't matter, we have to deal with them...
Evidence will…
● Inform policy
● Inform Procedures
➔ Tools
➔ Skills
➔ Appetite
➔ Strategy
Writing these documents becomes a much
more advanced thought experiment with a
greater number of inputs from a greater
number of people, and experiences...
Robustness Principle... (Postel's Law)
e.g. checksums
“Be conservative in what you do; be liberal in what you accept
from others.”
Follow standards... mechanisms should accept non-conforming
input as long as the meaning is clear...
Be prepared to understand material, be prepared to manage it.
A way of doing things... not the only way... WRITE OTHER
SOLUTIONS! RE-WRITE YOUR SOLUTIONS!
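One reading of the checksum example, and it is only an interpretation, is that a tool should accept a supplied checksum in whatever reasonable form it arrives (upper case, stray whitespace) while always emitting one strict form itself. A minimal sketch in Python; SHA-256 and the function names are assumptions chosen for illustration.

```python
import hashlib
import re


def normalise_checksum(value):
    """Liberal on input: strip whitespace and ignore case. Strict on output: lowercase hex."""
    cleaned = re.sub(r"\s+", "", value).lower()
    if not re.fullmatch(r"[0-9a-f]+", cleaned):
        raise ValueError("not a hexadecimal checksum: %r" % value)
    return cleaned


def sha256_of(data):
    """Conservative in what we produce: always a lowercase hex digest."""
    return hashlib.sha256(data).hexdigest()


def checksum_matches(data, supplied):
    """Compare our digest of data with a supplied checksum in any accepted form."""
    return sha256_of(data) == normalise_checksum(supplied)
```

The meaning of the input stays clear in each accepted variation, which is the condition the principle sets; anything that is not hexadecimal is still rejected.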
Other tools for you...
DROID (National Archives UK):
http://www.nationalarchives.gov.uk/information-management/manage-information/policy-proce
Or Siegfried (State Records NSW): https://github.com/richardlehane/siegfried
DROID Analysis Tool: https://github.com/exponential-decay/droid-sqlite-analysis
Other presentations: http://www.slideshare.net/RossSpencer/presentations
Blogs (Open Preservation Foundation):
http://openpreservation.org/knowledge/blogs/
Record Keeping Toolkit (Archives New Zealand):
http://www.records.archives.govt.nz/
Share yours too!
Who do digital preservation analysts
want to drink a beer with?
Commander Hadfield!
https://twitter.com/cmdr_hadfield
TED:
What I learned from going blind in space?
Star Talk:
http://www.startalkradio.net/show/social-media-i
“It’s almost comical that astronauts are stereotyped as daredevils and
cowboys. As a rule, we’re highly methodical and detail-oriented. Our
passion isn’t for thrills but for the grindstone, and pressing our noses to
it. We have to: we’re responsible for equipment that has cost taxpayers
many millions of dollars, and the best insurance policy we have on our
lives is our own dedication to training. Studying, simulating, practicing
until responses become automatic—astronauts don’t do all this only to
fulfill NASA’s requirements. Training is something we do to reduce the
odds that we’ll die.”
― Chris Hadfield, An Astronaut's Guide to Life on Earth
The Right Stuff
What next..?
Questions!
Thank you!
