Mass Migration
Building a Bulk Hard Drive-to-LTO
Workflow From Scratch
Rebecca Fraimow
National Digital Stewardship Resident at WGBH
@rhfraim
80 hard drives
11,561 audiovisual files
300 TB of data
1 dedicated LTO workstation
1 dedicated archivist
. . .
Required Scripts & Documents (Initial)
AA_PBCorescript.sh: generates checksums and metadata for
each file on drive
AA_LTO_checksum.sh: generates checksums for each file on LTO
WGBH_Batch1_LimitedCSV_final.csv, WGBH-Batch2-140211.csv,
WGBH_batch3.csv, WGBH-Batch4-LimitedCSV.csv: GUID mapping
documents
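The checksum step described above can be sketched as follows. This is a minimal Python illustration, not the contents of AA_PBCorescript.sh or AA_LTO_checksum.sh: the function names and the two-column CSV layout are assumptions made for the example, and the metadata generation done by the real script is not covered.

```python
import csv
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks so large
    audiovisual files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Walk `root` (a mounted drive or tape) and return sorted
    (relative_path, md5) pairs for every file found."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rows.append((os.path.relpath(full_path, root), md5_of_file(full_path)))
    return sorted(rows)


def write_manifest(rows, csv_path):
    """Write the manifest as a CSV with a hypothetical path,md5 header."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "md5"])
        writer.writerows(rows)
```

Running build_manifest once against the hard drive and once against the LTO tape gives two lists that can be compared line by line, which is the role the two checksum scripts appear to play in the workflow.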
Some drives didn’t perform correctly when removed
from their cases
Some drives had too much content to fit on one LTO
tape
Some drives had known failed files on them that
were not separated out or identified
Some of the content turned out to be derivative
material
Some of the content had been pulled twice
Some drives turned out to have failed files that
could only be detected by manual QC
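Two of the problems above lend themselves to automated checks: content pulled twice shows up as repeated checksums, and a drive too large for one tape can be split in order. The Python sketch below illustrates both under stated assumptions. The function names are hypothetical, the tape capacity is a caller-supplied number in any consistent unit, and nothing here is taken from the WGBH scripts.

```python
from collections import defaultdict


def find_duplicates(manifest):
    """Group paths by checksum and return only the checksums that
    appear more than once, i.e. candidate double pulls."""
    by_checksum = defaultdict(list)
    for path, md5 in manifest:
        by_checksum[md5].append(path)
    return {md5: paths for md5, paths in by_checksum.items() if len(paths) > 1}


def split_across_tapes(files, tape_capacity):
    """Assign (path, size) pairs to tapes in the order given, starting
    a new tape whenever the next file would overflow the current one."""
    tapes = [[]]
    used = 0
    for path, size in files:
        if size > tape_capacity:
            raise ValueError(f"{path} is larger than one tape")
        if used + size > tape_capacity:
            tapes.append([])
            used = 0
        tapes[-1].append(path)
        used += size
    return tapes
```

The split is deliberately sequential rather than optimally packed, so that files stay in drive order and the overflow tape is easy to document. Failed files that only manual QC can catch are outside what a sketch like this can detect.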
Required Scripts & Documents (Revised)
AA_PBCorescript_with_checks.sh: restructures drive, checks for bad files and derivatives,
generates checksums and metadata for each file on drive
AA_LTO_checksum.sh: generates checksums for each file on LTO
WGBH_Batch1_LimitedCSV_final.csv, WGBH-Batch2-140211.csv, WGBH_batch3.csv,
WGBH-Batch4-LimitedCSV.csv: GUID mapping documents
AA_LTO_checksum_second_tape.sh: creates a second checksum list for overflow files
batch_qt_proofsheet.sh: creates QT_proofs for each file
proof_check.sh: QC to identify files incompatible with QT_proofs
aapb_MD5_total.csv: list of all files transferred, with checksums
corrupted_files.csv: list of files that did not pass MD5 checksum validation
derivatives.csv: list of derivative files to be removed from inclusion in the repository
md5_original_values.csv: list of all documented MD5s from before files went into Artesia DAM
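The relationship between md5_original_values.csv, aapb_MD5_total.csv, and corrupted_files.csv can be illustrated with a short Python sketch. The column names path and md5 are assumptions made for the example, since the slides do not show the real layout of these files, and the function names are hypothetical.

```python
import csv


def load_checksums(csv_path):
    """Read a CSV with hypothetical `path` and `md5` columns into a dict."""
    with open(csv_path, newline="") as handle:
        return {row["path"]: row["md5"] for row in csv.DictReader(handle)}


def validate(original, transferred):
    """Compare transferred checksums against the documented originals.

    Returns (corrupted, unverified): `corrupted` holds
    (path, expected, actual) tuples for mismatches, and `unverified`
    holds paths that have no documented original to compare against.
    """
    corrupted = []
    unverified = []
    for path in sorted(transferred):
        actual = transferred[path]
        expected = original.get(path)
        if expected is None:
            unverified.append(path)
        elif expected.lower() != actual.lower():
            corrupted.append((path, expected, actual))
    return corrupted, unverified
```

Keeping the unverified files separate from the corrupted ones avoids treating a missing original checksum as a failure. Those files would still need another form of QC, such as the proof sheets.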
QT_Proofsheets
Probably OK! NOT OK
SHARE DRIVE
HARD DRIVE
LTO
Contact:
rebecca_fraimow@wgbh.org
rebeccafraimow@gmail.com
@rhfraim
Code: https://github.com/WGBH/ltoscripts

Fraimow CURATEcamp 2015
