
Building the AAPB: Inter-Institutional Preservation and Access Workflows



Presentation given by Charles Hosale, Special Projects Assistant at WGBH/American Archive of Public Broadcasting; Leslie Bourgeois, Archivist at Louisiana Public Broadcasting; Ann Wilkens, Archivist at Wisconsin Public Television; and Rachel Curtis, AAPB Digital Conversion Specialist and Project Coordinator at the Library of Congress. The presentation was given at the 2017 Association of Moving Image Archivists conference in New Orleans.


  1. 1. Building the AAPB: Inter-Institutional Preservation and Access Workflows
  2. 2. @amarchivepub
  3. 3. A collaboration between WGBH and the Library of Congress, seeking to preserve and make accessible significant historical content created by public media, and to coordinate a national effort to save at-risk public media before its content is lost to posterity
  4. 4. LBJ Library photo by Yoichi Okamoto
  5. 5. “Created in Boston, shared with the world”
  6. 6. Library of Congress
  7. 7. Mission • Be a focal point for discoverability of historical public media content; • Coordinate a national effort to preserve and make accessible historical public media content; • Provide content creators with standards and best practices, guidance, training, and advice for storing, processing, preserving, and making accessible their historical content, and for raising funds in order to accomplish these tasks; • Disseminate content widely by facilitating the use of archival public media content by scholars, educators, students, journalists, media producers, researchers, and the public, for the purpose of learning, informing, and teaching; • Increase public awareness of the significance of historical public media and the need to preserve and make accessible significant public broadcasting programs; and • Ensure the perpetuation of the archive by working toward financial sustainability.
  8. 8. Initial Collection • Identified over 3 million items kept at stations, archives, producers, and university collections across the country • Collected 2.5 million inventory records from 120 stations • Digitized and ingested 40,000 hours of material initially from over 100 stations – 5,000 hours from born digital files
  9. 9. Collection growth • Growing the collection by up to 25,000 hours of digitized content per year • Assisting collection holders with digitization grant proposals and ingesting digital files into our systems • Recent acquisitions – PBS NewsHour and predecessor series – American Masters raw interviews – Ken Burns’ The Civil War raw interviews – Eyes on the Prize raw interviews – NHPR presidential primary collection – KBOO community radio programs – NPACT coverage of Senate Watergate Hearings – Southern California Public Radio environmental collection – Vision Maker Media films
  10. 10. Goal: A Centralized Web Portal for Discovery • All AAPB digitized content on specific topics discoverable through single searches • Direct links to public media on other sites • One-stop shopping for users • Helps solve the separate silos syndrome • DPLA as a model
  11. 11. “Preservation through Collaboration” • within our WGBH AAPB team • with the Library of Congress project partners • with content creators/donors • with legal counsel & Berkman Klein Center • with digitization vendors • with our marketing team • with technical development partners • with LIS programs • with scholars
  12. 12. The Challenges at Hand • Too much material, too few resources • Access and rights • Technology moving too fast • Maintaining relationships with many donors • Copyright
  13. 13. WPT Ann Wilkens Media Archivist
  14. 14. WPT Preservation Workflow
  15. 15. Problem Solved!
  16. 16. buggy
  18. 18. Louisiana Public Broadcasting Leslie Bourgeois Archivist Preservation & Access Workflows
  19. 19. Louisiana Public Broadcasting • Statewide PBS affiliate except for New Orleans • Headquartered in Baton Rouge • Went on the air on September 6, 1975
  20. 20. American Archive • Participating station since 2009 • Created first comprehensive inventory • Digitized 550 hours of at-risk media
  21. 21. Louisiana Digital Media Archive • Collaborative project with the Louisiana State Archives • Launched on January 20, 2015 • Available at
  22. 22. National Digital Stewardship Residency • Hired Eddy Colloton as AAPB NDSR resident • Documented digitization workflow • Created digital preservation plan • Updated digitization workflow based on Eddy’s recommendations
  24. 24. LDMA Access Workflow • All cataloging is done in our PBCore-based MySQL database • Developed an API between the database and the LDMA front end
  25. 25. LDMA Record Example
  26. 26. AAPB Access Workflow • One of the first AAPB stations to use the AAPB as a portal • Send updated metadata and the LDMA link to AAPB in a .csv file • Do not send in digital files from in-house digitization project
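The .csv hand-off described on this slide could be sketched roughly as follows. This is a minimal illustration only: the column names in `FIELDS` and the function `build_update_csv` are hypothetical, since the slides do not show the template AAPB actually receives.

```python
import csv
import io

# Hypothetical column names; the real AAPB/LDMA template is not shown in the slides.
FIELDS = ["local_identifier", "title", "description", "ldma_url"]


def build_update_csv(records):
    """Serialize updated catalog records to CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Skip records without a link, since the link is what the portal points to.
        if not record.get("ldma_url"):
            continue
        writer.writerow({field: record.get(field, "") for field in FIELDS})
    return buffer.getvalue()
```

Because no digital files are sent in this workflow, the link column is the only pointer to the media, which is why the sketch refuses to emit a row without one.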
  27. 27. AAPB Record Example
  28. 28. Not a Perfect Solution • AAPB records are entered at the episode level • We catalog our newsmagazines to the segment level • Will work with AAPB team to find a solution
  29. 29. AAPB Access Workflows Charles Hosale Special Projects
  30. 30. Incoming Collections • Working with multiple digitization vendors, including those contracted by AAPB and contracted by contributing organizations, to receive content into AAPB • Also acquiring born digital and previously digitized content submitted to us by donors • Requires AAPB to be flexible in workflows and acceptance criteria
  31. 31. Project Workflow Phases • Appraisal • Contracting • Ingestion (Digitization) • Digital Preservation • Arrangement & Description • Access
  32. 32. Appraisal – Three types of projects • AAPB identifies collection to be preserved, guided by Collection Development Policy – AAPB works with content creator/steward on digitization grant proposal (unless already digitized/born digital) – Content creator confirms total number of hours and assets to be delivered, and provides item-level appraisal inventory – AAPB team determines when the collection can be acquired and plans project workflow • Content creator contacts AAPB seeking to preserve a collection – Content creator provides total number of hours and assets to be delivered, and provides summary of collection – AAPB team determines when the collection can be acquired and plans project workflow • Collaborating archive has already preserved content and made it accessible, and wishes to aggregate it – Content creator provides total number of metadata assets to be delivered, and provides summary of collection. Note: in this workflow the recordings are not preserved at WGBH or LOC.
  33. 33. Collection Development Criteria • Unique content • Content has not been widely distributed elsewhere or preserved elsewhere • Content created and owned by station • Content is at-risk due to its condition • Comprehensiveness of the collection • Content documents events, topics, places, persons, opinions, or attitudes of historical, cultural, political, sociological, anthropological, scientific, educational, technological, or aesthetic significance • Content reflects significant international, national, regional, state, or local culture, politics, or society; or presents the viewpoints of indigenous communities, subcultures, societal groups, or population segments • Content documents unique aspects of the style and practice of radio and television journalism • Older content • Content with a significant impact when first broadcast • Content that does not merely illustrate material available elsewhere in other types of media, such as text or photographs, but includes unique content not found in other sources • Content that has received awards • Raw footage, including interviews, that is unique and represents significant historic events, or some unique aspect of the local community • Content that could support educational initiatives • Content that the organization would allow the American Archive of Public Broadcasting to make available in the AAPB Online Reading Room
  34. 34. Contracting • Organization must agree to and sign AAPB’s Deed of Gift agreement – Donor donates rights, title and interest in digital copies – Donor confirms current copyright ownership and control (donor controls all, some or no rights) – Donor assigns rights to AAPB a. Assignment of copyright to AAPB b. Dedication to public domain c. Donor grants AAPB an irrevocable, non-exclusive, royalty-free worldwide perpetual license for AAPB’s discretionary uses of the Donated Materials, in addition to all uses permitted by law. Such discretionary uses may include but are not limited to cataloging, preservation, copying and migration for preservation and access purposes, exhibition, display, and making works available for non-commercial public access (including online), in accordance with AAPB policy and with applicable law. – Re-use by patrons a. Donor may select among any Creative Commons license b. Donor does not authorize AAPB to make Donated Materials available for re-use by patrons – All metadata is made available in the public domain
  35. 35. WGBH Software and Systems • Mac computing environment • PBCore metadata standard • Archival Management System (metadata repository) – PHP app on MySQL database – MINT (data ingestion and crosswalk software) – PBCore API (metadata normalization software) – BagIt (packaging metadata for ingestion) • Sony Ci (media host) – Ruby scripts to batch upload / delete – MediaBox to give limited research access • (public website) – Ruby on Rails – Solr index – Blacklight frontend • Amazon S3 and Web Services (web and document host) • Google Sheets • MediaInfo • FFmpeg • Sophos • Terminal
  36. 36. Administrative Metadata (assets) Technical Metadata (instantiations) Descriptive Metadata (assets) Preservation Metadata (instantiations) Archival Management System (AMS)
  37. 37. Project Input • Administrative metadata from appraisal inventory • Digitized media (masters/proxies OR links out) • Technical metadata • Descriptive metadata • Transcripts and/or closed caption files • Thumbnails
  38. 38. Ingestion Workflow: Phase 1 • Provide donor with metadata template and assist in their creation of appraisal item-level inventory. • Receive final appraisal inventory with administrative, technical, and descriptive metadata from donor. • Normalize the metadata, mapping it to PBCore. • Ingest metadata into AMS, which creates unique identifiers for each record. • Either receive digital files or coordinate transportation of physical tapes to digitization vendor. • Export project inventory containing all metadata paired with the new unique identifiers (GUIDs) from AMS.
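The "normalize the metadata, mapping it to PBCore" step can be pictured as a simple crosswalk. The sketch below is an assumption-laden illustration, not the MINT or PBCore API tooling named elsewhere in the deck: the donor column headers in `CROSSWALK` are invented, and the slides do not say which PBCore elements each donor field lands in.

```python
# Hypothetical donor spreadsheet headers mapped to PBCore element names.
# The element names exist in PBCore; the pairing with these headers is assumed.
CROSSWALK = {
    "Program Title": "pbcoreTitle",
    "Summary": "pbcoreDescription",
    "Air Date": "pbcoreAssetDate",
    "Tape Number": "pbcoreIdentifier",
}


def normalize_row(row):
    """Return (mapped, unmapped): PBCore-keyed values and leftover columns."""
    mapped, unmapped = {}, {}
    for column, value in row.items():
        value = " ".join(str(value).split())  # collapse stray whitespace
        if not value:
            continue  # drop empty cells instead of creating empty elements
        target = CROSSWALK.get(column.strip())
        if target:
            mapped[target] = value
        else:
            unmapped[column] = value
    return mapped, unmapped
```

Returning the unmapped columns separately keeps donor-specific fields visible to a cataloger instead of silently discarding them.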
  39. 39. donor “Born Digital” has digital files (previously digitized or born digital) Digitization Needed has tapes (grant-funded digitization project) • inventory • descriptive metadata • digital files (or links) WGBH • inventory • descriptive metadata digitization vendor • tapes we provide metadata template and assistance Ingestion Phase 1 inventory matching AMS records to tapes
  40. 40. Ingestion Workflow: Phase 2 Born Digital • Check for viruses • Verify inventory vs delivered content • Create checksums • Create proxy files • Create and upload MediaInfo technical metadata • Normalize file names Digitization • Send vendor project inventory • Coordinate digitization with vendor • Receive technical metadata and digitized files • Check for viruses • Confirm checksums and QC files
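Two of the born-digital steps above, "verify inventory vs delivered content" and "normalize file names", might look something like this. The naming rule in `normalize_name` is an assumption made for illustration; the slides do not state the convention AAPB actually uses.

```python
import re
from pathlib import Path


def normalize_name(filename):
    """Lowercase the name and replace runs of unsafe characters with '_'.

    Assumed convention for this sketch only: letters, digits, and hyphens are kept.
    """
    path = Path(filename)
    stem = re.sub(r"[^A-Za-z0-9-]+", "_", path.stem).strip("_").lower()
    return stem + path.suffix.lower()


def reconcile(inventory_names, delivered_names):
    """Compare expected and delivered file names after normalization."""
    expected = {normalize_name(name) for name in inventory_names}
    delivered = {normalize_name(name) for name in delivered_names}
    return {
        "missing": sorted(expected - delivered),     # listed but not received
        "unexpected": sorted(delivered - expected),  # received but not listed
    }
```

Normalizing both sides before comparing means a file delivered as `Show 01.MOV` still matches an inventory entry written as `show_01.mov`.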
  41. 41. donor proxies mediainfo checksums LOC AMS thumbnails AWS S3 Sony Ci metadata catalogers PBCore API browser Solr Rails Blacklight Born Digital Projects Workflow Original files WGBH Renamed original files Hard drive(s) WGBH LTO in vault WGBH MARS record assist with metadata creation normalization validate checksums clean-up scripts ORR reviewers bagit Improved PBCore API Sony Ci ids to AMS
  42. 42. Collaborating archive AMS thumbnails AWS S3 Sony Ci PBCore API browser Solr Rails Blacklight Aggregate Projects Workflow Links WGBH assist with metadata export normalization clean-up scripts Improved PBCore API “Dummy” Ci id to AMS thumbnails metadata
  43. 43. digitization vendor preservation files proxies mediainfo PREMIS checksums Captions Transcripts Library of Congress Google Sheets AMS thumbnails AWS S3 Sony Ci donor metadata catalogers PBCore API browser Solr Rails Blacklight Digitization Projects Workflow tapes WGBH LTO in vault WGBH MARS record normalization validate checksums clean-up scripts add Sony Ci ids to records ORR reviewers Improved PBCore API
  44. 44. WGBH Library of Congress WGBH LTO in vault validate checksums thumbnails AWS S3 WGBH MARS record Archival Management System (AMS) preservation metadata Sony Ci Master files proxies validate checksums Content Flow: Born Digital Projects
  45. 45. WGBH Collaborating archive Content Flow: Aggregation Projects Links Thumbnails AWS S3 Archival Management System (AMS)
  46. 46. WGBH digitization vendor Content Flow: Digitization Projects Library of Congress Master files Proxies WGBH LTO in vault validate checksums validate checksums Thumbnails AWS S3 WGBH MARS record Archival Management System (AMS) Sony Ci Preservation Metadata
  47. 47. Digital Preservation • WGBH generates or receives SIP checksums • Create AIP on LTO, and another version on spinning disk. – Use terminal to copy files and confirm checksums in batch processes. • Create manifest of checksums on LTO. • Put LTO in the vault • Retention schedule assigned to LTO tapes. • Upload preservation metadata (checksum, LTO number, drive number, project code) to each AMS record in the collection. • Put checksum manifests on department server.
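The batch "copy files and confirm checksums" and "create manifest of checksums" steps can be sketched as below. This is not the terminal process WGBH runs; the function names are invented for illustration, and SHA-256 is an assumed default because the slides do not name the checksum algorithm.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are never read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="sha256"):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_copy(manifest, copy_directory, algorithm="sha256"):
    """Return relative paths that are missing from the copy or whose checksum differs."""
    root = Path(copy_directory)
    failures = []
    for relative, expected in manifest.items():
        candidate = root / relative
        if not candidate.is_file() or file_checksum(candidate, algorithm) != expected:
            failures.append(relative)
    return failures
```

A manifest built from the source package can be checked against both the LTO copy and the spinning-disk copy; an empty list from `verify_copy` means every file arrived intact.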
  48. 48. Arrangement and Description • Project acquisition: – Varies from project to project; is determined by goals of grant and planned during appraisal. – Usually more robust, involves more staff time and item-level processing. • Standard acquisition: – All records uploaded immediately for on location access. – Interns perform our minimal viable cataloging workflow. – Staff conducts Online Reading Room reviews of full series and some items after acquisition has been ingested.
  49. 49. Online Public Access • Online Reading Room totals more than 23,000 programs available to anyone in the United States (30% of collection and growing) • Online access in accordance with copyright law, including legal doctrine of fair use • Access for research, educational and informational purposes only • Inclusion in the ORR determined by analysis of types of programs and examination of individual series and programs
  50. 50. ORR Review Workflow • Check if series/asset is produced by organization that signed a quitclaim • Review the series/asset • Determine the genre bucket • If an ORR bucket, check for 3rd party content from litigious organizations • Put in online reading room! (or don’t) – If the series/asset is questionable, we hold regular meetings with lawyers for final decisions.
  51. 51. Access Workflow • Run our Ruby script to batch upload proxy files to Sony Ci • Add Sony Ci identifiers and other functional metadata to corresponding AMS records • Use FFmpeg script to create thumbnail jpgs from video proxies • Upload thumbnails, transcripts, and closed caption files to Amazon S3 • “Reindex” on
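The thumbnail step could be approximated with a small wrapper around the `ffmpeg` command line. This is a sketch, not the script the team uses: the function names, the 10-second seek offset, and the JPEG quality setting are all assumptions.

```python
import subprocess
from pathlib import Path


def thumbnail_command(proxy_path, output_dir, seconds=10):
    """Build an ffmpeg command that grabs one frame from a proxy as a JPEG.

    The seek offset and quality value are illustrative defaults only.
    """
    proxy = Path(proxy_path)
    target = Path(output_dir) / (proxy.stem + ".jpg")
    return [
        "ffmpeg", "-y",        # overwrite an existing thumbnail
        "-ss", str(seconds),   # seek before decoding the input
        "-i", str(proxy),
        "-frames:v", "1",      # write a single video frame
        "-q:v", "2",           # high JPEG quality
        str(target),
    ]


def make_thumbnail(proxy_path, output_dir, seconds=10):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(thumbnail_command(proxy_path, output_dir, seconds), check=True)
```

Naming the thumbnail after the proxy's file stem keeps the image matched to its record when both are uploaded under the same identifier.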
  52. 52. Ecosystem PBCore API browser Solr Rails Blacklight clean-up scripts Archival Management System (AMS) thumbnails AWS S3 Sony Ci Transcripts & closed caption
  53. 53. Library of Congress: The Preservation Arm of the AAPB Rachel Curtis Digital Project
  54. 54. Preservation at the Library • All files ingested into the “deep archive” – Files written to T10K-C tapes – Access copies kept at two geographically separate locations • Migrations performed every 3-5 years • Checksums stored in MAVIS database – Files are periodically validated – Back-ups are used when problems occur
  55. 55. Inside the Data Center at NAVCC Data storage systems Data processing systems
  56. 56. Ingestion Workflows • New workflows developed to accommodate AAPB material – Ingestion outside of the normal “ordered” workflow at LC – Digital files with no physical component – MAVIS records need to be created before ingestion – Automate as much as possible • Metadata – Ingesting metadata from several different sources – Clean and map to MAVIS fields – Issues with differences in how AMS and MAVIS are structured • Different types of workflows – Files from vendor vs files from a donor – Born digital vs digitized content – LTO vs hard drive
  57. 57. Workflow Development Ingestion workflow for the NewsHour project.
  58. 58. The Library as AAPB Contributor • The Library holds a large collection of PBS and NET material on 16mm and 2” video • Watergate coverage and Impeachment hearings added to the AAPB in November • Challenges – Funding for in-house digitization – Develop new workflows – Resource allocation – Legal clearance – Exporting metadata from MAVIS
  59. 59. Inside the Vaults at NAVCC NET film PBS/NET tape
  60. 60. Challenges and Future Goals • Continually improve/adapt workflows • Put Baton QC software into wider use • Improve file delivery methods • Improve metadata mapping and MAVIS record creation
  61. 61. Find Us! @amarchivepub Charles Hosale @cmhosale Ann Wilkens Rachel Curtis Leslie Bourgeois