SFU Library's METS-Bagger Tool


Normalizing existing digitized content into standardized packages for robust long-term management. A report on SFU Library's METS-Bagger tool, with a discussion of the benefits, design principles used for the packaging specification, and potential next steps. Presented at Code4Lib BC, November 28, 2013.

METS-Bagger Tool
Normalizing existing digitized content into standardized
packages for robust long-term management.

Marcus Emmanuel Barnes
#c4lbc
2013-11-28

Background
● SFU Library holds about 15 TB of content
○ The Library has created high-quality master versions
of the content it has digitized, using ‘preservation-friendly’ formats.
○ Descriptive metadata exists for almost all of it.

However, this content was not previously
managed according to generally accepted digital
preservation practices.

Solution
● SFU Library Digitized Content Packaging
Specification
● METS-Bagger tool for normalizing existing
digitized content based on this specification
for robust long-term management.

METS-Bagger Tool
● Two components:
○ Collection normalization script
○ Integrity scripts based on collection
manifest

Collection Normalization
● Processes existing collections of files into a format
compliant with the SFU Library Digitized Content
Packaging Specification
● Packaging Formats:
○ METS (http://www.loc.gov/standards/mets/)
○ BagIt (http://tools.ietf.org/html/draft-kunze-bagit)
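As a rough illustration of what a BagIt package looks like on disk, the sketch below builds a minimal bag by hand using only the Python standard library. It is not the METS-Bagger code: the function name `make_minimal_bag`, the choice of SHA-256 for the payload manifest, and the version string `0.97` (the BagIt draft version current around the time of the talk) are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(item_dir: Path, bag_dir: Path) -> Path:
    """Copy item_dir into bag_dir/data and write the two required tag files.

    Hypothetical helper for illustration; bag_dir/data must not exist yet.
    """
    data_dir = bag_dir / "data"
    shutil.copytree(item_dir, data_dir)  # also creates bag_dir if needed

    # Bag declaration: every bag carries a bagit.txt with these two fields.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to the bag>" line
    # for each file under data/.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag_dir
```

In a package of this kind, the item's METS file would presumably sit in the payload next to the files it describes; where exactly it goes is defined by the packaging specification, not by this sketch.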

How Collection Normalization Works
1. Settings are read from a configuration file
2. Script walks the directory tree of a collection, compiles
list of files to be preserved
3. Files are collated into items (e.g., newspaper issue),
METS file is generated
4. Item files and associated METS file are bagged (and
serialized)
5. Future: A collection manifest is created for the collection
for integrity checking (automatic or manual).
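Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal sketch, not the tool's actual code: the function names are hypothetical, and it assumes that each directory holding files of interest corresponds to one item (for example, one newspaper issue), which may not match how real collections are laid out.

```python
import os
from collections import defaultdict
from pathlib import Path


def collect_files(collection_root: Path, extensions) -> list:
    """Walk the collection tree and return files whose extension is wanted."""
    wanted = {ext.lower() for ext in extensions}
    found = []
    for dirpath, _dirnames, filenames in os.walk(collection_root):
        for name in filenames:
            if Path(name).suffix.lower() in wanted:
                found.append(Path(dirpath) / name)
    return sorted(found)


def collate_into_items(files, collection_root: Path) -> dict:
    """Group files into items, keyed by their directory relative to the root.

    Assumption for this sketch: one directory equals one item.
    """
    items = defaultdict(list)
    for path in files:
        item_id = path.parent.relative_to(collection_root).as_posix()
        items[item_id].append(path)
    return dict(items)
```

Each resulting item (a key plus its list of files) would then be handed to the METS generation and bagging steps.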

Design Principles
● A minimalist implementation: uses as few METS and
BagIt options as possible.
● Incorporates three widely implemented and understood
standards: METS, BagIt, and UUID (Universally Unique
Identifier)
● Technical metadata included in METS should include at
a minimum bit-level checksums, file type identification,
creating application, and where possible format validity
● Whenever possible, include descriptive metadata for the
item in the METS file.
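To make these principles concrete, here is a minimal sketch of a METS document that identifies an item with a UUID and records a bit-level checksum for each file. The element layout (`fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`) follows the METS schema, but the function name, the `USE="master"` label, and the use of `mimetypes.guess_type` are assumptions for illustration. Guessing a MIME type from the file name is much weaker than the file identification a tool such as FITS performs, and this sketch omits creating application and format validity entirely.

```python
import hashlib
import mimetypes
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(item_dir: Path) -> ET.Element:
    """Return a minimal METS tree describing every file under item_dir."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": f"urn:uuid:{uuid.uuid4()}"})
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "master"})
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    item_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "item"})

    files = sorted(p for p in item_dir.rglob("*") if p.is_file())
    for number, path in enumerate(files, start=1):
        file_id = f"FILE{number:04d}"
        attrs = {
            "ID": file_id,
            "CHECKSUM": hashlib.sha256(path.read_bytes()).hexdigest(),
            "CHECKSUMTYPE": "SHA-256",
        }
        mime, _encoding = mimetypes.guess_type(path.name)
        if mime:  # name-based guess only; a stand-in for real identification
            attrs["MIMETYPE"] = mime
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", attrs)
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path.relative_to(item_dir).as_posix()},
        )
        # The structural map points back at each file by its ID.
        ET.SubElement(item_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return root
```

Calling `ET.tostring(build_mets(some_dir), encoding="unicode")` yields the XML text; descriptive metadata, where available, would go in a `dmdSec` that this sketch leaves out.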

Script Details
● Configuration file, main script, log file, processed
collection output directory
● Written in Python so the tool can run on multiple platforms
● Plugins for technical metadata (FITS) and descriptive
metadata.
● Configuration options include:
○ test run (limited run size)
○ skipping technical metadata creation
○ file types of interest
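A configuration file covering the options listed above might be read as follows. The slides do not show the real file, so everything here is an assumption: the INI format, the section name `normalization`, and every option name are hypothetical, chosen only to mirror the options named on the slide.

```python
import configparser

# Hypothetical configuration text; the real option names are not given
# in the slides.
SAMPLE_CONFIG = """
[normalization]
collection_dir = /path/to/collection
output_dir = /path/to/output
test_run = true
test_run_limit = 10
skip_technical_metadata = false
file_types = .tif, .jp2, .pdf
"""


def load_settings(text: str) -> dict:
    """Parse the INI text into a plain dictionary of typed settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["normalization"]
    raw_types = section.get("file_types", fallback="")
    return {
        "collection_dir": section.get("collection_dir"),
        "output_dir": section.get("output_dir"),
        # Test run: process only a limited number of items.
        "test_run": section.getboolean("test_run", fallback=False),
        "test_run_limit": section.getint("test_run_limit", fallback=10),
        # Skipping technical metadata avoids the slowest step.
        "skip_technical_metadata": section.getboolean(
            "skip_technical_metadata", fallback=False
        ),
        "file_types": [t.strip().lower() for t in raw_types.split(",") if t.strip()],
    }
```

For a real file on disk, `parser.read(path)` would replace `parser.read_string(text)`.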

Future
● Addition of manifest and integrity checking
tools that check a collection against its
manifest
● Additional plugins
● Sharing code on GitHub
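The planned manifest and integrity checking could work along the following lines. Since this component was still future work at the time of the talk, the sketch is purely illustrative: the manifest format (one `<sha256>  <relative path>` line per file, borrowed from BagIt manifests) and all function names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large masters need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(collection_dir: Path, manifest_path: Path) -> None:
    """Record a checksum line for every file in the collection."""
    lines = [
        f"{sha256_of(p)}  {p.relative_to(collection_dir).as_posix()}"
        for p in sorted(collection_dir.rglob("*"))
        if p.is_file() and p != manifest_path
    ]
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def check_against_manifest(collection_dir: Path, manifest_path: Path) -> list:
    """Return a list of problems; an empty list means the collection is intact."""
    problems = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, relative = line.split(None, 1)
        target = collection_dir / relative
        if not target.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {relative}")
    return problems
```

A check like this could be run manually or on a schedule; note that it detects missing and altered files but not files added after the manifest was written.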

Thank You
This work was made possible by the support of:
● Simon Fraser University Library
● SFU Library Systems group
● Mark Jordan @mjordan

7. Before and After Processing