SlideShare a Scribd company logo
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Minimal Effort Ingest
en.statsbiblioteket.dk/minimal-effort-ingest
Dec 3, 2015 2
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
State and University Library
● A National Library
– Responsible for preserving the Danish Cultural
Heritage
● Many diverse collections, from many legacy
systems
– These collections must be preserved, but very few
users want access.
Dec 3, 2015 3
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
What is Minimal Effort Ingest?
● A different approach to ingest and Quality
Assurance
● In OAIS detailed QA is part of ingest
– Strict compliance required before ingest
● Minimal Effort Ingest postpones most of QA
– Data ingested as is
– QA is done just after ingest - or even later, if resources
are sparse
– Failure in QA is handled within the repository
Dec 3, 2015 4
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Why do Minimal Effort Ingest?
● Secure the incoming data quickly
● Old collections are preserved
– even if resources for QA are not available
● Update and rerun preservation actions as
needed
Dec 3, 2015 5
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Minimal Effort Ingest – An example
● Collection: Wav files and a CSV file with metadata
1) Ingest all the files, just as File Objects
2) Generate technical metadata for the File Objects
3) Parse the CSV file and create Track Objects
4) Generate Access Copies for the Track Objects
5) Verify that the Track Metadata is correct
1) Simple checks such as duration
2) Complex checks could be akin to forensics
6) Do speech2text to generate better indexes
You can do as many of these as you have the budget
for.
If you do only 1, the collection is still well preserved
If you also do 2, you will be able to plan for format
preservation risks
If you do 3 the collection can be made available for
discovery
If you do 4 the collection can be made available for
access
If you do 5, you can verify that your collection actually
contain what you believe it do
If you do 6, you can improve the discovery greatly
Do note that point 4 and 5 can be done in reverse
order, if quality is more important than access.
Dec 3, 2015 6
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
In a timely fashion...
● The important matter is that everything, data and
metadata and context, is available when needed,
and not before
● This includes information not known at the time of
creation
● So the question becomes not
– How much metadata do I need?
● but rather
– When would I need this metadata?
Some metadata is only available at the time of
creation, even if it is only used much later, eg.
digitization hardware.
While it is good practice to get as much metadata as
possible as early is possible, do not assume you
can get all.
Some require tools (speech2text, OCR) which are
still improving
Some metadata require special skills to both
generate and understand
The most important metadata might not be something
the creator can provide
Journals and citation-counts is one such example.
Truthfulness is another.
Dec 3, 2015 7
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Expensive Understandings
● In our experience, the most expensive part of
digital preservation is understanding your
collection
● This cost turned out to be fairly constant,
irrespective of the collection size
● This is even more true for Research Data
● Preserving the files and preserving the
understanding are very different challenges
Understanding a collection allows you to build data
models and to do QA
Datamodels are important for Access systems.
QA is only really important, if you are able to get a
better version of the data.
When receiving these data from a provider, you can
often request a new version, if something is broken.
When “represerving” an old collection or when getting
research data, the data is what it is, broken or not.
QA becomes less valuable, as a broken file is still
more valuable than no file
Preserving understanding. Is it nessessary, and how
much? Should I preserve the jpeg spec along with
my jpeg files? How about a dictionary?
Dec 3, 2015 8
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Preservation Events
● Our archival record's life will often consist of these three
phases
1) Raw Ingest
2) Enrichment and transformation to data model
3) Preservation Actions
● The history of a Record should include all these phases.
This happens naturally if the transformation happens inside
the repository.
● Unfortunately, many traditional systems do their most
important transformations before ingest.
With Minimal Effort Ingest, even the preparation
happens inside the repository. So whatever
version/event tracking system the repository uses,
will also list the initial transformations.
It is hard to prove authenticity if you cannot show
what changes happened from “files on disk” to
“SIP” even if you know everything that happened
from “SIP” and onwards.
Dec 3, 2015 9
Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen
baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk
Preservation 2.0?
● Web 1.0 was the web of static webpages, and
the user would read but never contribute
● Web 2.0 is perhaps best exemplified by Wikis,
where the user is also an editor
● Records are updated, but with strong
versioning and history
This does not mean everybody can edit, it means that
the system is build around the concept of updating
and enriching content. We still envisage a strong
Curatorial presence.
The dead archival record is past. Records in the
repository are alive. They are updated, changed
and interlinked during their lifetime.
Design your preservation systems not as the archives
of old, but as the wikis of today.

More Related Content

Viewers also liked

The Biggest Lender
The Biggest LenderThe Biggest Lender
The Biggest LenderAdrian Teng
 
Trabajo de organigrama
Trabajo de organigramaTrabajo de organigrama
Trabajo de organigrama
grtm132
 
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
painAlex
 
SuperMemo World
SuperMemo WorldSuperMemo World
cardinal health Q3 2007 Earnings Release
cardinal health 	Q3 2007 Earnings Releasecardinal health 	Q3 2007 Earnings Release
cardinal health Q3 2007 Earnings Releasefinance2
 
mckesson Annual Report as Filed on Form 10-K - 880k 2004
mckesson Annual Report as Filed on Form 10-K - 880k  2004mckesson Annual Report as Filed on Form 10-K - 880k  2004
mckesson Annual Report as Filed on Form 10-K - 880k 2004finance2
 
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005mckesson Annual Report as Filed on Form 10-K - 2.3M 2005
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005finance2
 
2003 Merrill Lynch Global Healthcare Conference
	 2003 Merrill Lynch Global Healthcare Conference	 2003 Merrill Lynch Global Healthcare Conference
2003 Merrill Lynch Global Healthcare Conferencefinance2
 
Mekesson Quarterly Reports 2002 1st
Mekesson Quarterly Reports 2002 1stMekesson Quarterly Reports 2002 1st
Mekesson Quarterly Reports 2002 1stfinance2
 

Viewers also liked (11)

practice profile
practice profilepractice profile
practice profile
 
The Biggest Lender
The Biggest LenderThe Biggest Lender
The Biggest Lender
 
MSanJoaquin_CV
MSanJoaquin_CVMSanJoaquin_CV
MSanJoaquin_CV
 
Trabajo de organigrama
Trabajo de organigramaTrabajo de organigrama
Trabajo de organigrama
 
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
Екскурсія музеєм Баштанської ЗОШ І-ІІІ ст. №2
 
SuperMemo World
SuperMemo WorldSuperMemo World
SuperMemo World
 
cardinal health Q3 2007 Earnings Release
cardinal health 	Q3 2007 Earnings Releasecardinal health 	Q3 2007 Earnings Release
cardinal health Q3 2007 Earnings Release
 
mckesson Annual Report as Filed on Form 10-K - 880k 2004
mckesson Annual Report as Filed on Form 10-K - 880k  2004mckesson Annual Report as Filed on Form 10-K - 880k  2004
mckesson Annual Report as Filed on Form 10-K - 880k 2004
 
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005mckesson Annual Report as Filed on Form 10-K - 2.3M 2005
mckesson Annual Report as Filed on Form 10-K - 2.3M 2005
 
2003 Merrill Lynch Global Healthcare Conference
	 2003 Merrill Lynch Global Healthcare Conference	 2003 Merrill Lynch Global Healthcare Conference
2003 Merrill Lynch Global Healthcare Conference
 
Mekesson Quarterly Reports 2002 1st
Mekesson Quarterly Reports 2002 1stMekesson Quarterly Reports 2002 1st
Mekesson Quarterly Reports 2002 1st
 

Similar to Minimal Effort Ingest for DPC Metadata meeting

Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities Class
Aaron Collie
 
Getting started in digital preservation
Getting started in digital preservationGetting started in digital preservation
Getting started in digital preservation
Sarah Jones
 
Data presentation and transfer
Data presentation and transferData presentation and transfer
Data presentation and transferIyad Abou Rabii
 
Research Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering StudentsResearch Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering Students
Aaron Collie
 
Getting to grips with Research Data Management
Getting to grips with Research Data ManagementGetting to grips with Research Data Management
Getting to grips with Research Data Management
IzzyChad
 
Introduction to data management
Introduction to data managementIntroduction to data management
Introduction to data management
cunera
 
20160414 23 Research Data Things
20160414 23 Research Data Things20160414 23 Research Data Things
20160414 23 Research Data Things
Katina Toufexis
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
John Scally
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
Sarah Anna Stewart
 
Smithies bodleian 2017_v.2.0
Smithies bodleian 2017_v.2.0Smithies bodleian 2017_v.2.0
Smithies bodleian 2017_v.2.0
jamessmithies
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
kulibrarians
 
Getting Things Done for Technical Communicators at TCUK14
Getting Things Done for Technical Communicators at TCUK14Getting Things Done for Technical Communicators at TCUK14
Getting Things Done for Technical Communicators at TCUK14
Karen Mardahl
 
Data management
Data management Data management
Data management
Graça Gabriel
 
Introducingthe anu datacommons
Introducingthe anu datacommonsIntroducingthe anu datacommons
Introducingthe anu datacommons
Doug Moncur
 
Getting to grips with research data management
Getting to grips with research data management Getting to grips with research data management
Getting to grips with research data management
Wendy Mears
 
How to share useful data
How to share useful dataHow to share useful data
How to share useful data
Peter McQuilton
 
Filling the digital preservation gap
Filling the digital preservation gapFilling the digital preservation gap
Filling the digital preservation gap
Jisc
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Subrata Saharia
 

Similar to Minimal Effort Ingest for DPC Metadata meeting (20)

Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities Class
 
Getting started in digital preservation
Getting started in digital preservationGetting started in digital preservation
Getting started in digital preservation
 
Data presentation and transfer
Data presentation and transferData presentation and transfer
Data presentation and transfer
 
Research Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering StudentsResearch Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering Students
 
Getting to grips with Research Data Management
Getting to grips with Research Data ManagementGetting to grips with Research Data Management
Getting to grips with Research Data Management
 
Introduction to data management
Introduction to data managementIntroduction to data management
Introduction to data management
 
20160414 23 Research Data Things
20160414 23 Research Data Things20160414 23 Research Data Things
20160414 23 Research Data Things
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
 
Smithies bodleian 2017_v.2.0
Smithies bodleian 2017_v.2.0Smithies bodleian 2017_v.2.0
Smithies bodleian 2017_v.2.0
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
 
Getting Things Done for Technical Communicators at TCUK14
Getting Things Done for Technical Communicators at TCUK14Getting Things Done for Technical Communicators at TCUK14
Getting Things Done for Technical Communicators at TCUK14
 
Data management
Data management Data management
Data management
 
Patterson e life uksg 2013
Patterson e life uksg 2013Patterson e life uksg 2013
Patterson e life uksg 2013
 
Introducingthe anu datacommons
Introducingthe anu datacommonsIntroducingthe anu datacommons
Introducingthe anu datacommons
 
Getting to grips with research data management
Getting to grips with research data management Getting to grips with research data management
Getting to grips with research data management
 
How to share useful data
How to share useful dataHow to share useful data
How to share useful data
 
Filling the digital preservation gap
Filling the digital preservation gapFilling the digital preservation gap
Filling the digital preservation gap
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 

Recently uploaded

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 

Recently uploaded (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

Minimal Effort Ingest for DPC Metadata meeting

  • 1. Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Minimal Effort Ingest en.statsbiblioteket.dk/minimal-effort-ingest
  • 2. Dec 3, 2015 2 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk State and University Library ● A National Library – Responsible for preserving the Danish Cultural Heritage ● Many diverse collections, from many legacy systems – These collections must be preserved, but very few users want access.
  • 3. Dec 3, 2015 3 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk What is Minimal Effort Ingest? ● A different approach to ingest and Quality Assurance ● In OAIS detailed QA is part of ingest – Strict compliance required before ingest ● Minimal Effort Ingest postpones most of QA – Data ingested as is – QA is done just after ingest - or even later, if resources are sparse – Failure in QA is handled within the repository
  • 4. Dec 3, 2015 4 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Why do Minimal Effort Ingest? ● Secure the incoming data quickly ● Old collections are preserved – even if resources for QA are not available ● Update and rerun preservation actions as needed
  • 5. Dec 3, 2015 5 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Minimal Effort Ingest – An example ● Collection: Wav files and a CSV file with metadata 1) Ingest all the files, just as File Objects 2) Generate technical metadata for the File Objects 3) Parse the CSV file and create Track Objects 4) Generate Access Copies for the Track Objects 5) Verify that the Track Metadata is correct 1) Simple checks such as duration 2) Complex checks could be akin to forensics 6) Do speech2text to generate better indexes You can do as many of these as you have the budget for. If you do only 1, the collection is still well preserved If you also do 2, you will be able to plan for format preservation risks If you do 3 the collection can be made available for discovery If you do 4 the collection can be made available for access If you do 5, you can verify that your collection actually contain what you believe it do If you do 6, you can improve the discovery greatly Do note that point 4 and 5 can be done in reverse order, if quality is more important than access.
  • 6. Dec 3, 2015 6 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk In a timely fashion... ● The important matter is that everything, data and metadata and context, is available when needed, and not before ● This includes information not known at the time of creation ● So the question becomes not – How much metadata do I need? ● but rather – When would I need this metadata? Some metadata is only available at the time of creation, even if it is only used much later, eg. digitization hardware. While it is good practice to get as much metadata as possible as early is possible, do not assume you can get all. Some require tools (speech2text, OCR) which are still improving Some metadata require special skills to both generate and understand The most important metadata might not be something the creator can provide Journals and citation-counts is one such example. Truthfulness is another.
  • 7. Dec 3, 2015 7 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Expensive Understandings ● In our experience, the most expensive part of digital preservation is understanding your collection ● This cost turned out to be fairly constant, irrespective of the collection size ● This is even more true for Research Data ● Preserving the files and preserving the understanding are very different challenges Understanding a collection allows you to build data models and to do QA Datamodels are important for Access systems. QA is only really important, if you are able to get a better version of the data. When receiving these data from a provider, you can often request a new version, if something is broken. When “represerving” an old collection or when getting research data, the data is what it is, broken or not. QA becomes less valuable, as a broken file is still more valuable than no file Preserving understanding. Is it nessessary, and how much? Should I preserve the jpeg spec along with my jpeg files? How about a dictionary?
  • 8. Dec 3, 2015 8 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Preservation Events ● Our archival record's life will often consist of these three phases 1) Raw Ingest 2) Enrichment and transformation to data model 3) Preservation Actions ● The history of a Record should include all these phases. This happens naturally if the transformation happens inside the repository. ● Unfortunately, many traditional systems do their most important transformations before ingest. With Minimal Effort Ingest, even the preparation happens inside the repository. So whatever version/event tracking system the repository uses, will also list the initial transformations. It is hard to prove authenticity if you cannot show what changes happened from “files on disk” to “SIP” even if you know everything that happened from “SIP” and onwards.
  • 9. Dec 3, 2015 9 Bolette Ammitzbøll Jurik Asger Askov Blekinge Kåre Fiedler Christiansen baj@statsbiblioteket.dk abr@statsbiblioteket.dk kfc@statsbiblioteket.dk Preservation 2.0? ● Web 1.0 was the web of static webpages, and the user would read but never contribute ● Web 2.0 is perhaps best exemplified by Wikis, where the user is also an editor ● Records are updated, but with strong versioning and history This does not mean everybody can edit, it means that the system is build around the concept of updating and enriching content. We still envisage a strong Curatorial presence. The dead archival record is past. Records in the repository are alive. They are updated, changed and interlinked during their lifetime. Design your preservation systems not as the archives of old, but as the wikis of today.

Editor's Notes

  1. You can do as many of these as you have the budget for. If you do only 1, the collection is still well preserved If you also do 2, you will be able to plan for format preservation risks If you do 3 the collection can be made available for discovery If you do 4 the collection can be made available for access If you do 5, you can verify that your collection actually contain what you believe it do If you do 6, you can improve the discovery greatly Do note that point 4 and 5 can be done in reverse order, if quality is more important than access.
  2. Some metadata is only available at the time of creation, even if it is only used much later, eg. digitization hardware. While it is good practice to get as much metadata as possible as early is possible, do not assume you can get all. Some require tools (speech2text, OCR) which are still improving Some metadata require special skills to both generate and understand The most important metadata might not be something the creator can provide Journals and citation-counts is one such example. Truthfulness is another.
  3. Understanding a collection allows you to build data models and to do QA Datamodels are important for Access systems. QA is only really important, if you are able to get a better version of the data. When receiving these data from a provider, you can often request a new version, if something is broken. When “represerving” an old collection or when getting research data, the data is what it is, broken or not. QA becomes less valuable, as a broken file is still more valuable than no file Preserving understanding. Is it nessessary, and how much? Should I preserve the jpeg spec along with my jpeg files? How about a dictionary?
  4. With Minimal Effort Ingest, even the preparation happens inside the repository. So whatever version/event tracking system the repository uses, will also list the initial transformations. It is hard to prove authenticity if you cannot show what changes happened from “files on disk” to “SIP” even if you know everything that happened from “SIP” and onwards.
  5. This does not mean everybody can edit, it means that the system is build around the concept of updating and enriching content. We still envisage a strong Curatorial presence. The dead archival record is past. Records in the repository are alive. They are updated, changed and interlinked during their lifetime. Design your preservation systems not as the archives of old, but as the wikis of today.