•Andrew Jackson
•Web Archiving Technical Lead
•British Library
Unified Characterisation, Please
The Practitioners Have Spoken…
• Quality Assurance (of broken or potentially broken data):
  • Quality assurance, Bit rot, and Integrity
• Appraisal and Assessment:
  • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
  • Identify/Locate Preservation Worthy Data
• Identify Preservation Risks:
  • Obsolescence, preservation risk and business constraint
• Long tail of many other issues:
  • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
• Plus: Sustainable Tools
Appraisal and Assessment
Conformance, Unknown characteristics, and Unknown file
formats. Identify/Locate Preservation Worthy Data
• Identification
  • Always used to 'route' data to software that can understand it.
  • Use the minimum information needed to identify:
    • e.g. header only if possible. Report "Truncated PDF", not "UNKNOWN". A GIS shapefile with .shp and .shx but a missing .dbf should be reported as such.
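
A minimal sketch of the identification idea above, assuming a simple byte-level check is acceptable for illustration. The function names `identify_pdf` and `check_shapefile` and the returned labels are hypothetical, not taken from DROID, Tika or any other tool; a real identifier would use proper signatures.

```python
def identify_pdf(data: bytes) -> str:
    """Classify bytes using only the header and the end-of-file marker."""
    if not data.startswith(b"%PDF-"):
        return "UNKNOWN"
    # The header alone says "this is meant to be a PDF".
    # No %%EOF marker near the end suggests the file was cut short.
    if b"%%EOF" not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"


def check_shapefile(present_extensions: set) -> str:
    """Report a shapefile by which of its component files are present."""
    required = {".shp", ".shx", ".dbf"}
    if ".shp" not in present_extensions:
        return "UNKNOWN"
    missing = sorted(required - present_extensions)
    if missing:
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"


print(identify_pdf(b"%PDF-1.4\n1 0 obj\n"))
print(check_shapefile({".shp", ".shx"}))
```

The point of the sketch is that a partial match still produces a useful label for routing, rather than collapsing to "UNKNOWN".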
• Validation
  • Two modes needed: "Fast fail", "Log and continue" / Quirks
  • Stop baseless distinction between "Well formed" and "Valid"
  • Validation is irrelevant to digital preservation assessment:
    • e.g. Effective "PDF/A", without the 1.4 and XMP chunk.
  • We're on the wrong side of Postel's Law.
  • Unknown completeness and failure to future-proof:
    • e.g. JHOVE tries to validate versions of PDF it cannot know.
    • e.g. Tools sometimes interpret/migrate data opaquely.
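
The two validation modes could be sketched roughly as below. This is an illustration under assumptions, not how JHOVE or any other validator is built: the `validate` function, the `fast_fail` flag and the two toy checks are all hypothetical.

```python
from typing import Callable, List, Optional

# A check returns a description of the problem, or None if it passes.
Check = Callable[[bytes], Optional[str]]


class ValidationError(Exception):
    """Raised in fast-fail mode at the first problem found."""


def check_header(data: bytes) -> Optional[str]:
    return None if data.startswith(b"%PDF-") else "missing %PDF- header"


def check_eof(data: bytes) -> Optional[str]:
    return None if b"%%EOF" in data[-1024:] else "missing %%EOF marker"


def validate(data: bytes, checks: List[Check], fast_fail: bool = False) -> List[str]:
    """Run every check; stop at the first problem only when fast_fail is set."""
    problems = []
    for check in checks:
        problem = check(data)
        if problem is None:
            continue
        if fast_fail:
            raise ValidationError(problem)
        problems.append(problem)  # log-and-continue mode
    return problems


print(validate(b"junk", [check_header, check_eof]))
```

Fast fail suits ingest gates where any problem is a rejection; log-and-continue suits assessment, where the full list of quirks is the useful output.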
Identify Preservation Risks
Obsolescence, preservation risk and business constraint
• Significant Properties are irrelevant here.
  • It's not really about the content, but about the context.
• Dependency Analysis:
  • What software does this need?
  • Does this file use format features that are not well supported across implementations?
  • What other resources are transcluded?
    • Fonts? cf. OfficeDDT.
    • Remote embeds?
  • Embedded scripts that might mask dependencies?
  • Do some operations require a password?
    • e.g. JHOVE cannot spot 'harmless' PDF encryption.
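
As a rough illustration of dependency analysis, the sketch below scans raw PDF bytes for a few name tokens that hint at external or hidden dependencies. It is a crude heuristic under stated assumptions: `scan_dependencies` and the labels are hypothetical, and tokens inside compressed object streams would be missed, so a real tool would need to parse the file.

```python
# PDF name tokens and the dependency question each one raises.
DEPENDENCY_MARKERS = {
    b"/Encrypt": "encryption (password may be needed)",
    b"/JavaScript": "embedded script",
    b"/URI": "remote link",
    b"/EmbeddedFile": "embedded file",
}


def scan_dependencies(data: bytes) -> list:
    """Return the labels of all markers found anywhere in the raw bytes."""
    return [label for marker, label in DEPENDENCY_MARKERS.items() if marker in data]


sample = b"%PDF-1.4\ntrailer << /Encrypt 5 0 R >>\n%%EOF"
print(scan_dependencies(sample))
```

A report like this describes the context a file needs, which is separate from whether the file is valid.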
Sustainable Tools
Our Tools
• Pure-Java Characterisation:
  • JHOVE ('clean room' implementation)
  • New Zealand Metadata Extractor (NZME)
  • Apache Tika
• Java-based aggregation of various CLI tools:
  • JHOVE2
  • FITS
• Other Characterisation:
  • XCL – C++/XML 'clean room' extended with ImageMagick
  • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
• Identification:
  • DROID, FIDO, Apache Tika, File
• Visualisation:
  • C3PO, and many non-specialised tools.
Sustainable Tools
Up to date? Working together?
• Software Dependency Management:
  • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
  • Dead dependencies: FITS and FFIdent, NZME and Jflac.
  • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
  • Embed shared modules instead?
• Software Project Management and Communication:
  • JHOVE, JHOVE2? FITS?
  • JHOVE2 only compiles on Sheila's branch?
  • Roadmaps, issue management, testing, C.I., etc.
  • Cross-project coordination and bug-fixing?
• Complexity: JHOVE2 and XCL are extremely complex
  • JHOVE2 Berkeley DB causes checksum failures in tests
  • Tika solves the same problem using SAX
Sustainable Tools
Shared tests?
• Separate projects arise from separate workflows
  • Start by understanding commonality and finding gaps?
• Share test cases and compare results?
  • The OPF Format Corpus contains various valid and invalid files.
  • Built by practitioners to test real use cases.
    • e.g. JP2 features, PDF Cabinet of Horrors.
• Do the tools give consistent and complementary results?
  • Let's find out!
  • cf. Dave Tarrant's REF for Identification:
    • http://data.openplanetsfoundation.org/ref/
    • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
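
Comparing tools over a shared corpus might start with something like the sketch below. The `find_disagreements` function is hypothetical, and the sample results are made-up placeholder values, not real output from DROID, Tika or any other tool.

```python
def find_disagreements(results: dict) -> dict:
    """results maps tool name -> {file name: reported format}.

    Returns only the files on which the tools do not all agree.
    """
    files = set()
    for per_file in results.values():
        files.update(per_file)
    disagreements = {}
    for name in sorted(files):
        answers = {tool: per_file.get(name, "NO RESULT")
                   for tool, per_file in results.items()}
        if len(set(answers.values())) > 1:
            disagreements[name] = answers
    return disagreements


# Placeholder data for illustration only.
sample_results = {
    "tool_a": {"a.pdf": "PDF 1.4", "b.jp2": "JP2"},
    "tool_b": {"a.pdf": "PDF 1.4", "b.jp2": "UNKNOWN"},
}
print(find_disagreements(sample_results))
```

Files where the tools disagree, or where one tool returns nothing, are the natural candidates for new shared test cases.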
Bit-mashing as Tool QA
• Bitwise exploration of data sensitivity.
• One way to compare tools.
• Helps understand formats.
• cf. Jay Gattuso's recent OPF blog.
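
A minimal sketch of bit-mashing, assuming the tool under test can be wrapped as a Python callable that takes bytes and returns a comparable result. The names `flip_bit`, `bit_mash` and `toy_identifier` are hypothetical; the toy identifier stands in for a real characterisation tool.

```python
def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def bit_mash(data: bytes, tool) -> list:
    """Return the bit positions whose flip changes the tool's output."""
    baseline = tool(data)
    return [i for i in range(len(data) * 8)
            if tool(flip_bit(data, i)) != baseline]


def toy_identifier(data: bytes) -> str:
    # Stand-in for a real tool: looks at the five-byte PDF header only.
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"


sensitive = bit_mash(b"%PDF-1.4", toy_identifier)
print(len(sensitive), "of", 8 * len(b"%PDF-1.4"), "bits affect the result")
```

Running the same loop against two tools and comparing their sensitive positions shows which parts of a file each tool actually inspects. The cost grows with file size (one tool run per bit), so small test files are assumed here.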
Quality Assurance (of broken or potentially broken data)
Quality assurance, Bit rot, and Integrity
• JHOVE let failed TIFF-JP2 through…
• Jpylyzer does better.
• Both fall far short of actual rendering.
Where's the unification?
Where should we work together?
• Shared test corpora and test framework:
  • Start with the OPF Format Corpus?
  • Pull other corpora in by reference:
    • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
  • Sustainable version of Dave Tarrant's REF?
  • Extend with bit-mashing to compare tools?
• Aim to coordinate more:
  • Make it clear where to go? (More about OfficeDDT).
  • Consider merging projects?
  • Consider sharing underlying libraries?
  • Consider building Tika modules?
  • Please consider Apache Preflight as a base for PDF validation.
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Unified characterisation, please

  • 1. Unified Characterisation, Please
      • Andrew Jackson
      • Web Archiving Technical Lead
      • British Library
  • 2. The Practitioners Have Spoken…
      • Quality Assurance (of broken or potentially broken data):
      • Quality assurance, Bit rot, and Integrity
      • Appraisal and Assessment:
      • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
      • Identify/Locate Preservation Worthy Data
      • Identify Preservation Risks:
      • Obsolescence, preservation risk and business constraint
      • Long tail of many other issues:
      • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
      • Plus: Sustainable Tools
  • 3. Appraisal and Assessment: Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data
      • Identification
      • Always used to ‘route’ data to software that can understand it.
      • Use minimum information to identify:
      • e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”. GIS shapefiles: .shp, .shx, but with a missing .dbf should be reported as such.
      • Validation
      • Two modes needed: “Fast fail”, “Log and continue”/Quirks
      • Stop baseless distinction between “Well formed” and “Valid”
      • Validation is irrelevant to digital preservation assessment:
      • e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
      • We’re on the wrong side of Postel’s Law.
      • Unknown completeness and failure to future-proof:
      • e.g. JHOVE tries to validate versions of PDF it cannot know.
      • e.g. Tools sometimes interpret/migrate data opaquely.
  • 6. Identify Preservation Risks: Obsolescence, preservation risk and business constraint
      • Significant Properties are irrelevant here.
      • It’s not really about the content, but about the context.
      • Dependency Analysis:
      • What software does this need?
      • Does this file use format features that are not well supported across implementations?
      • What other resources are transcluded?
      • Fonts? c.f. OfficeDDT.
      • Remote embeds?
      • Embedded scripts that might mask dependencies?
      • Do some operations require a password?
      • e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
  • 7. Sustainable Tools: Our Tools
      • Pure-Java Characterisation:
      • JHOVE (‘clean room’ implementation)
      • New Zealand Metadata Extractor (NZME)
      • Apache Tika
      • Java-based aggregation of various CLI tools:
      • JHOVE2
      • FITS
      • Other Characterisation:
      • XCL – C++/XML ‘clean room’ extended with ImageMagick
      • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
      • Identification:
      • DROID, FIDO, Apache Tika, File
      • Visualisation:
      • C3PO, and many non-specialised tools.
  • 8. Sustainable Tools: Up to date? Working together?
      • Software Dependency Management:
      • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
      • Dead dependencies: FITS and FFIdent, NZME and Jflac.
      • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
      • Embed shared modules instead?
      • Software Project Management and Communication:
      • JHOVE, JHOVE2? FITS?
      • JHOVE2 only compiles on Sheila’s branch?
      • Roadmaps, issue management, testing, C.I., etc.
      • Cross-project coordination and bug-fixing?
      • Complexity: JHOVE2, XCL, extremely complex
      • JHOVE2 Berkeley DB causes checksum failures in tests
      • Tika solves same problem using SAX
  • 9. Sustainable Tools: Shared tests?
      • Separate projects arise from separate workflows
      • Start by understanding commonality and finding gaps?
      • Share test cases and compare results?
      • The OPF Format Corpus contains various valid and invalid files.
      • Built by practitioners to test real use cases.
      • e.g. JP2 features, PDF Cabinet of Horrors.
      • Do the tools give consistent and complementary results?
      • Let’s find out!
      • c.f. Dave Tarrant’s REF for Identification:
      • http://data.openplanetsfoundation.org/ref/
      • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
  • 10. Bit-mashing as Tool QA
      • Bitwise exploration of data sensitivity.
      • One way to compare tools.
      • Helps understand formats.
      • c.f. Jay Gattuso’s recent OPF blog.
  • 11. Quality Assurance (of broken or potentially broken data): Quality assurance, Bit rot, and Integrity
      • JHOVE let failed TIFF-JP2 through…
      • Jpylyzer does better.
      • Both fall far short of actual rendering.
  • 12. Where's the unification? Where should we work together?
      • Shared test corpora and test framework:
      • Start with the OPF Format Corpus?
      • Pull other corpora in by reference:
      • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
      • Sustainable version of Dave Tarrant’s REF?
      • Extend with bit-mashing to compare tools?
      • Aim to coordinate more:
      • Make it clear where to go? (More about OfficeDDT).
      • Consider merging projects?
      • Consider sharing underlying libraries?
      • Consider building Tika modules?
      • Please consider Apache Preflight as base for PDF validation.
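
The bit-mashing idea from slides 10 and 12 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `identify` is a hypothetical toy identifier (not DROID, FIDO, Tika or File) that follows the slide 3 advice of reporting “Truncated PDF” rather than “UNKNOWN” when the header is recognisable, and `SAMPLE` is a tiny stand-in for a real corpus file. The `bit_mash` loop flips one bit at a time and records which byte offsets change the tool's answer.

```python
from typing import Callable, Dict, Set, Tuple

PDF_MAGIC = b"%PDF-"
PDF_EOF = b"%%EOF"


def identify(data: bytes) -> str:
    """Hypothetical toy identifier: looks at the header first, then the tail."""
    if not data.startswith(PDF_MAGIC):
        return "UNKNOWN"
    # The header matched, so the format can be named even if the tail is damaged.
    if PDF_EOF not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def bit_mash(
    data: bytes, tool: Callable[[bytes], str]
) -> Tuple[str, Dict[int, Set[str]]]:
    """Flip every bit in turn and collect, per byte offset, the results
    that differ from the tool's result on the undamaged data."""
    baseline = tool(data)
    sensitive: Dict[int, Set[str]] = {}
    for bit in range(len(data) * 8):
        result = tool(flip_bit(data, bit))
        if result != baseline:
            sensitive.setdefault(bit // 8, set()).add(result)
    return baseline, sensitive


# Stand-in for a corpus file; a real run would read bytes from disk.
SAMPLE = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"

if __name__ == "__main__":
    baseline, sensitive = bit_mash(SAMPLE, identify)
    print("baseline:", baseline)
    for offset in sorted(sensitive):
        print(offset, sorted(sensitive[offset]))
```

Running the same loop with a second tool wrapped as a `bytes -> str` callable, and comparing the two maps of sensitive offsets, is one way such a comparison between tools might be set up. The exhaustive loop costs one tool invocation per bit, so real files would probably need sampling rather than a full sweep.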