DROID is a tool used by the National Digital Heritage Archive (NDHA) to automatically identify file formats during ingest into the Rosetta digital preservation system. The NDHA conducted research comparing format identification results across different versions of DROID and found that 75% of file types were identified consistently, while 26% showed multiple possible identifications. The NDHA recommends creating a test dataset spanning all file formats, more research on format identification persistence, and robust testing of new DROID signatures to limit impacts on users.
Jay Gattuso Persistently Identifying Formats
1. ‘Persistently’ Identifying Formats
PRONOM, DROID and the NDHA
Jay Gattuso
Digital Preservation Analyst
National Digital Heritage Archive
National Library of New Zealand
2. Summary
How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations
3. DROID & PRONOM
• PRONOM is the most widely used file format registry in the sector
• DROID is a tool that ‘identifies’ file types (based on PRONOM records)
• Both are from TNA (UK)
• DROID Signature v59
– 551 signature sets
– 864 file type records
www.nationalarchives.gov.uk/PRONOM/Default.aspx
Image: EP/1958/2520-F. Registry, Hunter Building, Victoria University of Wellington. Photograph taken for the Evening Post newspaper, 31 Jul 1958. Alexander Turnbull Library.
4. Rosetta – A Brief History
• NLNZ Digital Preservation Repository
• 4 years since inception
• 18 months out of project
• 8 significant upgrades/software revisions
• ~6 Million digital objects to date
• Backbone of the ANZ GDAP
Image: 1/1-000008-G. Smiley's stables and horse repository, Whanganui. Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library.
5. Write Once, Read Many
Inside Rosetta, format identification is a ‘WORM’ process.
As a part of the ingest routine, format identification is automatically undertaken, written to the file records and the system database, and used thereafter as a consistent ‘label’.
We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository.
Image: E-272-f-001. Abbot, John 1751-1840: Original drawings of insects by J Abott. [1816?] Alexander Turnbull Library.
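The write-once, read-many idea can be illustrated with a minimal sketch. This is not Rosetta's implementation: the class name `FormatLabelStore`, the object ID and the in-memory dictionary are all hypothetical, chosen only to show a label that is recorded once at ingest and then only read.

```python
class FormatLabelStore:
    """Hypothetical write-once store mapping an object ID to its format label (a PUID)."""

    def __init__(self):
        self._labels = {}

    def write(self, object_id, puid):
        # Refuse to overwrite: the label is written once, at ingest.
        if object_id in self._labels:
            raise ValueError(f"format label already recorded for {object_id}")
        self._labels[object_id] = puid

    def read(self, object_id):
        # Reads can happen any number of times and always return the same label.
        return self._labels[object_id]


store = FormatLabelStore()
store.write("object-1", "fmt/12")   # illustrative ID and PUID
print(store.read("object-1"))
```

A second `write` for the same object raises an error, which is the property that makes the label dependable for planning, and also the reason a later change in DROID's answer is not reflected automatically.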
6. Behaviours and functions based on DROID format assertions
Rosetta uses DROID to automatically establish format type.
8. Shape Sorting...
Where:
• The area inside the box is Rosetta
• Each block is a DO (digital object)
• Each shape is a format
• The ‘Sorter’ is DROID
9. Shape Sorting...
Process:
• A record is kept of the ‘shape’ of the hole through which the DO entered the box
• The record is used by the system to trigger activities
• The DO can be removed from the box using the same shaped hole it used on entry
10. Shape Sorting...
Expectations:
• The ‘Sorter’ never changes
• The blocks never change
• A DO placed in the box yesterday will be the same shape tomorrow
• A DO placed in the box yesterday will be extractable via the same shape tomorrow
11. Shape Sorting...
The reality for NDHA:
• DROID has undergone 2 major revisions
• Container signatures have been included
• Since Rosetta v1 release:
– 406 new formats,
– 600 changes to signatures
– (This is generally a good thing!)
12. Identifying and Quantifying Change
• Rosetta has used DROID versions 3 and 5, currently testing with 6
• Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52
• Proposal to use a new DROID method in Rosetta
• How has/will this affect the way we characterise Digital Objects at the NDHA?
Image: EP/1958/0585-F. Signature of Queen Elizabeth II in a visitors book. Negatives of the Evening Post newspaper. Feb 1958. Alexander Turnbull Library.
13. Identifying and Quantifying Change
• Source set:
– 26,000 digital objects,
– ~600 GB of content,
– spanning 61 format types
– all from the live system
• DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested
• Signatures v13, v37, v45, v49 and v50 tested
• All files tested with and without file extensions
Image: EP/1990/0432/29-F. New school patrol system being tested, Wellington. Photograph taken by John Nicholson ca 2 Feb 1990. Alexander Turnbull Library.
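To get a feel for the size of this test matrix, the arithmetic can be sketched as follows. The sketch assumes every DROID version was run against every signature version, with and without extensions. The slides do not state that the full cross product was run, so the totals are an upper-bound illustration rather than a reported figure.

```python
from itertools import product

# Values taken from the slide; the full cross product is an assumption.
droid_versions = ["v3", "v5", "v6", "v6 FAST"]
signature_versions = ["v13", "v37", "v45", "v49", "v50"]
extension_modes = ["with extension", "without extension"]

runs = list(product(droid_versions, signature_versions, extension_modes))
objects = 26_000

print(len(runs))            # 40 test configurations
print(len(runs) * objects)  # 1040000 object-level identifications
```

Under that assumption, each digital object would be identified 40 times, once per configuration.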
14. Identifying and Quantifying Change
• 1 million DROID ‘assertions’ captured
• Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results
• Paper completed and will be available on the OPF website
www.openplanetsfoundation.org
Image: DCDL-0004533. Eric Idle. 5 December, 2007. Webb, Murray, 1947- : Digital caricatures published from 29 July 2005 onwards. Alexander Turnbull Library.
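One way assertions could be stored and filtered is sketched below. It is not the NDHA's code: the table layout, column names and sample rows are invented for illustration, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory database standing in for MySQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assertions (object_id TEXT, droid TEXT, signature TEXT, puid TEXT)"
)

# Invented sample rows: one row per PUID asserted for an object in a given run.
rows = [
    ("obj-1", "v3", "v13", "fmt/12"),
    ("obj-1", "v6", "v50", "fmt/12"),
    ("obj-2", "v3", "v13", "fmt/14"),
    ("obj-2", "v6", "v50", "fmt/14"),
    ("obj-2", "v6", "v50", "fmt/6"),
]
conn.executemany("INSERT INTO assertions VALUES (?, ?, ?, ?)", rows)

# Count how many distinct PUIDs each object received across all runs.
query = """
    SELECT object_id, COUNT(DISTINCT puid) AS puid_count
    FROM assertions
    GROUP BY object_id
    ORDER BY object_id
"""
puid_counts = dict(conn.execute(query).fetchall())
print(puid_counts)  # {'obj-1': 1, 'obj-2': 2}
```

Objects with a count above 1 are the ones whose identification did not stay constant across the tested runs.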
15. Summary of Results
Of the 61 tested file types:
75% performed identically for all tested versions of DROID and signature versions
fmt/49 (RTF 1.4)
16. Summary of Results
Of the 61 tested file types:
40% consistently offered a single PUID across the range of DROID tests
By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe
fmt/12 (PNG 1.1)
17. Summary of Results
Of the 61 tested file types:
In 26% of the file types multiple PUIDs are equally asserted by DROID at various times.
By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc
fmt/7 (TIF format)
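The result categories on these summary slides can be read as properties of the set of PUIDs asserted in each test run. The sketch below is one possible interpretation, not the method used in the paper: the function name, the run labels and the sample PUIDs are all illustrative.

```python
def summarise(assertions_by_run):
    """Summarise one file's results.

    assertions_by_run maps a run label to the set of PUIDs asserted in that run.
    Assumes at least one run is present.
    """
    results = list(assertions_by_run.values())
    all_puids = {puid for result in results for puid in result}
    return {
        # Every run produced exactly the same set of PUIDs.
        "identical_across_runs": all(r == results[0] for r in results),
        # Every run produced one PUID, and it was always the same one.
        "single_puid_throughout": all(len(r) == 1 for r in results)
        and len(all_puids) == 1,
        # At least one run asserted more than one PUID.
        "multiple_puids_in_some_run": any(len(r) > 1 for r in results),
    }


stable = {"DROID v3 / sig v13": {"fmt/12"}, "DROID v6 / sig v50": {"fmt/12"}}
mixed = {"DROID v3 / sig v13": {"fmt/14"}, "DROID v6 / sig v50": {"fmt/14", "fmt/7"}}

print(summarise(stable))
print(summarise(mixed))
```

The flags are deliberately independent rather than mutually exclusive, since the percentages on the slides overlap and do not sum to 100%.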
18. Summary of Results
Of the 61 tested file types:
In 16% of the file types DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode
By extension: epubs, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe
fmt/6 (Waveform Audio)
19. Recommendation 1
There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing
(This should be community created)
ExL-fmt/62 - fmt/189 (MS Open Office XML 2007)
20. Recommendation 2
It is strongly recommended that more research is undertaken looking at the persistence of PUIDs to give a more complete history of file type assertions by PRONOM/DROID
fmt/14 (PDF 1.0)
21. Recommendation 3
Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and efforts are made to maintain consistency with legacy signatures, and limit impact on users
x-fmt/263 (ZIP format)
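A pre-release check of this kind could be as simple as comparing identifications from a candidate signature file against a stored baseline for the same test files. The sketch below assumes both result sets are already available as dictionaries. The function name, object IDs and sample PUIDs are hypothetical, and no DROID invocation is shown.

```python
def changed_identifications(baseline, candidate):
    """Return {object_id: (old_puids, new_puids)} for objects whose result differs."""
    changes = {}
    for object_id, old in baseline.items():
        # An object missing from the candidate results counts as "no identification".
        new = candidate.get(object_id, set())
        if new != old:
            changes[object_id] = (old, new)
    return changes


# Invented results: the baseline from a legacy signature file, the candidate from a new one.
baseline = {"obj-1": {"fmt/12"}, "obj-2": {"x-fmt/263"}}
candidate = {"obj-1": {"fmt/12"}, "obj-2": {"x-fmt/263", "fmt/189"}}

changes = changed_identifications(baseline, candidate)
for object_id, (old, new) in changes.items():
    print(object_id, sorted(old), "->", sorted(new))
```

An empty result would mean the candidate signatures reproduce the legacy identifications for the whole test set. Any entries are the cases to review before release.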
22. Recap
How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations
23. Thank you
jay.gattuso@dia.govt.nz
Rosetta demo – Wednesday 28th March
9am to 1pm @ NLNZ - 77 Thorndon Quay
Paper available through the Open Planets Website
www.openplanetsfoundation.org