125 Databases for the
Year 2080
A technology challenge and how it can be met
Dr. Kai Naumann – Landesarchiv Baden-Württemberg (Germany)
WADL Workshop on IJDC 2020, Wuhan (China)
Landesarchiv Baden-Württemberg at a glance
• knowledge centre about the past of
the state of Baden-Württemberg
• key research infrastructure
• saves records of all kinds as cultural
heritage, preserves them and makes
them accessible
• provides transparency of
governmental, administrative, and
judicial decision-making
• archives government websites and
other sites with relevance to Baden-
Württemberg since 2006 --> about
300 URLs twice a year
• 9 sites throughout the country
• 11 million EUR overall budget
• 308 employees
• 1207 years: oldest dated charter
• 10.138 consultations per year
• 152.284 meters of occupied shelves
• 2.095.106 photographs
• 13.226.262 pages of scanned
documents
• 290.783.182 dataset rows
• ∞ eternal survival as a task
Our Oldest Database – the 1961 census
• Conceived at the Statistical Offices of Germany in 1960
• Populated in 1961 on rented IBM machines
• 6 million individual punched cards destroyed by a flood
in 1968
• Surviving part: calculated sums on ca. 1,592,821
punched cards
• Migrated to magnetic tape in the 1960s
• Migrated to CD-ROM in the 1990s
• Transferred to the State Archives in 2006
• Can we do better?!
LABW StAL E 258 II Bü 214
http://www.landesarchiv-bw.de/plink/?f=2-335336
Why we set up the challenge
• Emulation as a service - enormous progress since 2010
• SIARD - method of long-term database normalization – efforts to
establish SIARD as a European Union standard
The challenge
• How do you preserve 125 databases of diverse origin for future use
from the year 2080 onwards?
• Prepare them in such a way that they can be used in as many ways as
possible in 2080.
• In the following 60 years
• a) no costs should be incurred apart from secure storage
• b) the database contents must not be publicly accessible.
How to preserve?
Pictures taken by the author
Political and legislative issues
Global Intellectual Property (IP) legislation is poorly prepared for
obsolescence.
Orphaned books (author and editor unknown) may freely be copied and
disseminated in most parts of the world.
The status of orphaned software is unclear, and risks loom from unresolved IP
claims.
In most countries of the world, no agency is responsible for preserving
software.
The European DSM directive has recently moved in a good direction, but
work has to continue in order to ensure a risk-free environment for
software emulation approaches.
CSV solution
• Choose the most important tables or prepare archival tables.
• Export them to CSV.
• Make an XML description of the fields and relations.
• Take screenshots of the graphical user interface (GUI).
• Add handbooks and tutorials for the database.
• Wait.
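The export and description steps above can be sketched as follows. This is a minimal illustration, not the archive's actual workflow: it assumes the source database is SQLite (a stand-in for whatever DBMS is really in use), and the function names and the XML element names (`database`, `table`, `field`, `relation`) are hypothetical.

```python
import csv
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def list_columns(conn, table):
    """Return (name, declared_type) pairs for a table via SQLite's PRAGMA."""
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(r[1], r[2]) for r in rows]


def export_table(conn, table, out_dir):
    """Write one table to <out_dir>/<table>.csv with a header row."""
    out_path = Path(out_dir) / f"{table}.csv"
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([d[0] for d in cursor.description])
        writer.writerows(cursor)
    return out_path


def describe_tables(conn, tables):
    """Build a small XML description of fields and foreign-key relations.

    The element and attribute names are invented for this sketch.
    """
    root = ET.Element("database")
    for table in tables:
        t = ET.SubElement(root, "table", name=table, file=f"{table}.csv")
        for name, decl in list_columns(conn, table):
            ET.SubElement(t, "field", name=name, type=decl)
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            ET.SubElement(
                t, "relation",
                column=fk[3], refTable=fk[2], refColumn=fk[4] or "",
            )
    return ET.tostring(root, encoding="unicode")
```

Keeping the description in a separate XML file means the CSV files stay readable by any tool, while field types and relations are still recorded next to them.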
XML Solution
• Choose the most important tables or prepare archival tables.
• Export them to an XML Schema containing the most important
features of the DBMS (e.g. SIARD Schema).
• Take screenshots of the graphical user interface (GUI).
• Add handbooks and tutorials for the database.
• Wait.
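To make the XML export step more concrete, here is a minimal sketch that serializes one table into a single XML document. It is not SIARD-conformant: the element layout (`columns`, `row`, positional cells `c1`, `c2`, ...) is an assumption of this example, SQLite stands in for the real DBMS, and a real SIARD package should be produced with a dedicated SIARD tool.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Serialize one table as <table><columns/><row><c1>..</c1></row></table>.

    Hypothetical layout for illustration only; not the SIARD schema.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]

    root = ET.Element("table", name=table)
    cols_el = ET.SubElement(root, "columns")
    for i, name in enumerate(columns, start=1):
        ET.SubElement(cols_el, "column", id=f"c{i}", name=name)

    for values in cursor:
        row = ET.SubElement(root, "row")
        for i, value in enumerate(values, start=1):
            if value is None:
                continue  # NULL is represented by omitting the cell
            ET.SubElement(row, f"c{i}").text = str(value)

    return ET.tostring(root, encoding="unicode")
```

Compared with CSV, this keeps column names and the NULL/empty-string distinction inside the same file as the data.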
Disk image solution
• Take a disk image of the client hardware.
• Take a disk image of the server hardware.
• Preserve necessary Operating System environments.
• Add handbooks or tutorials for the database.
• Regularly check performance of the emulation software stack.
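A precondition for any later emulation test is that the stored images are still bit-identical to what was captured. The slides do not describe a fixity procedure, so the following is only a sketch under that assumption: it records SHA-256 checksums for every file in an image directory and reports files whose checksum has changed or which have gone missing. The function names and the manifest format (a plain dictionary) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its SHA-256 checksum."""
    base = Path(directory)
    return {
        str(p.relative_to(base)): sha256_of(p)
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def changed_files(directory, manifest):
    """Return names from the manifest that are missing or no longer match."""
    current = build_manifest(directory)
    return sorted(
        name for name in manifest if current.get(name) != manifest[name]
    )
```

The manifest would be built once at ingest and stored separately from the images; `changed_files` can then be run at each scheduled check.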
Docker image solution
• Take a Docker image of the client software.
• Take a Docker image of the server software.
• Preserve necessary Operating System environments.
• Add handbooks or tutorials for the database.
• Regularly check performance of the emulation software stack.
Web Crawler solution
• This only works for databases with a full web-based frontend
displaying a complete list of their objects.
• Let a crawler capture all database content as HTML/JavaScript in a
web archive container (e.g. a WARC file).
• Regularly visit the crawl to test accessibility.
• In order to make quality assessments:
• Let Archive.org crawl the server as well
• Also use the CSV solution on the data
Solutions and their cost forecast
[Chart: cost forecast per solution from 01.01.2020 to 01.01.2080 in two-year steps; vertical axis 0 to 250 (unit not stated); series: CSV Solution, XML Solution, Disk Image Solution, Docker Image Solution, Web Crawler Solution]
Any questions? Want to join the quest?
• Further ideas, business models welcome!
• I will try to continue collecting answers at #WeMissiPRES
• You are invited to a workshop on the issue in Stuttgart (Germany) in
2021!
• Contact me:
• Dr. Kai Naumann, Landesarchiv Baden-Württemberg
• kai <dot> naumann <at> la-bw <dot> de
• Twitter @Naumann_Kai
• Phone 0049 711 212 4284
