European Archival Records and Knowledge Preservation
#earkproject www.eark-project.eu @EARKProject
An OAIS-oriented System for Fast Package Creation, Search, and Access
Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan Rörden, Janet Delve, Kuldar Aas
Presenter: Sven Schlarb <sven.schlarb@ait.ac.at>
AIT Austrian Institute of Technology
IPRES 2016
Bern, October 3, 2016
The E-ARK project is co-funded by the European Commission under the ICT-PSP programme.
Main outcomes
● E-ARK has defined a basic structure and recommended metadata standards for information packages.
● E-ARK has created a reference implementation covering the functional entities for ingest, archiving, and access according to the OAIS reference model.
● The SME partners KEEP Solutions and ESS have adapted their archiving solutions.
– RODA repository (KEEP)
– ESS Preservation Platform (ESS)
● AIT has developed an environment for processing information packages (SIP, AIP, DIP).
– Providing a graphical front-end called earkweb.
● AIT has developed a scalable backend repository for storing, discovering, and accessing data contained in information packages.
– Initially based on the Lily repository project, now Cloudera Search.
Reference Implementation Functionality
• Package transformation and Ingest: modular package transformation workflows & metadata creation
• Full-text indexing & search: parallelized full-text indexing
• Access: fast random access to individual files
• Faceted Search & Data Mining: aggregating data using facet queries; data mining (classification, NER)
E-ARK Archival Workflow
• Pre-Ingest (Producer)
– Tasks: SIP Creation, Validation, Submission
– E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb
• Ingest
– Tasks: SIP Validation, Archival Processing, AIP Creation
– E-ARK Tools: earkweb, RODA, EPP
• Archival Storage
– Tasks: Storage to Archival Repository
– E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP
• Data Management
– Tasks: Discover, Select, and Manipulate Records
– E-ARK Tools: Lily Repository, RODA, EPP
• Access
– Tasks: DIP Creation and activation (e.g. within an RDBMS)
– E-ARK Tools: earkweb, RODA
E-ARK Information Package (simplified)
[Figure: an information package with representations, metadata (structural, provenance, technical, descriptive), and schemas/documentation. Lifecycle labels: SIP, DIP; metadata edits, migrations, add emulation info.]
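The simplified package structure above can be sketched as a directory layout. This is only an illustration: the folder names are assumptions derived from the labels in the diagram, treating schemas and documentation as optional is a guess based on the brackets around them, and none of it should be read as the normative E-ARK package specification.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names, taken from the labels on the slide diagram.
METADATA_KINDS = ["descriptive", "technical", "provenance", "structural"]
OPTIONAL_DIRS = ["schemas", "documentation"]  # assumed to be optional


def create_package_skeleton(root: Path, name: str, with_optional: bool = True) -> Path:
    """Create an empty information package directory tree and return its path."""
    package = root / name
    (package / "representations").mkdir(parents=True)
    for kind in METADATA_KINDS:
        (package / "metadata" / kind).mkdir(parents=True)
    if with_optional:
        for extra in OPTIONAL_DIRS:
            (package / extra).mkdir()
    return package


def list_dirs(package: Path) -> list:
    """Return all sub-directories relative to the package root, sorted."""
    return sorted(
        p.relative_to(package).as_posix() for p in package.rglob("*") if p.is_dir()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pkg = create_package_skeleton(Path(tmp), "example-sip")
        for entry in list_dirs(pkg):
            print(entry)
```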
earkweb
• earkweb is based on Python and the Celery task execution system.
– Archival workflows are created from predefined tasks, which can be executed in parallel on a computer cluster.
– Example tasks are data validation, format migration, content extraction, database transformation, packaging, and interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
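The idea of composing a workflow from predefined tasks and running it over several packages in parallel can be sketched as follows. To stay self-contained, the sketch uses Python's standard `concurrent.futures` on one machine as a stand-in for Celery on a cluster. The task functions and the dictionary used to represent a package are hypothetical and do not reflect earkweb's actual task API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Package = Dict  # hypothetical in-memory stand-in for an information package
Task = Callable[[Package], Package]


def validate(pkg: Package) -> Package:
    """Mark the package as valid if it lists at least one file."""
    return {**pkg, "valid": len(pkg["files"]) > 0}


def extract_content(pkg: Package) -> Package:
    """Record which files would be handed to content extraction."""
    return {**pkg, "extracted": [f for f in pkg["files"] if f.endswith(".pdf")]}


def package_aip(pkg: Package) -> Package:
    """Finish the workflow by assigning a status to the package."""
    return {**pkg, "status": "AIP" if pkg["valid"] else "rejected"}


def run_workflow(pkg: Package, tasks: List[Task]) -> Package:
    """Apply the predefined tasks to one package, in order."""
    for task in tasks:
        pkg = task(pkg)
    return pkg


def run_batch(packages: List[Package], tasks: List[Task], workers: int = 4) -> List[Package]:
    """Process several packages in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_workflow(p, tasks), packages))


if __name__ == "__main__":
    workflow = [validate, extract_content, package_aip]
    batch = [
        {"id": "pkg-1", "files": ["report.pdf", "data.csv"]},
        {"id": "pkg-2", "files": []},
    ]
    for result in run_batch(batch, workflow):
        print(result["id"], result["status"])
```

Keeping each task a plain function from package to package is what makes the workflow modular: tasks can be reordered or swapped without touching the runner.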
E-ARK Lily/Hadoop Repository
• The E-ARK Lily repository provides a scalable backend for storing, discovering, and accessing AIPs, based on technologies such as Solr, MapReduce, and HBase.
– The repository is fully distributed, which allows it to handle very large amounts of data.
– It provides full-text search, browsing, and random access to data contained in IPs.
– It provides APIs for running computations (such as data mining tasks) across the archived content.
Cluster Deployment Stack
[Figure: four workers and a staging/storage area (NAS). Labels: <<package transfer>>, decoupled, <<notification>>, <<search and retrieval>>; information package status; task results.]
Standalone Deployment Stack
[Figure: four workers and a staging/storage area (NAS). Labels: <<indexing>>, <<search and retrieval>>; information package status; task results.]
Search & Access
• Search within and across information packages
– Full text index for office documents, PDF, MS Word, etc.
– Search based on defined fields, e.g. size, mime-type, package, etc.
– Results directly linked with the Lily content repository
• Faceted queries that group search results into
different categories
• Spatio-temporal search in geographical datasets
• Filter search according to estimated text category
(machine learning/text classification)
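A faceted, field-filtered search of this kind can be sketched as a Solr select request. The parameters `q`, `fq`, `facet`, `facet.field`, `rows`, and `wt` are standard Solr query parameters; the host, the collection name `eark`, and the field names `content_type` and `package` are assumptions made for illustration and would need to match the actual index schema.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Hypothetical endpoint; host and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/eark/select"


def build_faceted_query(
    text: str,
    facet_fields: List[str],
    filters: Optional[Dict[str, str]] = None,
    rows: int = 10,
) -> str:
    """Build a Solr select URL with full-text query, filters, and facets."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field whose value counts are wanted.
    params += [("facet.field", field) for field in facet_fields]
    # Filter queries narrow the result set without affecting scoring.
    params += [("fq", f'{field}:"{value}"') for field, value in (filters or {}).items()]
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_faceted_query(
        "annual report",
        facet_fields=["content_type", "package"],  # hypothetical field names
        filters={"content_type": "application/pdf"},
    )
    print(url)
```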
Data Mining/NLP
• Purpose:
● Show, by way of example, how digital resources contained in the archive can be analysed.
• Selected use cases:
● Location names occurring in texts
● Named entity recognition and incorporation of geo-information
● Text classification
Location names occurring in texts
● Stanford NER for named entity recognition
● Nominatim (the geocoding service used by openstreetmap.org) for georeferencing
● Peripleo for visualization
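The georeferencing step can be sketched as follows, assuming the location names have already been extracted by the NER step. The sketch only builds requests for the public Nominatim search endpoint and parses a response of the shape Nominatim returns (a JSON list whose entries carry `lat` and `lon` as strings). It makes no network call, and the sample response is hand-written for illustration, not real output.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlencode

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def geocode_url(place_name: str, limit: int = 1) -> str:
    """Build a Nominatim search request for one location name."""
    query = urlencode({"q": place_name, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH_URL}?{query}"


def first_coordinates(response_body: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) of the first hit in a Nominatim JSON response, or None."""
    hits = json.loads(response_body)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


if __name__ == "__main__":
    # Location names as a NER step might return them (hand-picked here).
    for name in ["Bern", "Vienna"]:
        print(geocode_url(name))
    # Hand-written sample in the shape of a Nominatim response, not real output.
    sample = '[{"lat": "46.948", "lon": "7.447", "display_name": "Bern"}]'
    print(first_coordinates(sample))
```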
[Figure: location names occurring in texts, shown in Peripleo (PELAGIOS Project).]
Geographical/timeline search
Peripleo - PELAGIOS Project
● Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
● Convert GML data to Peripleo RDF
● Translate the coordinate system if necessary
● Use Peripleo to search for and visualize regions and filter by time
[Figure: geographical/timeline search in Peripleo (PELAGIOS Project).]
Text classification using scikit-learn
● Prepare data to train an SVM classifier
● Dump the full texts of the repository into reusable packages
● Apply text classification and update the Solr records accordingly
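The three steps above can be sketched with a TF-IDF plus linear SVM pipeline in scikit-learn. The training texts, the labels, the document ids, and the Solr field name `category` are all made up for illustration. The update documents use Solr's atomic-update `set` syntax but are only built here, not sent to a server.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hand-written training set; real data would come from the repository dump.
TRAIN_TEXTS = [
    "invoice payment budget tax",
    "budget forecast invoice accounting",
    "court ruling contract law",
    "contract clause legal court",
]
TRAIN_LABELS = ["finance", "finance", "legal", "legal"]


def train_classifier(texts, labels):
    """Fit a TF-IDF + linear SVM pipeline on labelled full texts."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(texts, labels)
    return model


def build_solr_updates(model, documents, field="category"):
    """Return Solr atomic-update documents that set the predicted category.

    `documents` maps a record id to its full text; `field` is a hypothetical
    Solr field name.
    """
    ids = list(documents)
    predictions = model.predict([documents[doc_id] for doc_id in ids])
    return [
        {"id": doc_id, field: {"set": str(label)}}
        for doc_id, label in zip(ids, predictions)
    ]


if __name__ == "__main__":
    classifier = train_classifier(TRAIN_TEXTS, TRAIN_LABELS)
    dumped_texts = {"doc-1": "invoice payment budget", "doc-2": "court contract law"}
    for update in build_solr_updates(classifier, dumped_texts):
        print(update)
```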
Database archiving, rebuilding and analysis
[Figure (source: Wikipedia). Labels: SIARD; RDBMS data (up to 80 TB); e.g. Postgres; e.g. Oracle.]
Submit ... Archive ... Reconstruct ... Analyse.
Current Pilots
• National Archive of Hungary
– Full-scale cluster deployment of earkweb and the Hadoop/Lily back-end.
– Ingest, search, and access on large volumes of AIPs.
• National Archive of Slovenia
– earkweb and Peripleo installation for ingesting, visualising, and searching geo-data.
• Danish National Archives
– earkweb standalone installation
Want to try it out?
• Single-machine deployment of the E-ARK
Reference Implementation available online:
http://earkdev.ait.ac.at/earkweb
• Oracle VirtualBox VM (Standalone
Deployment!) available for download:
http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova
• General information about E-ARK:
http://www.eark-project.eu
