Building new knowledge from distributed scientific corpus: HERBADROP & EUROPEANA, two concrete case studies for exploring big archival data



This paper presents approaches for building new
knowledge using emerging methods and big data technologies
together with archival practices.
Two case studies are considered. The first, HERBADROP,
is concerned with the preservation and analysis of
herbarium images. The second, EUROPEANA, investigates
how to facilitate the re-use of cultural heritage language
resources for research purposes. Both case studies are
concerned with the use of valuable heritage resources
within the EUDAT (European Data) infrastructure.
HERBADROP relies on the data services
provided by EUDAT for long-term preservation, while EUROPEANA
relies on EUDAT to achieve citability and persistent
identification of cultural heritage datasets.
EUDAT is an initiative of some of the main European data
centers, together with community research infrastructure
organisations, to build a common e-Infrastructure for general
research data management.
In this paper, we show how technological trends may offer
new research potential in the domain of computational archival
science, in particular for appraising the challenges of producing
quality, meaning, knowledge and value from quantity, and for tracing
data and analytic provenance across complex big data platforms
and knowledge production ecosystems.

Published in: Data & Analytics
  1. www.eudat.eu. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065. Building new knowledge from distributed scientific corpus. HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data. 2nd Computational Archival Science (CAS) workshop, Boston, USA, December 2017. Pascal Dugénie, Daan Broeder, Nuno Freire.
  2. Massively distributed collections. Digital Infrastructures for Research: opportunities for preserving valuable scientific heritage. Collaborative Data Infrastructure (CDI); Trusted Digital Repositories (TDR); ISO 16363, ISO 14721 (OAIS); high-speed network infrastructures. LONG-TERM PRESERVATION: monitoring; data storage; persistent ID; metadata; data curation and policies. Natural heritage; cultural heritage. HPC infrastructures; BIG DATA analysis tools; sharing distributed corpora; extraction of text in images; knowledge building; visibility of data.
  3. EUDAT: A truly pan-European Infrastructure. EUDAT offers common data services to both research communities and individuals through a large network of European organisations. EUDAT wants to enable European researchers from any discipline to preserve, find, access, and process data in a trusted environment, as part of a Collaborative Data Infrastructure. European infrastructures; Technology Providers; Research Communities.
  4. B2 Service Suite. Covering both access and deposit, from informal data sharing to long-term archiving, and addressing identification, discoverability and computability of both long-tail and big data, EUDAT services seek to address the full lifecycle of research data.
  5. Building solutions with the communities. Common Language Resources and Technology Infrastructure (CLARIN); European Network for Earth System Modelling (ENES); Distributed infrastructure for life-science information (ELIXIR); European Plate Observing System (EPOS) - Solid Earth sciences Research Infrastructure; Integrated Carbon Observation System (ICOS) to quantify and understand greenhouse gas balance; Long-Term Ecosystem Research (LTER) in Europe. EUDAT services are designed, built and implemented together with user communities.
  6. Challenges and problems to be solved. Digitized images: physical copies are fragile; digital copies must be preserved. Exploitation of digital copies: descriptive metadata and classification are complex; images contain a lot of information that should be extracted and made available.
  7. Herbadrop rationale. • Millions of specimens in herbaria all over the world • Global trend towards industrial-scale digitization • Data difficult to handle even for medium-sized institutes • Same challenges being faced by hundreds of herbaria in Europe • Makes sense to work together to develop a solution. tiff: 180MB, zip: 80MB, jpg: 1MB, Total: 161MB
  8. Herbadrop in Europe: MEISE, BE
  9. Herbadrop objectives. 1. PRESERVATION: long-term preservation of herbarium specimen images. 2. INFORMATION EXTRACTION: extracting information from images by using Optical Character Recognition (OCR); basic image analysis techniques. 3. KNOWLEDGE BUILDING: deep learning using OCR results; access for the whole community for crowdsourcing. The slide distinguishes the current scope from perspectives.
  10. HERBADROP/EUDAT Workflows. Stage labels: STORAGE, TRANSFER, OCR, ACCESS, MONITORING, ARCHIVING, CERTIFICATION. Activities: transferring images using the B2SAFE service; performing OCR analysis using HPC; ingesting OCR results in a full-text indexing engine; controlling data quality (file format and integrity); surveying bit-stream integrity and data quality; ingesting images and metadata for long-term archiving; producing regular statistical reports; monitoring data and processes; harvesting and indexing metadata; offering open access to the full-text engine, images and metadata; implementing a DSA-based certification including an appropriate SLA. Diagram labels: images; status reports; statistics.
  11. Europeana: European Cultural Heritage on the Web. The main goal of Europeana is to provide access to cultural heritage and encourage people to engage with culture. • And the main access point is the Web! • Promoting the research use of heritage data resources is in its early stages of development. CC BY-SA
  12. The Challenges (1/2). The Generic Challenge: how to facilitate the re-use of Cultural Heritage language resources for research purposes … by exploiting the existing and emerging European research infrastructures. How can the resources be discovered? How can the resources be shared in practical ways for researchers? How can advanced computation be applied to these Cultural Heritage datasets? How can the resources and datasets be cited and referenced in research? How can the Cultural Heritage institutions re-use the outcomes of research?
  13. The Challenges (2/2). The Specific Challenges of the Pilot: identifying requirements for technical interoperability between the two infrastructures; creating best practice guidelines for the publication and citation of cultural heritage data; facilitating collaborative work between researchers, with a focus on humanities, social sciences and computer science.
  14. Europeana Newspapers Corpus. The pilot aims to expose the full text aggregated in the Europeana Newspapers project. This corpus contains over 11 million pages of full text of historic newspapers, mainly from the 19th century, aggregated from national and research libraries across Europe. The pilot aims to expose and improve the text for more data-driven usage, based on EUDAT data services.
  15. EUDAT service uptake. The Europeana Newspapers pilot relies on the following EUDAT services. Research data storage and sharing (B2SHARE): to undertake the enrichment of the datasets and, more generally, to expose them for re-use by other academics, particularly those outside the digital humanities. Persistent Identification Service (B2HANDLE): persistent identification of the main objects of the full-text corpus, namely the newspaper titles and individual issues. Multi-disciplinary joint metadata catalogue (B2FIND): so that scientists will be able to obtain the full corpus for machine processing, or select just a portion of the corpus, benefitting from the enrichment of article-level annotations with named entities and topics.
  16. Conclusions & Perspectives
  17. Conclusions. • General conclusions: a successful application of the EUDAT services was achieved; heritage research data brought new requirements to EUDAT. • HERBADROP: the application of EUDAT’s computational capabilities is identifying new challenges: how to address poor-quality OCR; the amount of data is large and may become a limitation for accurate and exhaustive analysis. • EUROPEANA: learned about the requirements of research usage; some may have an impact on its data providers.
  18. HERBADROP and EUROPEANA: some perspectives for data services. Improving discoverability of heritage research data resources: full-text based; metadata based. Additional heritage-specific metadata support in EUDAT: data formats support and semantics; semantic annotations. Computational processing for heritage use cases: OCR; image analysis tools.
  19. For additional information: Nuno Freire, Europeana DSI/INESC-ID
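
The HERBADROP/EUDAT workflow slide lists controlling data quality (file format and integrity) and surveying bit-stream integrity among its activities, without saying how these checks are carried out. A minimal sketch of one common approach, a checksum manifest recorded at ingest and re-verified later, is given below. The choice of SHA-256, the in-memory manifest and the function names `sha256_of`, `build_manifest` and `verify_manifest` are all assumptions made for illustration; they are not taken from the slides and do not describe how B2SAFE itself operates.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large TIFF images do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record one checksum per file under root (hypothetical ingest step)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Re-hash every file named in the manifest and classify it as
    ok, corrupted (checksum differs) or missing (file no longer present)."""
    report: dict[str, list[str]] = {"ok": [], "corrupted": [], "missing": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected:
            report["corrupted"].append(rel_path)
        else:
            report["ok"].append(rel_path)
    return report
```

In such a setup, `build_manifest` would run once when images are ingested, and `verify_manifest` would run periodically; the lengths of the three lists in its report could feed the regular statistical reports mentioned on the same slide.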