F2 kepa rodriguez_ehri_integration_retrieva_minerva_2016


Dr. Kepa Rodriguez, Data and Content Specialist, Archives Division, Yad Vashem
Integration and Retrieval of Heterogeneous Archival Metadata
2016 EVA/Minerva Jerusalem International Conference on Digitisation of Cultural Heritage
http://2016.minervaisrael.org.il
http://www.digital-heritage.org.il

  1. EVA/Minerva 2016
     Integration and Retrieval of Heterogeneous Archival Metadata
     CONNECTING COLLECTIONS
     Kepa J. Rodriguez – Archives Yad Vashem
     09/11/2016
  2. Outline
     ● Data integration in the first phase of the project
     ● Our current integration approach
     ● Retrieval of data using controlled vocabularies
     ● Development of the EHRI controlled vocabularies
  3. Data integration in the first phase of the project
     ● Holding institutions delivered data in very different formats:
       ● XML, text files, CSV, JSON, etc.
     ● Ingestion into the portal was done case by case
       ● We interpreted each institution's data model and mapped it to our model
       ● Sometimes without help from the institution
       ● A lot of data was entered by hand
     ● The process is not sustainable; it cannot be repeated
     ● No automatic updates are possible
       ● If an institution updates content, the data has to be updated by hand
     ● Other problems: infrastructure, persistent identifiers, etc.
  4. Proposal for the second phase of the project
     ● Data conversion
     ● Data publication and synchronization
     ● Data ingestion
  5. Data conversion
     ● Conversion tool: different data formats into EAD
       ● XML, JSON, CSV...
     ● Generic transformation
       ● Useful for a significant number of institutions
     ● Reusable functions, such as mappings from specific fields of an institution's export format into EAD
     ● Utilities to configure specific transformations
     ● Validation of the output:
       ● Machine validation: XML validation protocols
         ● Schematron, RNG
       ● Human validation: HTML preview including mark-up for validation errors
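The reusable field mappings described on this slide could look roughly like the following. This is a minimal sketch, not the EHRI conversion tool: the CSV column names, the `FIELD_MAPPING` table, and the `validate` check are all hypothetical, and the real tool validates with Schematron and RNG rather than a hand-written check.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from one institution's CSV columns to EAD <did> elements.
FIELD_MAPPING = {
    "reference": "unitid",
    "title": "unittitle",
    "extent": "physdesc",
}


def row_to_ead(row, mapping=FIELD_MAPPING, level="fonds"):
    """Build a minimal <archdesc> element from one CSV row."""
    archdesc = ET.Element("archdesc", {"level": level})
    did = ET.SubElement(archdesc, "did")
    for column, tag in mapping.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of emitting empty elements
            ET.SubElement(did, tag).text = value
    return archdesc


def validate(archdesc):
    """Toy stand-in for machine validation: list required elements that are missing."""
    required = ("unitid", "unittitle")
    return [tag for tag in required if archdesc.find(f"did/{tag}") is None]


sample_csv = (
    "reference,title,extent\n"
    "M.49.E,Testimonies of Holocaust Survivors,6845 files\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
record = row_to_ead(rows[0], level="subgrp")
print(ET.tostring(record, encoding="unicode"))
print("missing elements:", validate(record))
```

Keeping the mapping in a plain table means a new institution only needs a new table, not new conversion code.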
  6. EAD File sample (1)
     <archdesc level="subgrp">
       <did>
         <unitid>M.49.E</unitid>
         <unittitle encodinganalog="3.1.2">Testimonies of Holocaust Survivors collected by the Central Jewish Historical Commission in Poland, 1944-1947</unittitle>
         <physdesc encodinganalog="3.1.5">6845 files</physdesc>
         <langmaterial>
           <language langcode="deu" encodinganalog="3.4.3">German</language>
           <language langcode="pol" encodinganalog="3.4.3">Polish</language>
           <language langcode="yid" encodinganalog="3.4.3">Yiddish</language>
         </langmaterial>
         <repository>
           <corpname>ארכיון יד ושם / Yad Vashem Archives</corpname>
         </repository>
       </did>
       <scopecontent encodinganalog="3.3.1">
         <p>The collection consists of approximately 7,200 testimonies collected by the Centralna Żydowska Komisja Historyczna (Central Jewish Historical Committee) in Poland during its active years, 1944-1947. … as well as testimonies from survivors who fought in partisan units and survivors who were in hiding.</p>
       </scopecontent>
       …
  7. EAD File sample (2)
       …
       <originalsloc encodinganalog="3.5.1">
         <p>ZYDOWSKI INSTYTUT HISTORYCZNY - ZIH, WARSZAWA, POLAND</p>
       </originalsloc>
       …
       <controlaccess>
         <geogname>Poland</geogname>
         <geogname>Warsaw</geogname>
       </controlaccess>
       <controlaccess>
         <subject>Persecution of Jews</subject>
         <subject>Testimonies, Biographies</subject>
         <subject>Holocaust survivors</subject>
       </controlaccess>
       <controlaccess>
         <corpname>Centralna Żydowska Komisja Historyczna</corpname>
       </controlaccess>
     </archdesc>
  8. Data publication and synchronization
     ● We plan to use two data publication protocols:
       ● OAI-PMH: one of the first protocols for publication of data
         ● Publication of data in different formats: Dublin Core (default), EAD, etc.
         ● OAI-PMH servers are not easy for small archives to implement and maintain
         ● But we want to implement a client for institutions that already use it
       ● ResourceSync: a new protocol
         ● Based on Sitemaps
         ● Data can be published on the web page of the institution
         ● Higher security
         ● Uses sitemaps to expose changes and updates
         ● Only modified and new data will be transferred to the portal
     ● Both are standard protocols of the Open Archives Initiative
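To illustrate how a sitemap-based change list lets the portal transfer only modified and new data, here is a minimal sketch of a client reading a ResourceSync change list. The change list content and the `archive.example.org` URLs are invented, the function name is hypothetical, and the sketch assumes all timestamps use the same UTC format so that they can be compared as strings.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Invented change list; the URLs are placeholders.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2016-11-01T00:00:00Z"/>
  <url>
    <loc>http://archive.example.org/ead/M.49.E.xml</loc>
    <lastmod>2016-11-02T09:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://archive.example.org/ead/old-fonds.xml</loc>
    <lastmod>2016-11-03T10:30:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
  <url>
    <loc>http://archive.example.org/ead/early.xml</loc>
    <lastmod>2016-11-01T08:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>"""


def changed_resources(change_list_xml, since=""):
    """Return (urls_to_fetch, urls_to_delete) for changes made after `since`."""
    root = ET.fromstring(change_list_xml)
    to_fetch, to_delete = [], []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if lastmod <= since:
            continue  # already synchronized in an earlier run
        if change == "deleted":
            to_delete.append(loc)
        elif change in ("created", "updated"):
            to_fetch.append(loc)
    return to_fetch, to_delete


fetch, delete = changed_resources(CHANGE_LIST, since="2016-11-01T12:00:00Z")
print("fetch:", fetch)
print("delete:", delete)
```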
  9. Data ingestion
     ● After data is ingested into the portal, it will receive a permanent URL:
       ● A formal protocol is in progress
       ● Necessary to publish our data in the Linked Open Data cloud
     ● Updates: data will be overwritten
       ● But the portal keeps the user-generated data
     ● But... is it enough for the user just to have all the information in a single infrastructure?
  10. Data retrieval
     ● The user needs to be able to retrieve information related to selected topics, places, people, organizations, creators...
       ● Regardless of which institution holds it
       ● Regardless of the language in which the metadata is written
  11. EHRI controlled vocabularies
     ● EHRI Thesaurus
       ● Concepts: hierarchy of concepts formalized in SKOS
       ● A first set translated into 10 languages
       ● Made by historians and content specialists
     ● Authority lists:
       ● Named entities or instances of the concepts
       ● Proposed by historians and specialists: not really useful for indexing and retrieval of data
       ● During import many were added by hand to address the needs of the real data
       ● Domain-specific authorities: Ghettos, Camps, Administrative Districts
     ● Vocabularies created for applications in the portal:
       ● Two research guides
       ● Linked to the EHRI Thesaurus
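A concept hierarchy with multilingual labels is what makes language-independent retrieval possible: a query label in any language resolves to a concept, and the search is expanded to that concept's narrower concepts and all of their labels. The sketch below uses a plain dictionary as a stand-in for a SKOS thesaurus; the concept identifiers and the two sample concepts are invented for illustration and are not taken from the EHRI Thesaurus.

```python
# Invented stand-in for a SKOS concept scheme: labels per language plus a broader link.
CONCEPTS = {
    "c1": {"labels": {"en": "Camps", "de": "Lager"}, "broader": None},
    "c2": {
        "labels": {"en": "Concentration camps", "de": "Konzentrationslager"},
        "broader": "c1",
    },
}


def find_concept(label):
    """Resolve a label in any language to a concept identifier."""
    wanted = label.casefold()
    for concept_id, concept in CONCEPTS.items():
        if any(text.casefold() == wanted for text in concept["labels"].values()):
            return concept_id
    return None


def with_narrower(concept_id):
    """Return the concept and, recursively, every concept below it."""
    result = [concept_id]
    for other_id, other in CONCEPTS.items():
        if other["broader"] == concept_id:
            result.extend(with_narrower(other_id))
    return result


def query_labels(label):
    """Expand one query label to all labels of the concept and its narrower concepts."""
    concept_id = find_concept(label)
    if concept_id is None:
        return set()
    return {
        text
        for cid in with_narrower(concept_id)
        for text in CONCEPTS[cid]["labels"].values()
    }


print(sorted(query_labels("Lager")))
```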
  12. Problems of the first approach of the project
     ● A vocabulary built with knowledge about the Shoah can be helpful to represent the history, but not necessarily the documentation:
       ● The compilation of an encyclopedia and the implementation of an engine for cataloguing and retrieval are two very different things and require different strategies and kinds of expertise.
     ● The vocabularies should be able to retrieve the data that really exists:
       ● Vocabularies should be able to describe the data, not only the content, e.g. types of documents, physical format of the data...
       ● A strategy to extend the datasets when new data brings new needs has to be implemented.
  13. The reality of the data
     ● Different institutions use different systems to assign keywords (or no system)
     ● Keywords can have different relevance in different systems
       ● In a national archive “holocaust” can be a relevant keyword, but it is not relevant for the EHRI portal.
     ● The same keyword can have different meanings in different knowledge bases
       ● e.g. “labor” in one set of imported data corresponds to “forced labor”, in another set to “trade unions”
     ● Relevant information is often given as free text:
       ● Natural Language Processing is necessary to extract this information, but within the project we can do this only at an experimental level.
  14. EHRI's data driven approach (1)
     ● Extraction of access points of the EAD files during import
       <controlaccess>
         <geogname>Poland</geogname>
         <geogname>Warsaw</geogname>
       </controlaccess>
       <controlaccess>
         <subject>Persecution of Jews</subject>
         <subject>Testimonies, Biographies</subject>
         <subject>Holocaust survivors</subject>
       </controlaccess>
       <controlaccess>
         <corpname>Centralna Żydowska Komisja Historyczna</corpname>
       </controlaccess>
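Extracting access points from a file like the sample above can be sketched with the Python standard library. This is an illustration only, not the portal's import code; the `ACCESS_POINT_TYPES` table and the function name are assumptions, and the fragment is wrapped in an `<archdesc>` element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

EAD_FRAGMENT = """<archdesc level="subgrp">
  <controlaccess>
    <geogname>Poland</geogname>
    <geogname>Warsaw</geogname>
  </controlaccess>
  <controlaccess>
    <subject>Persecution of Jews</subject>
    <subject>Testimonies, Biographies</subject>
    <subject>Holocaust survivors</subject>
  </controlaccess>
  <controlaccess>
    <corpname>Centralna Żydowska Komisja Historyczna</corpname>
  </controlaccess>
</archdesc>"""

# Assumed mapping from EAD element names to the kind of access point they carry.
ACCESS_POINT_TYPES = {
    "geogname": "place",
    "subject": "subject",
    "corpname": "corporate body",
    "persname": "person",
}


def extract_access_points(ead_xml):
    """Return (kind, text) pairs for every access point inside <controlaccess>."""
    root = ET.fromstring(ead_xml)
    points = []
    for block in root.iter("controlaccess"):
        for child in block:
            kind = ACCESS_POINT_TYPES.get(child.tag)
            text = (child.text or "").strip()
            if kind and text:
                points.append((kind, text))
    return points


for kind, text in extract_access_points(EAD_FRAGMENT):
    print(f"{kind}: {text}")
```

Only elements inside `<controlaccess>` are collected, so a `<corpname>` that names the repository in `<did>` is not mistaken for an access point.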
  15. EHRI's data driven approach (2)
     ● Persons, corporate bodies:
       ● Check whether we have a corresponding authority file
       ● If we have one: link the description unit with the corresponding authority file
       ● If we don't: create a new authority file
       ● Priority of EHRI: creators of archival collections
     ● Places:
       ● Link the places with the geographical database GeoNames
       ● Problematic for historical places; some of them will be added as extra vocabulary.
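The link-or-create rule for persons and corporate bodies can be sketched as a small in-memory index. Everything here is hypothetical: the `AuthorityIndex` class, the identifier format, and the matching rule, which is exact match after case and whitespace normalization. Real matching of names would need more than that.

```python
def normalize_name(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.casefold().split())


class AuthorityIndex:
    """Toy in-memory index of authority files keyed by normalized name."""

    def __init__(self):
        self._by_name = {}
        self._next_id = 1

    def link_or_create(self, name):
        """Return (authority_id, created) for the given access point name."""
        key = normalize_name(name)
        created = key not in self._by_name
        if created:
            # No authority file yet: create one with an invented identifier.
            self._by_name[key] = f"auth-{self._next_id:06d}"
            self._next_id += 1
        return self._by_name[key], created


index = AuthorityIndex()
first = index.link_or_create("Centralna Żydowska Komisja Historyczna")
second = index.link_or_create("centralna  żydowska komisja historyczna")
print(first, second)
```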
  16. EHRI's data driven approach (3)
     ● Concepts/terms: the most complicated case
     ● Archives used very different strategies for concepts:
       ● Some institutions compose terms using different rules (or no rules)
         ● Subject: “Jews--Persecution--France” (data of USHMM)
       ● EHRI has an atomic approach
         ● Subject: “Persecution of Jews”
         ● Place: “France”
     ● Steps to process concepts/terms:
       ● Terms are normalized and de-duplicated
       ● If there are equivalent terms in the thesaurus, we establish a link
       ● If there are no equivalent terms, the concept goes to further analysis
       ● If necessary, a board of experts will consider accommodating a new concept in our concept hierarchy.
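The processing steps on this slide can be sketched as follows: composed headings are split into atomic parts, places are separated from topical terms, terms are normalized and de-duplicated, and anything without an equivalent in the thesaurus is queued for further analysis. The lookup tables (`THESAURUS`, `KNOWN_PLACES`, `COMPOSED_EQUIVALENTS`) and the concept identifier are invented; in practice the equivalences would come from curated mappings, not from code.

```python
# Invented lookup tables standing in for the thesaurus and the place vocabulary.
THESAURUS = {"persecution of jews": "concept-1"}
KNOWN_PLACES = {"france", "poland"}
# Hypothetical curated equivalences from composed headings to atomic terms.
COMPOSED_EQUIVALENTS = {("jews", "persecution"): "persecution of jews"}


def normalize(term):
    """Case-fold and collapse whitespace."""
    return " ".join(term.casefold().split())


def decompose(heading):
    """Split a composed heading such as 'Jews--Persecution--France'."""
    parts = [normalize(part) for part in heading.split("--") if part.strip()]
    places = [part for part in parts if part in KNOWN_PLACES]
    topical = tuple(part for part in parts if part not in KNOWN_PLACES)
    return topical, places


def process_subjects(headings):
    """Return (links to thesaurus concepts, places, terms needing further analysis)."""
    links, places, review = {}, set(), set()
    for heading in headings:
        topical, found_places = decompose(heading)
        places.update(found_places)
        if not topical:
            continue
        term = COMPOSED_EQUIVALENTS.get(topical, " ".join(topical))
        if term in THESAURUS:
            links[term] = THESAURUS[term]  # dict keys de-duplicate repeated terms
        else:
            review.add(term)
    return links, sorted(places), sorted(review)


links, places, review = process_subjects(
    ["Jews--Persecution--France", "Persecution of Jews", "Trade unions"]
)
print(links, places, review)
```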
  17. Ghettos and Concentration Camps
     ● We are evaluating whether to start a Wikidata project for ghettos and concentration camps
     ● Strategy:
       ● Extract information from the current thesaurus and alternative sources
         ● Encyclopedic knowledge
         ● Data from project partners
       ● Integration of all this data in the Wikidata platform
       ● Enrichment with help of the community
         ● Multilingual labels and non-controversial information
       ● Finally, the data in Wikidata and in the portal should be synchronized
  18. CONNECTING KNOWLEDGE
     ● NIOD Institute for War, Holocaust and Genocide Studies (NL)
     ● CEGESOMA Centre for Historical Research and Documentation on War and Contemporary Society (BE)
     ● Jewish Museum in Prague (CZ)
     ● Center for Holocaust Studies at the Institute for Contemporary History in Munich (DE)
     ● YAD VASHEM The Holocaust Martyrs’ and Heroes’ Remembrance Authority (IL)
     ● United States Holocaust Memorial Museum (USA)
     ● Bundesarchiv (DE)
     ● The Wiener Library Institute for the Study of the Holocaust & Genocide (UK)
     ● Holocaust Documentation Centre (SK)
     ● Polish Center for Holocaust Research (PL)
     ● The Jewish Museum of Greece (GR)
     ● Jewish Historical Institute (PL)
     ● King’s College London (UK)
     ● Ontotext AD (BG)
     ● Elie Wiesel National Institute for the Study of Holocaust in Romania (RO)
     ● DANS Data Archiving and Networked Services (NL)
     ● Shoah Memorial, Museum, Center for Contemporary Jewish Documentation (FR)
     ● ITS International Tracing Service (DE)
     ● Hungarian Jewish Archives (HU)
     ● INRIA Institute for Research in Computer Science and Automation (FR)
     ● Vilna Gaon State Jewish Museum (LT)
     ● VWI Vienna Wiesenthal Institute for Holocaust Studies (AT)
     ● Foundation Jewish Contemporary Documentation Center (IT)
  19. CONNECTING COLLECTIONS
     Integration and Retrieval of Heterogeneous Archival Metadata
     09/11/2016
