Harvesting&Metadata Enrich Project EVA 2009


Harvesting&Metadata EVA Florence 2009 Progetto Enrich

Published in: Technology

  1. 1. Harvesting & Metadata Florence, April 30th 2009
  2. 2. Harvesting & Metadata The OAI-PMH Standard Rudy Becarelli [email_address]
  3. 3. <ul><li>“The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. ...” </li></ul><ul><li>The OAI approach: </li></ul><ul><ul><li>to enable access to Web-accessible material </li></ul></ul><ul><ul><li>through interoperable repositories for metadata sharing, publishing and archiving. </li></ul></ul><ul><li>Low-barrier interoperability framework to access digital materials. </li></ul>The Open Archives Initiative Mission The OAI-PMH Standard
  4. 4. <ul><li>The OAI Protocol for Metadata Harvesting (OAI-PMH): </li></ul><ul><ul><li>Simple technical option based on the open standards HTTP and XML. </li></ul></ul><ul><ul><li>Any format of metadata </li></ul></ul><ul><ul><li>Unqualified Dublin Core is specified to provide a basic level of interoperability </li></ul></ul><ul><li>Metadata from many sources can be gathered together in one database </li></ul><ul><li>The link between metadata and the related content is not defined by the OAI protocol </li></ul><ul><li>OAI-PMH makes it possible to bring the data together in one place. In order to provide services, the harvesting approach must be combined with other mechanisms. </li></ul>The OAI-PMH Standard
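The role of unqualified Dublin Core as the common baseline can be illustrated with a minimal sketch that reads one `oai_dc` record using only the Python standard library. The sample record, its identifier and its field values are invented for illustration; the namespace URIs are the ones defined for OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH 2.0 responses and by the oai_dc format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, shaped like a <record> inside a ListRecords response.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:item-1</identifier>
    <datestamp>2009-04-30</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Sample title</dc:title>
      <dc:creator>Sample creator</dc:creator>
      <dc:type>Image</dc:type>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_dc_record(xml_text):
    """Return (identifier, {dc element name: [values]}) for one record."""
    record = ET.fromstring(xml_text)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
    fields = {}
    dc_root = record.find("oai:metadata/oai_dc:dc", NS)
    for element in dc_root:
        # Tags look like "{namespace}title"; keep only the local name.
        name = element.tag.split("}", 1)[1]
        fields.setdefault(name, []).append((element.text or "").strip())
    return identifier, fields


identifier, fields = parse_dc_record(SAMPLE_RECORD)
print(identifier, fields)
```

Every Dublin Core element is repeatable, which is why each field maps to a list of values.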
  5. 5. Resource: the object the metadata are "about". Item: component of a repository from which metadata about a resource can be disseminated; has a unique identifier. Record: metadata in a specific metadata format. Identifier: unique key for an item in a repository. Set: optional construct for grouping items in a repository. The OAI-PMH Standard
  6. 6. <ul><li>Archive: a repository for stored information. </li></ul><ul><li>Protocol: a set of rules defining communication between systems (HTTP, XML). </li></ul><ul><li>Harvesting: the gathering of metadata from a number of distributed repositories into a combined data store. </li></ul><ul><li>Data Provider: maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata (1). </li></ul><ul><li>Service Provider: issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services (1). </li></ul><ul><li>(1) OAI definition quoted from the FAQ on the OAI Web site </li></ul>The OAI-PMH Standard
  7. 7. <ul><li>Data Providers (open archives, repositories) provide free access to metadata, and may, but do not necessarily, offer free access to full texts or other resources. </li></ul><ul><li>Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata. </li></ul><ul><ul><li>no live search requests to the Data Providers; </li></ul></ul><ul><ul><li>services are based on the harvested data via OAI-PMH. </li></ul></ul><ul><ul><li>may select certain subsets from Data Providers </li></ul></ul>The OAI-PMH Standard
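The harvest-then-serve pattern described on this slide can be sketched as a loop that pulls every page of a `ListRecords` response into a local store, optionally restricted to one set. This is a sketch under stated assumptions: `fetch` is a hypothetical callable standing in for the HTTP request, and the two response pages, the identifiers and the set name `museums` are invented so that the example runs without a network.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc", set_spec=None):
    """Harvest all records into a local dict keyed by item identifier.

    fetch(params) is a hypothetical function that sends the request
    arguments to a Data Provider and returns the response XML as text.
    """
    store = {}
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # selective harvesting of one subset
    while True:
        root = ET.fromstring(fetch(params))
        for record in root.iter(OAI + "record"):
            identifier = record.findtext(f"{OAI}header/{OAI}identifier")
            store[identifier] = record
        token = (root.findtext(f".//{OAI}resumptionToken") or "").strip()
        if not token:
            break  # a missing or empty token marks the last page
        # A resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": token}
    return store


# Invented two-page response used instead of a real repository.
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>oai:example.org:1</identifier></header></record>
<record><header><identifier>oai:example.org:2</identifier></header></record>
<resumptionToken>page-2</resumptionToken></ListRecords></OAI-PMH>"""
PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>oai:example.org:3</identifier></header></record>
<resumptionToken/></ListRecords></OAI-PMH>"""


def fake_fetch(params):
    return PAGE_2 if params.get("resumptionToken") == "page-2" else PAGE_1


local_store = harvest(fake_fetch, set_spec="museums")
print(sorted(local_store))
```

Once the loop finishes, searches run against `local_store` only, so no live request reaches the Data Provider at query time.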
  8. 8. <ul><li>Multiple Service Providers can harvest from multiple Data Providers. </li></ul><ul><li>Aggregators can sit between Data Providers and Service Providers. </li></ul>The OAI-PMH Standard
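  9. 9. <ul><li>Based on HTTP. </li></ul><ul><li>Request arguments are issued as GET or POST parameters. </li></ul><ul><li>Six verbs (request types): Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord. </li></ul><ul><li>Responses are encoded in XML syntax. </li></ul><ul><li>Protocol errors are reported as XML error elements in the response; HTTP status codes signal transport-level conditions. </li></ul><ul><li>Sets (optional) allow selective harvesting. </li></ul><ul><li>OAI-PMH supports flow control. </li></ul>The OAI-PMH Standard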
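The request side of the protocol can be sketched in a few lines: a verb plus its arguments becomes an HTTP GET query string, and an OAI-PMH 2.0 response that carries a protocol error can be inspected for its `error` elements. The base URL `http://repository.example.org/oai`, the set name and the sample error response are assumptions made for illustration; the six verbs and the `badVerb` error code come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# The six request types defined by OAI-PMH 2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request_url(base_url, verb, **arguments):
    """Encode a verb and its arguments as an HTTP GET request URL."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(query)


def read_errors(response_xml):
    """Return (code, message) pairs for the error elements of a response."""
    root = ET.fromstring(response_xml)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.iter(OAI_NS + "error")]


# Invented response, shaped like a reply to a request with an unknown verb.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

url = build_request_url(
    "http://repository.example.org/oai",  # hypothetical base URL
    "ListRecords",
    metadataPrefix="oai_dc",
    set="museums",  # hypothetical set name
)
print(url)
print(read_errors(SAMPLE_ERROR))
```

The same arguments could be sent as POST parameters instead; only the transport changes, not the argument names.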
  10. 10. The OAI-PMH Standard
  11. 11. Harvesting & Metadata CulturaItalia experience Fabio Lanzi [email_address]
  12. 12. <ul><li>An Italian experience: building an OAI-PMH Data Provider for CulturaItalia (www.culturaitalia.it) </li></ul><ul><li>This Data Provider is conceived as a repository for metadata about works of art in Tuscany. </li></ul><ul><li>The mission of CulturaItalia: </li></ul><ul><ul><li>to promote Italian culture and heritage in Italy and abroad, </li></ul></ul><ul><ul><li>to promote and integrate existing resources. </li></ul></ul>CulturaItalia experience
  13. 13. <ul><li>CulturaItalia is a descriptive catalogue that indexes metadata and redirects to resources. </li></ul><ul><li>Resources remain distributed and under the management of their owners. </li></ul><ul><li>Each institution can decide which data will be harvested by the Portal. </li></ul>CulturaItalia experience
  14. 14. <ul><li>Standards </li></ul><ul><li>CulturaItalia is based on international standards: </li></ul><ul><ul><li>OAI-PMH </li></ul></ul><ul><ul><li>DCMI </li></ul></ul><ul><ul><li>HTTP </li></ul></ul><ul><ul><li>XML </li></ul></ul><ul><ul><li>XHTML </li></ul></ul>CulturaItalia experience
  15. 15. <ul><li>Metadata Schema: PICO DC Application Profile </li></ul><ul><li>Designed for CulturaItalia by Irene Buonazia, M. E. Masci, Davide Merlitti et al. (Scuola Normale Superiore - Pisa) </li></ul><ul><li>Dublin Core has been adopted as the metadata standard </li></ul><ul><li>A DC Application Profile has been developed according to DCMI recommendations for this specific application and domain </li></ul>CulturaItalia experience
  16. 16. <ul><li>Metadata Schema: PICO DC Application Profile </li></ul><ul><li>The PICO DC Application Profile combines in one metadata schema: </li></ul><ul><ul><li>All DC Elements; </li></ul></ul><ul><ul><li>All DC Element Refinements and Encoding Schemes from Qualified DC; </li></ul></ul><ul><ul><li>Other Qualifiers (refinements and encoding schemes) specifically conceived for the CulturaItalia domain. </li></ul></ul><ul><li>Namespaces included in this metadata schema: </li></ul><ul><ul><li>dc: </li></ul></ul><ul><ul><li>dcterms: </li></ul></ul><ul><ul><li>pico: </li></ul></ul>CulturaItalia experience
  17. 17. <ul><li>PICO AP Added Qualifiers – Element Refinements </li></ul><ul><li>Element: added Element Refinements </li></ul><ul><li>CREATOR: author, commissioner </li></ul><ul><li>DESCRIPTION: information, contact, service </li></ul><ul><li>PUBLISHER: distributor, printer </li></ul><ul><li>CONTRIBUTOR: editor, performer, responsible, producer, translator </li></ul><ul><li>FORMAT: material and technique </li></ul><ul><li>RELATION: promotes / is promoted by, manages / is managed by, is owner of / is owned by, produces / is produced by, performs / is performed by, is responsible for / has as responsible, contributes to / has as contributor, digitizes / is digitized by </li></ul><ul><li>COVERAGE: place of birth, place of death, date of birth, date of death </li></ul>CulturaItalia experience
  18. 18. <ul><li>PICO AP - Extensions to DCMI Type Vocabulary </li></ul><ul><li>The element DCType, with its controlled vocabulary (DCMI Type Vocabulary), can describe most of the resources to be managed within CulturaItalia. </li></ul><ul><li>The PICO Type Vocabulary adds three more resource types. </li></ul>dcmitype:Collection, dcmitype:Dataset, dcmitype:Event, dcmitype:Image, dcmitype:MovingImage, dcmitype:StillImage, dcmitype:PhysicalObject, dcmitype:InteractiveResource, dcmitype:Service, dcmitype:Software, dcmitype:Sound, dcmitype:Text; picotype:Institution, picotype:PhysicalPerson, picotype:Project. CulturaItalia experience
  19. 19. <ul><li>PICO AP – Further Extensions </li></ul><ul><li>PICO AP can be further extended: </li></ul><ul><ul><li>By adding new encoding schemes: they must be defined and published as XSD schemas, </li></ul></ul><ul><ul><li>By using DCSV (Dublin Core Structured Values), defined in: </li></ul></ul><ul><ul><ul><li>Simon Cox, Renato Iannella </li></ul></ul></ul><ul><ul><ul><li>DCMI DCSV: A syntax for writing a list of labelled values in a text string, 2000-07-28 </li></ul></ul></ul><ul><ul><ul><li>http://es.dublincore.org/documents/dcmi-dcsv/ </li></ul></ul></ul>CulturaItalia experience
  20. 20. CulturaItalia experience [Architecture diagram; labels in extraction order: SIL “Museum”, NAL “In”, NAL “Out”, Web Service, Web Service, CulturaItalia, Database, Tuscany Repository, JDBC, OAI-PMH, CART, Adapter, OAICat]
  21. 21. <ul><li>Publishing process </li></ul><ul><li>Building the envelope: the elements </li></ul><ul><ul><li>Typology </li></ul></ul><ul><ul><li>Publisher </li></ul></ul><ul><ul><li>Local identifier </li></ul></ul><ul><ul><li>Set </li></ul></ul><ul><ul><li>Metadata </li></ul></ul><ul><li>Building the envelope: serialization </li></ul>OAC MUSEUM oac_09_00000001_0 OAC_COMUNE_FIRENZE CulturaItalia experience
  22. 22. [Diagram labels: CART, NAL “Out”, Adapter, CART WS, Tuscany Repository] <ul><li>Software on NAL “Out” sends: </li></ul><ul><ul><li>records to the Data Provider </li></ul></ul><ul><ul><li>return receipts to publishers </li></ul></ul>CulturaItalia experience
  23. 23. <ul><li>Publishing process </li></ul><ul><li>Crosswalk from the original profile to PICO </li></ul><ul><li>Storage in the database </li></ul>CulturaItalia experience [Diagram labels: NAL “Out”, Web Service, Tuscany Repository, JDBC, Adapter, Database]
  24. 24. <ul><li>Transformer </li></ul><ul><li>Based on the XSLT 2.0 language </li></ul><ul><li>Different profiles: </li></ul><ul><ul><li>OA, OAC (ICCD) </li></ul></ul><ul><ul><li>MFN (Fondazione Memofonte / Museo del Bargello - Firenze) </li></ul></ul><ul><ul><li>GIOMM (Museo Marino Marini – Pistoia) </li></ul></ul><ul><li>Character encoding: </li></ul><ul><ul><li>OAI-PMH: UTF-8 </li></ul></ul>CulturaItalia experience
  25. 25. <ul><li>Predefined Entity References NOT ALLOWED! </li></ul><ul><li>Numeric Character References ALLOWED! </li></ul><ul><li>Example: </li></ul><ul><ul><li>[...] si rimanda al volume &quot;Manzù&quot;, 1988 [...] </li></ul></ul><ul><li>Some of the more than 300 characters handled this way: </li></ul><ul><li>ê, ½, <, >, &, «, », £, °, `, ´, “, ” </li></ul>CulturaItalia experience [...] si rimanda al volume "Manzù", 1988 [...]
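One way to apply the rule on this slide is to rewrite every entity reference and every non-ASCII or markup-significant character as a numeric character reference. The sketch below is an assumption about how such a step could look in Python, not the project's actual transformer (which the slides describe as XSLT-based); the function name is hypothetical.

```python
import html

# Characters that are significant in XML markup and so get a numeric reference too.
MARKUP_CHARS = '<>&"\''


def to_numeric_references(text):
    """Replace entity references and special characters with &#N; references."""
    # First resolve any named or numeric references already present,
    # so that the input '&quot;' and the input '"' are treated alike.
    decoded = html.unescape(text)
    out = []
    for ch in decoded:
        if ord(ch) > 127 or ch in MARKUP_CHARS:
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)


print(to_numeric_references('[...] si rimanda al volume &quot;Manzù&quot;, 1988 [...]'))
```

Because existing references are decoded first, running the function twice gives the same result as running it once.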
  26. 26. <AU> <AUT> <AUTN>Manzù Giacomo</AUTN> <AUTA>1908/1991</AUTA> </AUT> <EDT> <EDTN>Della Ragione Alberto</EDTN> </EDT> </AU> is mapped to <pico:author xsi:type="iccd:AUT">AUTN=Manzù Giacomo; AUTA=1908/1991</pico:author> <dc:publisher xsi:type="oac:EDT">EDTN=Della Ragione Alberto</dc:publisher> Ref: Mapping PICO – ICCD, http://www.iccd.beniculturali.it/Catalogazione/standard-catalografici/metadati CulturaItalia experience
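The mapping on this slide flattens the child fields of an ICCD element into one DCSV string of `label=value` pairs separated by semicolons. A minimal sketch of that serialization step follows; the function name is hypothetical, and the escaping is simplified to the two separator characters, so the DCSV document should be consulted for the full rules.

```python
import xml.etree.ElementTree as ET


def to_dcsv(element):
    """Serialize the child elements of an XML element as a DCSV string."""
    parts = []
    for child in element:
        value = (child.text or "").strip()
        # Simplified escaping (assumption): protect the DCSV separators
        # ';' and '=' when they occur inside a value.
        value = value.replace(";", "\\;").replace("=", "\\=")
        parts.append(f"{child.tag}={value}")
    return "; ".join(parts)


# Source fragment taken from the slide.
aut = ET.fromstring(
    "<AUT><AUTN>Manzù Giacomo</AUTN><AUTA>1908/1991</AUTA></AUT>"
)
print(to_dcsv(aut))
```

The resulting string is what would be placed inside the `pico:author` element shown on the slide.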
  27. 27. <ul><li>DATA PROVIDER </li></ul><ul><li>Open source software: </li></ul><ul><ul><li>OAICat </li></ul></ul><ul><ul><li>Apache Axis </li></ul></ul><ul><ul><li>Apache Tomcat </li></ul></ul><ul><ul><li>MySQL </li></ul></ul><ul><li>Customization: </li></ul><ul><ul><li>Use of a Tomcat DataSource </li></ul></ul><ul><ul><li>JDBC2Pico crosswalk </li></ul></ul><ul><li>SERVICE PROVIDER </li></ul><ul><li>CulturaItalia harvested more than 14,000 records </li></ul>CulturaItalia experience OAICat PICO harvester Database Tomcat JDBC OAI-PMH
  28. 28. Harvesting & Metadata Enrich experience Paolo Mazzanti [email_address]
  29. 29. Enrich experience <ul><li>A European experience: the ENRICH Project http://enrich.manuscriptorium.com/ </li></ul><ul><li>ENRICH Project goal: to create seamless access to information about the vast collections of manuscripts and incunabula distributed across major European libraries </li></ul><ul><li>Italian partners: MICC (Media Integration and Communication Center) and BNCF (the National Central Library of Florence) </li></ul>
  30. 30. <ul><li>ENRICH Project: </li></ul><ul><ul><li>Based on the MANUSCRIPTORIUM Digital Library http://www.manuscriptorium.eu </li></ul></ul><ul><ul><li>(National Library of the Czech Republic, AIP-Beroun Ltd) </li></ul></ul>Enrich experience <ul><li>ENRICH Conceptual Model: </li></ul><ul><li>OAI-PMH </li></ul><ul><li>XML </li></ul><ul><li>TEI </li></ul>
  31. 31. <ul><li>“Report on the Development and Validation of Migration Tools”, 28 February 2009, http://enrich.manuscriptorium.com/files/ENRICH_WP3_D3_3_Migration_Tools_01.pdf : migration routes for a number of different data formats to the ENRICH specification. </li></ul>Enrich experience <ul><li>Recommendations for Migration Routes: </li></ul><ul><ul><li>mature, open source, cross-platform technologies; </li></ul></ul><ul><ul><li>human-readable, text-based scripting languages. </li></ul></ul>
  32. 32. Enrich experience <ul><ul><li>Migration of the metadata to the ENRICH specification: </li></ul></ul><ul><ul><li>The metadata format transformation can be performed by either the Service Provider or the Data Provider, depending on the XSLT skills of the Data Provider; </li></ul></ul><ul><ul><li>The project offers a tool, named M-Tool, that guides the Data Provider in mapping its proprietary fields to the TEI P5 ones. </li></ul></ul>
  33. 33. <ul><li>Data Format: </li></ul>Enrich experience <ul><li>MANUSCRIPTORIUM </li></ul><ul><ul><li>Based on the MASTER (Manuscript Access through Standards for Electronic Records) XML data format (an extension to the TEI P4 Guidelines) </li></ul></ul><ul><ul><li>MASTER Reference Manual (available at http://www.teic.org.uk/Master/Reference/oldindex.html) </li></ul></ul><ul><ul><li>The MASTER data format was updated, modified, and eventually incorporated as a module into the Text Encoding Initiative (TEI) P5 Guidelines </li></ul></ul><ul><li>ENRICH </li></ul><ul><ul><li>Based on TEI P5 (ratified by the TEI Technical Council) </li></ul></ul><ul><li>MASTER to ENRICH transformation </li></ul><ul><ul><li>XSL (released under a Creative Commons Attribution license) </li></ul></ul>
  34. 34. <ul><li>TEI P5: http://www.tei-c.org/Guidelines/P5/ </li></ul><ul><ul><li>over 1300 pages </li></ul></ul><ul><ul><li>23 chapters </li></ul></ul><ul><ul><li>over 500 XML elements </li></ul></ul><ul><li>The ENRICH format specification is based on the chapters for: </li></ul><ul><ul><li>Manuscript description </li></ul></ul><ul><ul><li>Digital images </li></ul></ul><ul><ul><li>Non-Unicode characters </li></ul></ul><ul><ul><li>Paleographic or transcriptional data </li></ul></ul>Enrich experience
  35. 35. Enrich experience The ENRICH TEI P5 schema covers three distinct aspects of a digitized manuscript: <ul><ul><li>metadata describing the original source manuscript; </li></ul></ul><ul><ul><li>metadata describing digitized images of the original source manuscript; </li></ul></ul><ul><ul><li>a transcription of the text contained in the original source manuscript (not required in Manuscriptorium). </li></ul></ul>
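  36. 36. Contents of the Biblioteca Nazionale Centrale di Firenze (BNCF) planned for aggregation via OAI-PMH, by set: <ul><li>Set 1, Manoscritti in rete: 33 documents, 3865 images </li></ul><ul><li>Set 2, Bibliotheca Universalis II: 183 documents, 63980 images </li></ul><ul><li>Set 3, Carte Geografiche II: 137 documents, 233 images </li></ul><ul><li>Set 4, Bibliotheca Universalis I: 377 documents, 159381 images </li></ul><ul><li>Set 5, Carte Geografiche I: 810 documents, 3765 images </li></ul><ul><li>Set 6, Magliabechi: 52096 documents, 211618 images </li></ul><ul><li>Set 7, Galileo Galilei manuscripts: 307 documents, 98650 images </li></ul><ul><li>Set 8, Galileo Galilei printed books: 183 documents, 58387 images </li></ul>Enrich experience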
  37. 37. Enrich experience The goal: <ul><li>to aggregate the content while keeping the aggregated information as unconstrained as possible </li></ul><ul><li>to harvest the original primary metadata contents </li></ul><ul><li>The Italian case of the BNCF: MARCXML (slim) records (historical metadata) and MAG records (structural metadata) </li></ul>
  38. 38. Enrich experience Example of a BNCF record in the MAG profile
  39. 39. <ul><li>Two harvests: one for MAG and the other for MARCslim. </li></ul><ul><li>Matching records from the two harvests are paired, and both input files are processed automatically to produce a single XML record that uses the TEI P5 ENRICH schema. </li></ul><ul><li>This TEI record is further processed in the Manuscriptorium platform (which will be TEI P5 based) for searching and presentation (via the end-user interface or the OAI-PMH interface). </li></ul><ul><li>Migration to TEI P5 in progress… </li></ul>Enrich experience
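The pairing step on this slide can be sketched as a join of the two harvests on a shared key, producing one TEI-namespaced element per matched document. Everything about the inputs is an assumption: the slides do not say how MAG and MARC records are matched, so the sketch uses plain dictionaries keyed by a hypothetical common identifier, with the MARC side reduced to a title and the MAG side reduced to a list of image URLs. The output borrows element names from TEI P5 but is heavily simplified and is not valid against the ENRICH schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def merge_records(marc_by_id, mag_by_id):
    """Build one simplified TEI element per identifier present in both harvests."""
    merged = {}
    for key in sorted(marc_by_id.keys() & mag_by_id.keys()):
        root = ET.Element(tei("TEI"))
        # Descriptive metadata, taken from the (simplified) MARC record.
        header = ET.SubElement(root, tei("teiHeader"))
        file_desc = ET.SubElement(header, tei("fileDesc"))
        title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
        ET.SubElement(title_stmt, tei("title")).text = marc_by_id[key]["title"]
        # Image references, taken from the (simplified) MAG record.
        facsimile = ET.SubElement(root, tei("facsimile"))
        for image_url in mag_by_id[key]:
            ET.SubElement(facsimile, tei("graphic"), {"url": image_url})
        merged[key] = root
    return merged


# Invented sample data standing in for the two harvests.
marc = {"doc-1": {"title": "Sample manuscript"}, "doc-2": {"title": "Unmatched"}}
mag = {"doc-1": ["img/0001.jpg", "img/0002.jpg"]}

merged = merge_records(marc, mag)
print(ET.tostring(merged["doc-1"], encoding="unicode"))
```

Records found in only one harvest (`doc-2` here) are skipped; a real pipeline would more likely log them for review.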
  40. 40. Enrich experience Example ONLINE Mag record BNCF
  41. 41. Enrich experience Example ONLINE Mag record BNCF
  42. 42. Enrich experience Enrich Harvesting Information: AIP Beroun, Beroun, Czech Republic; Tomas Psohlavec, tomas.psohlavec@aipberoun.cz; http://www.aipberoun.cz Enrich Metadata Information: Oxford University Computing Services, Oxford, United Kingdom; James Cummings, [email_address]; Sebastian Rahtz, sebastian.rahtz@oucs.ox.ac.uk; http://www.oucs.ox.ac.uk/
  43. 43. Thank you! Rudy Becarelli, Fabio Lanzi, Paolo Mazzanti. MICC - LCI Lab. - Media Integration and Communication Center, Viale Morgagni 65, 50134 Florence (Italy), Tel. +39.055.4237404, http://lci.micc.unifi.it