Europeana Newspapers - Metadata. Ankara, 3rd May 2013. Günter Mühlberger, Innsbruck University
Metadata

  1. Europeana Newspapers - Metadata. Ankara, 3rd May 2013. Günter Mühlberger, Innsbruck University
  2. Agenda
     This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community (http://ec.europa.eu/ict_psp).
     • Introduction
     • General considerations on metadata
     • Metadata and newspaper digitisation
     • EU Newspaper Project - Profile
  3. Introduction
     • Innsbruck University
     • Digitisation and Digital Preservation Group
         • Involved in digital library projects since 1995
         • Coordinated several EU R&D projects, currently 8 FTEs
         • Introduced ALTO (Analyzed Layout and Text Object) to the libraries community in 2002
         • Fostered Optical Character Recognition for blackletter fonts (Gothic, Fraktur) in 2004
         • Initiated and coordinated the E-Books on Demand Network (EOD)
         • Member of the Executive Board of the IMPACT Project 2008-2012 (large-scale project for mass digitisation and text recognition)
         • Development of a rule-based document understanding platform
  4. Introduction
     • Digitisation and full-text recognition
         • Several projects since 1995, all with OCR processing
         • Newspaper clippings (650,000 clippings)
         • Index cards from libraries (31 catalogues, several million cards)
         • German dissertations (215,000 dissertations, 24 million pages)
     • Currently
         • Three EU projects, among them partner in the FP7 tranScriptorium project (Handwritten Text Recognition)
         • OCR processing of 8 million newspaper pages for EU Newspaper
         • Digitisation of the regional newspaper from Tyrol/Austria
  5. General considerations
     • Definition
         • "Data about data"
     • Example 1
         • Data: I am taking part in this event, which is a workshop of the EU Newspaper project in Ankara. The event is currently going on.
         • Metadata: On 3rd May, Günter Mühlberger took part in the EU Newspaper Workshop in Ankara.
     • Example 2
         • Data: We are digitising a newspaper. We cut the binding, use a document scanner and produce digital image files.
         • Metadata: A Kodak i620 scanner with 24-bit colour and an automatic document feeder provides JPEG files with low compression (90% of the information from the original file is kept).
  6. General considerations
     • Example 3
         • Data: A service provider delivers the scanned images of a newspaper to the library on a hard disc.
         • Metadata: A Windows file system in which the root directory carries the identity number of the newspaper, with subdirectories per year and further subdirectories at issue/day level. An XML file with metadata on the files is expected at root level.
     • Observations
         • With metadata we introduce a "new view" on data
         • Metadata are more like a summary or a table of contents of the data than "new" data
         • Often they are implicit, and people will say "This is clear to us anyway"
         • Metadata need conventions and agreements
         • We can produce metadata on metadata
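The delivery convention described in Example 3 lends itself to an automatic check. The following Python sketch walks a delivered directory and reports deviations. The folder-name patterns (four-digit years, `MM-DD` issue folders) are assumptions made for illustration; the slides only say that years and issues/days form the subdirectory levels and that an XML file sits at root level.

```python
import re
from pathlib import Path

YEAR_RE = re.compile(r"^\d{4}$")         # assumed: year folders are four digits
ISSUE_RE = re.compile(r"^\d{2}-\d{2}$")  # assumed: issue folders are named MM-DD


def check_delivery(root: Path) -> list[str]:
    """Return a list of problems found in a delivered newspaper directory."""
    if not root.is_dir():
        return [f"root directory missing: {root}"]

    problems = []
    # An XML metadata file is expected directly at root level.
    if not any(p.is_file() and p.suffix.lower() == ".xml" for p in root.iterdir()):
        problems.append("no XML metadata file at root level")

    year_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    if not year_dirs:
        problems.append("no year subdirectories")

    for year in year_dirs:
        if not YEAR_RE.match(year.name):
            problems.append(f"unexpected year folder name: {year.name}")
            continue
        issues = sorted(p for p in year.iterdir() if p.is_dir())
        if not issues:
            problems.append(f"year {year.name} has no issue folders")
        for issue in issues:
            if not ISSUE_RE.match(issue.name):
                problems.append(
                    f"unexpected issue folder name: {year.name}/{issue.name}"
                )
    return problems
```

Calling `check_delivery(Path("D:/delivery/ID123"))` (a hypothetical path) returns an empty list when the layout matches the convention, which makes the otherwise implicit agreement between library and service provider explicit and testable.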
  7. Analogue vs. digital library world
     • Can books survive without libraries (or, more generally, without an organisation that takes care of them)?
     • Can books survive without index cards?
         • YES!
     • Why?
         • Most often they contain their own library card: the title page
         • They can be read and understood by human beings
         • Their physical condition is rather stable as long as they are stored in a dry environment and no disaster takes place (fire, water, ...)
  8. Analogue vs. digital
     • Can digital works survive without libraries (or organisations taking care of them)?
     • Can digital works survive without index cards?
         • NO!
     • Why?
         • Digital data need a technical system to keep them alive. If the technical system is gone, the digital data are gone as well. For example, a world without electricity for 20 years would suffer heavy data loss.
         • Digital data cannot be read by human beings: we need a device to make them visible to us.
  9. Some other differences
     • Is a machine able to read a book?
         • NO!
     • Is a machine able to read a digital book?
         • YES and NO (only first attempts)
         • YES, it is able to process the document automatically: to extract the content, index it, print it, publish it on the Internet, etc.
         • NO, not in the sense of a human being who understands the content of a book. But machines can already recognise a lot of the content, e.g. person names, institutions, geographical names, etc.
  10. Observations
     • Metadata are data about data.
     • This game can be played several times.
     • Metadata structure unstructured information.
     • Metadata are helpful if they are stored close to the data.
     • Metadata are especially important for digital data, since digital data are invisible to human beings.
     • Digital metadata can be understood by machines; analogue metadata cannot.
  11. Consequences
     • Good metadata must
         • record data that are helpful for the two main tasks of libraries: preservation and access
         • structure data in a meaningful way
         • be readable by human beings
         • be readable by machines
         • be acknowledged and maintained by the community
         • be available in explicit form with explanations, examples and guidelines
  12. Metadata in a (newspaper) digitisation project
     • Analogue material
     • Digitisation process
     • Text recognition process
     • Structural enhancement
     • File naming and structuring
     • Ownership and intellectual property rights
     • Digital provenance
     • Intellectual substance
  13. Analogue newspaper
     • "Typical metadata"
         • Start and (maybe) end date of a newspaper
         • Place of publication
         • Titles and variations
         • Publishers and editors
         • Frequency of publication
         • Language
         • Material aspects, such as the size of the paper
     • Does a digitisation project need to recapture all this information?
     • The library catalogue is the authoritative source for this information, so a link (identifier) is the obvious solution
  14. Analogue newspaper
     • Most of these data are held in library catalogues, but a lot is also missing:
         • A complete directory of all newspaper issues, special editions (e.g. for historical events) and supplements is very rarely available
         • Missing issues or missing pages are also very rarely recorded
         • Managers of newspaper digitisation projects regularly find that their work produces the first complete edition (gaps included)
     • The natural structure is therefore the issue, and this is one of the most important pieces of metadata within EU Newspaper
         • We expect all files to be structured according to issues/days (not volumes or years)
  15. Digitisation process
     • A lot of artefacts: the image captured by the scanning system is quite different from the image finally stored, due to internal processes, enhancement software, deskewing, cropping, etc.
     • Which metadata need to be kept?
     • Type of scanner used?
         • For a whole run, or for single pages?
         • Cameras now often replace scanners. They typically provide a lot of metadata (EXIF), but resolution is a problem (the distance must be known!)
     • Microfilm
         • Cameras!
  16. Optical Character Recognition (OCR)
     • Has a long history, but many libraries were (and often still are) sceptical
     • OCR data are produced automatically and will contain errors
     • The error rate depends on the type of printing, the age of the newspaper and the way it has been scanned, but also on the software used, its version and its parameters
     • OCR engines provide not only text, but also information on the layout of a page, e.g. coordinates of words
     • OCR data may be corrected in parts, e.g. the title of an article but not the body of the full text
     • Crowdsourcing may also play a role
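To make "error rate" concrete: one common way to quantify it is the character error rate (CER), the edit distance between the OCR output and a manually verified transcription, divided by the length of that transcription. The slides do not say which measure the project uses, so the sketch below is only an illustration of that general technique in Python.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance relative to the length of the verified transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)
```

Storing such a figure together with the software name, version and parameters mentioned above lets a later reader judge how far a given batch of OCR text can be trusted.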
  17. Structural enhancement
     • The physical units of a newspaper are the issue and the single page.
     • But for the reader the natural unit is the article, i.e. a piece of content.
     • Articles may consist of titles, subtitles, leads, associated photos, captions, etc.
     • Apart from articles we also find announcements, advertisements, charts, weather reports, tables with stock quotes, etc.
     • Structural enhancement can be done fully automatically (like OCR), but in most cases it is a manual, outsourced process
     • It is time-consuming and expensive, so it is especially important to know what has been done and with which accuracy
  18. Intellectual property rights
     • Several groups/persons have IP rights on newspapers
     • The publisher/editor
         • Responsible for the whole newspaper; distribution rights usually stay with the publisher
     • Journalists as permanent staff of the newspaper
         • Copyright stays with the journalists, but access rights usually belong to the newspaper publisher
     • Freelancers
         • Especially photographers; IP rights stay with them
  19. IP Rights
     • Problem of "orphan works"
         • Digitising a newspaper and making it available is a new kind of usage and is not covered by "old" contracts
         • This IP right therefore stays in principle with the copyright owners
         • But it is impossible to identify all IP right holders of an old newspaper
     • EU Directive on Orphan Works
         • Puts libraries in a privileged position: they will be allowed to digitise orphan works under special conditions
         • They are required to make a diligent search
         • To document this search and to register the item in question
         • To remunerate right holders who come forward later
     • New kinds of metadata are needed as well
  20. Digital provenance
     • What happens to the digital data as time goes by?
     • E.g. migration from one image format to another?
     • Or an update of the OCR data?
     • Or further structural enhancement with the support of users (the crowd)?
     • Data need to stay coherent, and changes should be transparent
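One way to keep such changes transparent is to log every change as an event that names what was done and fingerprints the data before and after. The Python sketch below illustrates the idea; the class and field names are hypothetical and are not taken from any metadata standard or from the project.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """Hypothetical record of one change to a digital object."""
    event_type: str      # e.g. "migration", "ocr-update", "crowd-correction"
    event_datetime: str  # ISO 8601 timestamp
    detail: str          # free-text description of what was done
    source_sha256: str   # checksum of the data before the change
    result_sha256: str   # checksum of the data after the change


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_event(event_type: str, detail: str, source: bytes, result: bytes,
                 when: Optional[datetime] = None) -> ProvenanceEvent:
    """Build an event record; defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return ProvenanceEvent(event_type, when.isoformat(), detail,
                           sha256_of(source), sha256_of(result))
```

A chain of such records, where each event's `source_sha256` equals the previous event's `result_sha256`, lets a later reader verify that no undocumented change happened in between.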
  21. Intellectual substance
     • "The Times", "Le Monde", "Frankfurter Allgemeine Zeitung", "Washington Post", etc.
     • Publishers, well-known journalists, famous articles and headlines, history of journalism, etc.
     • Political attitude
     • Layout, structuring and general appearance
     • The objective must be a holistic understanding of a newspaper that takes into account all the aspects mentioned before.
     • Data mining technology
  22. Main challenge: Putting it all together
     [Diagram: the metadata areas to be combined are intellectual substance, digital provenance, analogue newspaper, enhancement, IP rights, digitisation process, file naming, and OCR and enhancement, grouped around "???"]
  23. EU Newspaper ENMAP
     • Europeana Newspaper METS ALTO Profile
     • Objective
         • Provide a robust metadata model for the digitisation of newspapers that can be used by libraries for preservation, access and interoperability (delivery of data to Europeana)
     • Roadmap
         • Set up an internal format until M12
         • Implement it within the project and deliver information packages to Europeana according to this format
         • Extend the internal format and make it more general so that it can easily be used outside the project as well (until M18/July 2013)
  24. ENMAP: Main approach
     • METS (Metadata Encoding and Transmission Standard)
         • Library of Congress
         • Open format: an editorial board maintains the format
         • Container format: provides a frame for all kinds of metadata
         • Goes back to Making of America II (a digitisation project of the late 1990s)
         • XML format (readable by machines as well as by human beings with a simple text editor)
         • THE dominant format in the library world for digitisation projects
         • Within OAIS (Open Archival Information System) it serves as an Information Package (AIP, SIP, DIP)
  25. METS
     • Cons of METS
         • The standardisation level is rather low: profiles are needed to specify the actual usage of the format
         • It introduces extra complexity to digitisation projects
         • The role of METS within the digital preservation process is not always clear: is it just for delivery of data (e.g. as Submission Information Package) or for "real" preservation, e.g. as Archival Information Package (AIP)?
     • My personal opinion
         • A rich METS file as AIP together with all content data (images, OCR files, etc.) on a storage server are good prerequisites for digital preservation
  26. ENMAP
     • Usage of METS within EU Newspaper
         • We use METS as a submission format (SIP) for delivering data from the libraries and the enhancement processes (OCR, structural enhancement) to Europeana
         • But we will also provide, as a concept, a format that may serve as an Archival Information Package
     • Descriptive metadata
         • Are kept in MODS (Metadata Object Description Schema)
         • Dublin Core would have been an alternative, but MODS is richer and has a relationship to MARC21
     • Technical data
         • MIX (Metadata for Images in XML Standard)
         • Extended format
  27. ENMAP
     • OCR data
         • ALTO (Analyzed Layout and Text Object)
         • Are kept "outside" METS in XML files (one file per image) but are linked from it
         • ALTO stores not only text but also information from the OCR engine, such as coordinates of blocks, lines and words, as well as the type of block, e.g. text or picture
         • The connection between image and text is important, e.g. for producing PDFs, for highlighting search results on an image or for further enhancement
         • One of the main achievements of the IMPACT project was to convince industry to provide native ALTO export, e.g. in ABBYY FineReader
         • ALTO is therefore rather simple to produce, and the effect on standardisation should be rather high
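The link between text and image can be illustrated with a few lines of Python. In ALTO, each recognised word is a `String` element whose `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes give the text and its bounding box on the page. The sample document below is invented for illustration (it is not project data), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented miniature ALTO file, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="60">
      <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="60">
        <String CONTENT="Innsbrucker" HPOS="100" VPOS="200" WIDTH="420" HEIGHT="60"/>
        <SP WIDTH="20"/>
        <String CONTENT="Nachrichten" HPOS="540" VPOS="200" WIDTH="360" HEIGHT="60"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(alto_xml: str):
    """Return (text, (hpos, vpos, width, height)) for every word on the page."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            box = tuple(float(el.get(key, "0"))
                        for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            words.append((el.get("CONTENT", ""), box))
    return words


def find_hits(alto_xml: str, query: str):
    """Bounding boxes of all words containing the query (case-insensitive)."""
    q = query.casefold()
    return [box for text, box in words_with_boxes(alto_xml)
            if q in text.casefold()]
```

A viewer could draw the rectangles returned by `find_hits(SAMPLE_ALTO, "nachrichten")` over the page image to highlight a search result, which is the use case named in the slide.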
  28. ENMAP
     • Structural enhancement
         • METS provides a "structural map" that makes it possible to manage single articles, which may come from an enhancement process
         • E.g. titles of articles and the reading order of sections and pictures can be recorded within METS and then linked to the image via the ALTO files
         • In this way it is possible to index or display a single article
     • Our ambition
         • To contribute to the standardisation of structural data by providing a data dictionary for structural enhancement
         • We believe that a clear structure will especially support full-text searching and further text mining
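As a rough illustration of how a structural map ties articles to ALTO blocks, the Python sketch below reads a METS `structMap` and lists each article with the ALTO file and block IDs it points to. `structMap`, `div`, `fptr` and `area` (with `FILEID`, `BEGIN`, `BETYPE`) are standard METS constructs; the `TYPE` values, labels and IDs in the sample are invented, since the slides do not give ENMAP's actual vocabulary.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Invented sample; TYPE values and IDs are assumptions, not the ENMAP profile.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="ISSUE" LABEL="1913-05-03">
      <mets:div TYPE="ARTICLE" LABEL="Aus dem Landtag" ORDER="1">
        <mets:fptr><mets:area FILEID="ALTO0001" BETYPE="IDREF" BEGIN="TB1"/></mets:fptr>
        <mets:fptr><mets:area FILEID="ALTO0001" BETYPE="IDREF" BEGIN="TB2"/></mets:fptr>
      </mets:div>
      <mets:div TYPE="ADVERTISEMENT" LABEL="Anzeige" ORDER="2">
        <mets:fptr><mets:area FILEID="ALTO0002" BETYPE="IDREF" BEGIN="TB7"/></mets:fptr>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def list_units(mets_xml: str, unit_type: str = "ARTICLE"):
    """Return structural units of one type, in reading order."""
    root = ET.fromstring(mets_xml)
    units = []
    for div in root.iterfind(".//mets:structMap//mets:div", NS):
        if div.get("TYPE") != unit_type:
            continue
        # Each area points at one block (by ID) inside one ALTO file.
        blocks = [(area.get("FILEID"), area.get("BEGIN"))
                  for area in div.iterfind("mets:fptr/mets:area", NS)]
        units.append({"label": div.get("LABEL"),
                      "order": int(div.get("ORDER", "0")),
                      "blocks": blocks})
    return sorted(units, key=lambda unit: unit["order"])
```

With the block IDs in hand, an indexer can pull just those text blocks from the referenced ALTO file and treat them as one article, which is what makes article-level search and display possible.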
  29. ENMAP
     • IP rights
         • Currently there are no standards available
         • Within ENMAP we state only the library's ownership of the digital files, which must not be confused with the actual IP rights
     • Digital provenance
         • PREMIS (Preservation Metadata: Implementation Strategies)
         • An attempt to provide a general framework for "events" within the life cycle of digital objects
     • Intellectual content
         • Tagging of named entities is a first step into data mining
         • No standards are currently available
  30. Make the format living!
     • Is there a chance that libraries outside, or after, the EU Newspaper project will take up and use ENMAP?
     • We believe "yes"!
  31. Uptake of the format
     • Reasons
         • 10 million newspaper pages will be enhanced within EU Newspaper, which means several hundred thousand ENMAP packages
         • 12 libraries from all over Europe will receive data in ENMAP
         • Europeana will naturally use this format for further integration of newspaper information
         • Software tools are available to support generating the metadata as well as validating and delivering the data
         • A workflow is available for putting everything together and producing ENMAP packages
         • Documentation and examples will be available
  32. Contribute
     • Further steps
         • You are invited to review ENMAP once it is out (during 2013)
         • You will find a public version on the website of EU Newspaper: http://www.europeana-newspapers.eu/
     • Thank you for your attention!
